
The Small Model Hypothesis

There is a particular kind of optimism that precedes a graveyard.

I spent a week building evaluation harnesses—twenty-two versions, twelve experimental runs, three model sizes—trying to prove a simple thesis: that a small language model, equipped with the right algorithmic tool, could outperform its unaugmented self on tasks requiring physical reasoning. The tool was ucon, a dimensional analysis library. The tasks were nursing dosage calculations—weight-based infusions, IV drip rates, concentration conversions. The hypothesis was that the model didn't need to reason about units; it only needed to extract the relevant quantities and let the algorithm handle the physics.

I called this the "cognitive prosthesis" pattern.

The results were humbling.


The hypothesis

Language models are unreliable at multi-step arithmetic. This is not controversial. Consider a chain like:

"Multiply the patient's weight by the dose rate, convert micrograms to milligrams, adjust for the infusion duration"

A 1.7-billion-parameter model asked to compute a weight-based medication dose will frequently produce a confident, wrong answer. The failure mode is not confusion; it is hallucination dressed as competence.

But here is the thing: the reasoning required for these problems is not complex. It is mechanical. Dimensional analysis is a formal system with well-defined rules. Units cancel. Conversion factors chain. The target dimension constrains the solution space. A correct algorithm will solve these problems with perfect accuracy, every time.

The hypothesis, then, was about division of labor.

If you separate the problem into two phases, extraction (identifying the quantities, units, and relationships in a natural language problem) and computation (applying dimensional analysis to produce the answer), you can assign each phase to the system best suited to handle it. Language models are pattern matchers. They excel at extraction. Algorithms are formal systems. They excel at computation.

A small model that cannot reliably compute (70 kg) × (5 mcg/kg/min) × (60 min/hr) × (1 mg/1000 mcg) might still be able to extract the components and pass them to something that can.
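With the quantities in hand, that chain is plain arithmetic; a few lines of Python make the point:

```python
# Weight-based infusion: 70 kg patient at 5 mcg/kg/min, answer wanted in mg/hr.
weight_kg = 70
dose_rate_mcg_per_kg_min = 5
min_per_hr = 60
mcg_per_mg = 1000

# Units cancel left to right: kg, min, and mcg all drop out, leaving mg/hr.
dose_mg_per_hr = weight_kg * dose_rate_mcg_per_kg_min * min_per_hr / mcg_per_mg
print(dose_mg_per_hr)  # 21.0 mg/hr
```

The computation is trivial for any deterministic system. The hard part, as it turns out, is getting those four numbers out of the prose.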

The cognitive prosthesis is the tool that handles, reliably, what the model cannot.


The experiment

I chose nursing dosage calculations because they are safety-critical, dimensionally complex, and well-documented. The problem set included twenty-five scenarios with problems like:

  • A patient weighing 176 lb needs dopamine at 5 mcg/kg/min. The concentration is 400 mg in 250 mL. Calculate the infusion rate in mL/hr.
  • Ordered: Amoxicillin 25 mg/kg/day divided q8h for a child weighing 44 lb. Calculate the dose per administration.
  • An IV of 1000 mL is to infuse over 8 hours using tubing with a drop factor of 15 gtt/mL. Calculate the drip rate in gtt/min.

Each problem requires extracting multiple quantities, identifying conversion factors (some explicit, some implicit in clinical shorthand like "q8h" meaning every 8 hours), and chaining them correctly to reach a target unit.

The algorithmic solver was straightforward. DimensionalSolver parses ratios, identifies which units need to cancel, uses breadth-first search to find conversion paths between unit systems, and assembles a factor chain that transforms the starting quantity into the target unit. On manually extracted inputs, it achieves 100% accuracy. The solver is not the problem.
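The post doesn't show DimensionalSolver's internals; as a sketch of the breadth-first-search step over a conversion graph, with the factor table assumed and compound units omitted:

```python
from collections import deque

# Hypothetical conversion graph: an edge u -> v with factor f means 1 u = f v.
# The real solver handles compound dimensions; this sketch covers only
# single-unit paths.
FACTORS = {
    "lb": {"kg": 0.45359237},
    "kg": {"lb": 2.20462262, "g": 1000.0},
    "g": {"kg": 0.001, "mg": 1000.0},
    "mg": {"g": 0.001, "mcg": 1000.0},
    "mcg": {"mg": 0.001},
}

def convert(value, start, target):
    """BFS for a chain of factors transforming start into target."""
    queue = deque([(start, value)])
    seen = {start}
    while queue:
        unit, acc = queue.popleft()
        if unit == target:
            return acc
        for nxt, factor in FACTORS.get(unit, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, acc * factor))
    raise ValueError(f"no conversion path from {start} to {target}")

print(convert(176, "lb", "kg"))  # ≈ 79.8 kg
```

BFS guarantees the shortest factor chain; a compound rate like mcg/kg/min additionally requires tracking a vector of dimensions, which the sketch leaves out.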

The harness wrapped an Ollama-hosted model (Qwen 0.6B, 1.7B, or 4B) with a system prompt asking it to extract four fields: starting value, starting unit, target unit, and a list of given ratios. The extraction would be passed to the solver, and the result compared against known correct answers.
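The post names the four fields but not the harness's exact schema; a minimal sketch of the extraction contract, with field names and prompt wording assumed:

```python
import json
from dataclasses import dataclass, field

# Field names mirror the four fields described above; the harness's
# actual schema and prompt wording are assumptions.
@dataclass
class Extraction:
    value: float
    start_unit: str
    target_unit: str
    given_ratios: list[str] = field(default_factory=list)

SYSTEM_PROMPT = (
    "Extract from the problem, as JSON with keys value, start_unit, "
    "target_unit, given_ratios. given_ratios is a list of every ratio "
    'stated in the problem, e.g. "400 mg / 250 mL". Return JSON only.'
)

# What a well-behaved model reply would look like for the dopamine problem:
reply = (
    '{"value": 176, "start_unit": "lb", "target_unit": "mL/hr",'
    ' "given_ratios": ["5 mcg/kg/min", "400 mg / 250 mL"]}'
)
extraction = Extraction(**json.loads(reply))
print(extraction.given_ratios)  # ['5 mcg/kg/min', '400 mg / 250 mL']
```

Everything downstream of this dataclass is deterministic; everything upstream of it is where the experiment lived or died.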

The control condition was the model alone, without the prosthesis. The treatment condition was the model plus solver.

I expected the treatment condition to show dramatic improvement.


The graveyard

Twenty-two harness versions. Twelve evaluation runs. A progression from naive tool-calling to template-based classification to extraction-focused prompts.

Versions 1–7 attempted standard tool-calling: present the model with a JSON schema, ask it to invoke the appropriate function. The 0.6B model could not reliably produce valid JSON. The 1.7B model produced JSON but selected tools arbitrarily. Even trivial conversions—100 pounds to kilograms—failed more than 60% of the time.

Versions 8–15 abandoned tool-calling for explicit templates. A problem classifier identified the scenario type (IV drip rate, weight-based infusion, divided dose) and selected a pre-built factor template with slots for the model to fill. This helped. Accuracy rose to 35–52% for the smaller model, with high variance. But models still extracted wrong values, confused units, and dropped components from compound ratios.
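No template is reproduced in the post; a hypothetical slot template for the drip-rate scenario shows the idea: the factor chain is fixed, and the model supplies only numbers.

```python
# Hypothetical pre-built chain for the "IV drip rate" scenario type.
# The classifier selects this function; the model fills the three slots.
def iv_drip_rate(volume_ml: float, hours: float, drop_factor_gtt_per_ml: float) -> float:
    ml_per_hr = volume_ml / hours                 # 1000 mL over 8 hr -> 125 mL/hr
    gtt_per_hr = ml_per_hr * drop_factor_gtt_per_ml
    return gtt_per_hr / 60                        # gtt/min

print(iv_drip_rate(1000, 8, 15))  # 31.25
```

With the chain hard-coded, the model's only failure modes are wrong slot values, which is exactly where the 35–52% accuracy came from.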

Versions 16–22 simplified further. The system prompt asked only for extraction: "Identify the starting value, starting unit, target unit, and any given ratios." The solver handled everything else. This was the purest test of the hypothesis: LLM does extraction, algorithm does reasoning.

Results, treatment condition (model + prosthesis):

  Model       Accuracy   First-Try Accuracy
  Qwen 0.6B     52%          16%
  Qwen 1.7B     44%           4%
  Qwen 4B        0%           0%

The 4B consistently forwent tool use entirely (we can discuss my prompting another day), but the pattern was already clear.

The prosthesis worked perfectly. The extraction did not.


What failed

The solver, given correct input, never failed. I verified this with a separate test suite: fifteen problems, manually extracted, 100% accuracy. The algorithm is sound.

The model, asked to extract structured data from natural language, failed in predictable ways.

Compound unit parsing. "5 mcg/kg/min" should become a ratio with a numerator of mcg and denominators of kg and min. Models frequently extracted mcg/kg as one unit and min as a separate, unrelated field. Tokenization treats / as a separator, not a grouping operator. The dimensional structure is lost before the model even begins reasoning.
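For illustration, the split the models failed to make is a few lines of string handling once the rate expression is isolated (a toy parser, not ucon's):

```python
def parse_rate(expr: str):
    """Parse e.g. '5 mcg/kg/min' -> (5.0, ['mcg'], ['kg', 'min']).

    The first unit is the numerator; every unit after a '/' is a
    denominator. Real clinical text needs far more normalization.
    """
    value, units = expr.split(maxsplit=1)
    numerator, *denominators = units.split("/")
    return float(value), [numerator.strip()], [d.strip() for d in denominators]

print(parse_rate("5 mcg/kg/min"))  # (5.0, ['mcg'], ['kg', 'min'])
```

The grouping rule is trivial to state and trivial to code, yet it sits below the granularity at which the model's tokenizer sees the text.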

Semantic extraction. "Over 8 hours" must become a ratio: 1 hr / 8 hr or equivalently 0.125 hr/hr. "q8h" (every 8 hours) means 3 doses per day, which must become 1 day / 3 doses or 8 hr / dose. Models did not consistently apply these conversions. They extracted the number 8 and attached it to hours, losing the reciprocal relationship.

List comprehension. The given_ratios field is a list. Problems often contain multiple ratios: a dose rate, a concentration, a drop factor, a time constraint. Models frequently captured only the first ratio, or the most prominently stated one, and silently dropped the rest.

Value confusion. A problem stating "a patient weighing 176 lb needs medication at 5 mcg/kg/min" contains at least two numeric values. Models sometimes extracted the wrong one, e.g. 176 as the dose rate and 5 as the weight. Context that is obvious to a human reader was not reliably parsed.

The hypothesis assumed extraction accuracy of 80% or higher. Observed extraction accuracy was 44–52%. The prosthesis multiplied the model's capability by exactly 1.0.


What this means

The experiment failed to prove its intended thesis. It proved something else instead.

The cognitive prosthesis pattern is sound. An algorithmic tool that handles formal reasoning can extend a model's capabilities, but only if the interface between model and tool is robust. The interface is extraction. Extraction is the bottleneck.

This has implications for how we design AI systems that interact with the physical world.

The tool is not enough. A perfectly correct dimensional solver, exposed as an MCP tool or function call, does not automatically make an agent competent at dimensional analysis. The agent must also be competent at formulating correct tool calls. For small models, this competence cannot be assumed.

Extraction is a first-class problem. The literature on LLM tool use focuses heavily on the tools themselves: their schemas, their capabilities, their error handling. Less attention is paid to the upstream problem of correctly invoking those tools. For safety-critical applications, extraction accuracy may matter more than tool accuracy.

Larger models may not help. The 4B model performed worse than the 1.7B model (though configuration issues cloud this result). Scale does not automatically improve structured extraction. The failure mode is not "insufficient knowledge" but "inconsistent attention to format constraints." Instruction-following, not parameter count, is the limiting factor.

The prosthesis must compensate for its connector. If the model cannot reliably extract, the tool must be robust to extraction errors. Fuzzy matching on unit names. Validation of dimensional consistency before computation. Structured error responses that help the model self-correct. The cognitive prosthesis must include not just the reasoning capability the model lacks, but also the guardrails that help the model use it correctly.
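The post doesn't show ucon's fuzzy matching; as a minimal sketch of the idea using the standard library's difflib, with a hypothetical known-unit list:

```python
from difflib import get_close_matches

# Hypothetical vocabulary; a real library would carry far more units.
KNOWN_UNITS = ["mcg", "mg", "g", "kg", "lb", "mL", "L", "hr", "min", "gtt"]

def normalize_unit(raw: str) -> str:
    """Map a model-produced unit string onto a known unit, tolerating
    near-misses like 'mcgs'. Raises rather than guessing wildly."""
    if raw in KNOWN_UNITS:
        return raw
    matches = get_close_matches(raw, KNOWN_UNITS, n=1, cutoff=0.6)
    if not matches:
        raise ValueError(f"unrecognized unit: {raw!r}")
    return matches[0]

print(normalize_unit("mcgs"))  # 'mcg'
```

The cutoff is the design choice: too low and the guardrail silently accepts garbage, too high and it rejects recoverable typos. In a safety-critical domain, erring toward the ValueError is the right default.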

This last point connects back to ucon 0.7.0's error protocol—ConversionError with likely_fix and hints, dimensional diagnostics, compatible unit suggestions. Those features were designed for agent self-correction. The small model experiments revealed why they matter: the agent will make mistakes, and the tool must be prepared to catch them.


The division of labor, revisited

A cognitive prosthesis is not a crutch. It is a recognition that intelligence is not monolithic.

Language models are general-purpose reasoners trained on text. They learn patterns, associations, statistical regularities. They do not learn formal systems in the way a compiler knows its grammar. When a formal system is required (dimensional analysis, symbolic mathematics, constraint satisfaction) the model's pattern-matching approximates but does not replicate the formal guarantees.

A prosthesis provides the formal guarantees. The model provides the interface to natural language. Together, they form a system more capable than either alone.

But the coupling matters. A prosthesis poorly coupled to its host compensates for nothing. The small model experiments revealed that coupling (i.e. extraction, formatting, schema compliance) is harder than it looks.

The path forward is not to abandon the pattern but to strengthen the interface. Better extraction prompts. Verification loops where the model checks its own output. Fallback mechanisms that retry with different framings. And tools that anticipate failure, explain it clearly, and offer structured paths to recovery.

The graveyard of twenty-two harness versions was not wasted effort. It was the process of learning where the seams are.


What comes next

The small model hypothesis is not dead. It is refined.

The original claim—that small models with tools can match or exceed larger models without tools—requires a more nuanced statement: small models with tools can match larger models when the extraction interface is robust enough to reliably invoke those tools.

The next experiments will focus on that interface:

  • Verification loops. Ask the model to extract, then ask it to verify its own extraction against the original problem. Does self-checking improve accuracy?
  • Constrained generation. Rather than free-form extraction, provide a structured template with explicit slots. Does constraint reduce error?
  • MCP integration. In agentic contexts, models have multiple turns to correct errors. Does the ConversionError protocol enable effective self-correction over a conversation?

The cognitive prosthesis remains the right abstraction. The intelligence should live in the tool, not the caller. But intelligence alone is not enough. The tool must also be accessible: robust to the errors its callers will inevitably make, structured in its responses so recovery is possible.

This is, I think, the deeper lesson. Not that small models are incapable, but that capability is a property of systems, not components. A model that cannot compute dimensional analysis but can invoke a tool that can (reliably, accurately, with error recovery) is, for practical purposes, a model that can do dimensional analysis.

The gap between "cannot" and "can, with help" is the space where cognitive prostheses live.

The work is to make that space navigable.


*Hoping to publish the evaluation harness and dimensional solver in the near future. Source: github.com/withtwoemms/ucon.*