Thorsten Meyer AI Foundations · 02 / 08
Parameters, tokens, and context window — the dials behind almost every AI behavior
Three questions people ask about AI, all with the same shape of answer:
Why did it forget what I told it ten messages ago?
Why is the bigger model slower, and not obviously smarter?
Why can’t it count the r’s in “strawberry”?
They sound unrelated. They aren’t. Each one points at a different dial — three dials that together govern most of what any model can or can’t do. Learn these three, and most of what looks mysterious about AI stops being mysterious.
The three dials are parameters, tokens, and context window.


Three dials
Think of every model as a machine with three dials.
Parameters — the size of the brain. How much the model was able to compress during training.
Tokens — the unit of thought. Not letters, not words. Something in between.
Context window — the working memory. The scratchpad the model sees while reasoning about your input.
Every surprising behavior I listed above falls out of one of these three. Let’s take them one at a time.

Parameters: the size of the brain
Parameters are the numerical weights inside the model. Frontier models have hundreds of billions of them. More parameters means more capacity to compress the statistical structure of the training data — more patterns the model can fit, more subtleties it can preserve.
Bigger isn’t monotonically better. Past a certain point, diminishing returns set in. Speed drops. Cost rises. A small, well-trained model on a narrow task can outperform a much larger general one. And at the very frontier, the gains from adding more parameters have been getting smaller — the interesting progress now comes from better training recipes, better data, and better post-training, not just from stacking more parameters.
This is why a 2026 70-billion-parameter model can match a 2024 500-billion-parameter one. The parameter count is smaller, but the recipe is better. If you pick models by parameter count alone, you’ll systematically misread the market.
For day-to-day decisions, parameters affect cost and speed more than they affect what you’ll perceive as “intelligence” at the frontier. A giant model is expensive and slow; a mid-tier frontier model is cheap and fast. For most tasks the output is indistinguishable, and the mid-tier is better value. Pick by task fit, not by size.
Tokens: the unit of thought
Tokens are what the model actually sees. Not characters. Not words. Something more granular than words and less granular than letters — chunks of text that are statistically common in the training corpus.
In English, a token is roughly three to four characters on average. “strawberry” might split into ["straw", "berry"] or ["st", "raw", "berry"] depending on the tokenizer. The model sees these chunks as atomic units. It does not see the individual letters inside them.
This is why counting letters is hard for LLMs. The model can’t introspect the characters inside a token — it has to reason about what the token probably contains. Sometimes it gets that right. Often it doesn’t. The “strawberry problem” — asking how many r’s are in the word — is funny because the task is trivial for a human (look, count) and surprisingly hard for a model (infer the letter composition of a chunk it can’t see through).
Tokens also drive cost. You pay per input token and per output token. The same document costs wildly different amounts depending on language: a 500-word passage might be roughly 650 tokens in English, over 1,000 in German (long compound words generate more subtokens), and multiples more in many non-Latin scripts. If your users write in languages other than English, your per-call cost is not what the English pricing page suggests.
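To see how the language gap hits billing, here is a back-of-envelope sketch. The per-million-token prices are made-up placeholders, not any provider's real rates; only the arithmetic matters.

```python
def call_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float = 3.00,
              usd_per_m_output: float = 15.00) -> float:
    """USD cost of one call at the given (illustrative) per-million rates."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# The same 500-word passage, using the rough token counts from the text:
english = call_cost(650, 0)    # ~650 tokens in English
german = call_cost(1_000, 0)   # ~1,000 tokens in German
print(f"German costs {german / english:.2f}x the English call")
```

Same document, same prices, and the German call still costs about 1.5x the English one, purely because of tokenization.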
Operationally: if your task is character-level — counting, transposing, reversing, spelling — you’re working against the grain. The fix is almost never “try a better prompt.” The fix is to give the model tools (a Python cell, a regex, a calculator) or to restructure the task so introspecting token internals isn’t required.
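A minimal illustration of the tool fix: the letter count that is hard through tokens is a one-liner once the model emits code that operates on the raw string.

```python
def letter_count(word: str, letter: str) -> int:
    """Count a letter's occurrences: trivial in code, hard through tokens."""
    return word.count(letter)

print(letter_count("strawberry", "r"))  # -> 3
```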

Context window: the working memory
The context window is the amount of text the model can consider in a single pass. It’s shared between input and output — if you have a 200,000-token window and you’ve used 180,000 tokens of input, you have 20,000 tokens of budget left for the response.
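The budget arithmetic is trivial, but worth making explicit, because applications that ignore it get truncated responses:

```python
def output_budget(window: int, input_tokens: int) -> int:
    """Tokens left for the response once the input occupies the window."""
    return max(window - input_tokens, 0)

print(output_budget(200_000, 180_000))  # -> 20000
```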
The window is working memory, not long-term memory. Each inference call starts fresh. Nothing persists between calls unless the application re-sends it. Everything you experience as “the model remembering a past conversation” is the application stuffing your past conversation back into the window at every turn.
To make 200,000 tokens concrete: that’s roughly 150,000 English words, or about 300 pages of a novel, or the source of a medium-sized codebase, or a month of a busy Slack channel. Frontier windows are now at a million tokens and climbing. That sounds infinite. It isn’t.
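The conversions in that paragraph are rough heuristics (about 0.75 English words per token, about 500 words per printed page), sketched here so you can plug in your own numbers:

```python
def tokens_to_words(tokens: int) -> int:
    """~0.75 English words per token, a rule of thumb, not tokenizer math."""
    return int(tokens * 0.75)

def tokens_to_pages(tokens: int, words_per_page: int = 500) -> int:
    """Approximate printed pages, assuming ~500 words per page."""
    return tokens_to_words(tokens) // words_per_page

print(tokens_to_words(200_000))  # -> 150000
print(tokens_to_pages(200_000))  # -> 300
```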
When a model “forgets” something you said ten messages ago, one of two things is usually happening. Either the application truncated your history because the window filled up. Or the information is buried in the middle of a very long context, where models reliably perform worse than they do at the start or the end — the “lost in the middle” effect. A bigger window doesn’t automatically buy you better recall across that window. Window size and attention quality are different things.
Operationally: before stuffing a huge document into context, test whether the model is actually using the middle of it. Often you’ll get better answers with retrieval — pulling in only the relevant chunks — than with a maximal-context approach.
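As a sketch of what "pulling in only the relevant chunks" can mean at its simplest, here is keyword-overlap retrieval. Real systems typically score chunks with embeddings; plain word overlap is the cheapest stand-in, and every name and example string here is illustrative.

```python
import re

def _words(text: str) -> set[str]:
    """Lowercase alphanumeric words, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks sharing the most words with the question."""
    q = _words(question)
    return sorted(chunks, key=lambda c: len(q & _words(c)), reverse=True)[:k]

chunks = [
    "The invoice is due on March 14.",
    "Our office dog is named Biscuit.",
    "Payment terms: net 30 from invoice date.",
]
print(top_chunks("When is the invoice due?", chunks, k=1))
```

Only the selected chunks go into the window, so the relevant fact sits near the top of a short context instead of buried in the middle of a long one.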
The three dials together
Most AI decisions collapse into three questions, one per dial.
Which model? At the frontier, parameters drive cost and speed more than they drive capability. Pick by task fit.
How do I prompt it? Tokens are the unit the model sees. Don’t fight them. If the task is character-level, add tools instead of trying harder prompts.
How do I structure context? Window size sets the ceiling on what can fit. Attention quality across that window sets the floor on what the model will actually use. Retrieval often beats context-stuffing.
Three dials, three questions. Almost every operational decision about a model falls into one of them.
Why benchmarks mislead
When you read a benchmark number, you’re reading a weighted sum of all three dials plus a fourth — inference efficiency, which governs how fast and how cheaply a model runs. Different labs optimize for different combinations. Some push parameters. Some push long-context quality. Some push enormous windows. Open-weight players tend to push efficiency and portability.
This is why “Model X beats Model Y” claims almost never survive contact with your actual workload. The benchmark was tuned on particular task types, at particular context lengths, with particular tokenization. Two models with near-identical benchmark scores can feel radically different in production, because the dials that dominated the benchmark are not the dials that dominate your use case.
The useful move isn’t to trust public leaderboards. It’s to build a small, private benchmark from five or six tasks that actually matter to you, and run every model you’re considering against it. The answer you get is worth more than any public score.
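A private benchmark can start as small as this sketch. `ask_model` stands in for whatever client call you actually use, and exact-match scoring only suits tasks with a single right answer; both are assumptions, not a prescription.

```python
def run_benchmark(ask_model, cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer matches exactly."""
    hits = sum(1 for prompt, expected in cases
               if ask_model(prompt).strip() == expected)
    return hits / len(cases)

cases = [
    ("How many r's in 'strawberry'?", "3"),
    ("What is 17 * 24?", "408"),
]

# Stand-in for a real model client, purely so the harness runs end to end:
fake_model = lambda p: "3" if "strawberry" in p else "408"
print(run_benchmark(fake_model, cases))  # -> 1.0
```

Five or six cases like these, drawn from your real workload, will tell you more about two candidate models than their public leaderboard scores will.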
Next in Thorsten Meyer AI Foundations: numbers tell you what a model can do mechanically. They don’t tell you what it’s actually good at. Capability is jagged, not graded — and the shape of that jaggedness is more useful than any benchmark.