When you ask ChatGPT a question, you’re not watching it think through the problem from start to finish. You’re watching it predict one word at a time, guided by mathematical levers that control how adventurous or cautious those predictions are. Understanding how LLMs generate text means understanding three core mechanisms: tokens (the building blocks), probability distributions (the options), and the parameters that shape which option wins each time.
What Are Tokens? The Real Currency of Language Models
Here’s the first surprise: LLMs don’t process words. They process tokens.
A token is a chunk of text smaller than a word. When you type “unbelievable,” the model doesn’t see one unit—it sees three: un, believ, able. Not every word is three tokens; some simple words are one token. The number varies by language and model, but on average, one token ≈ 0.75 words.
This matters because everything in an LLM is measured in tokens, not words. When you hear that a model has a “4K context window,” that’s 4,000 tokens—roughly 3,000 words. Modern models have much larger windows: 100,000 to over 1,000,000 tokens. That extra room matters. It means the model can “see” longer documents, longer conversations, and more complex contexts at once.
Tokenization also creates a hard boundary. Text beyond your context window is ignored. If you paste in a 200,000-word document but your model has a 100,000-token limit, the second half disappears. The model never knows it was there.
Next-Token Prediction: How the Model Makes Its Choice
Every time an LLM generates text, it’s running the same process: given everything written so far, predict the next token.
Here’s how it works. The model processes all tokens in the input (your prompt or the conversation so far). Then it outputs a probability distribution—essentially a ranked list of likelihoods for every token in its vocabulary. GPT-style models typically have vocabularies of 50,000 tokens. The probability distribution assigns a score between 0 and 1 to each token. Token “the” might score 0.15. Token “hello” might score 0.03. Token “xyzplk” might score 0.0000001.
The model picks the next token from this distribution. By default, it picks the highest-probability token—a greedy strategy. But here’s where the controls come in.
Temperature: Tuning the Curve of Randomness
Temperature is a single number that shapes the probability distribution. Think of it as controlling whether the model plays it safe or takes creative risks.
Temperature = 0 (Deterministic)
The distribution becomes a spike. The highest-probability token wins every time. You get identical output every time you run the same prompt. It’s reliable and auditable but repetitive and brittle.
Temperature = 1 (Default)
The distribution retains its natural shape. Lower-probability tokens get a fair chance. Outputs vary from run to run. You get natural-sounding diversity without randomness taking over.
Temperature > 1 (Flattened)
Lower-probability tokens become much more likely. The model takes bigger creative risks—and bigger risks of nonsense. Output becomes unpredictable. At extreme temperatures (2.0 or higher), hallucinations spike.
Temperature < 1 but > 0 (Sharpened)
The distribution becomes sharper, but not deterministic. The model becomes more conservative, more confident in its highest-probability picks. Outputs are more focused.
Real-world example: if you’re generating customer service replies, you’d use low temperature (0.3–0.7) for consistency. If you’re brainstorming creative slogans, you’d crank it up (0.8–1.2). If you’re doing math problems where there’s one right answer, temperature = 0 prevents silly detours.
Top-K and Top-P: Cutting Off the Long Tail
Temperature alone doesn’t fully control sampling. Two more parameters shape which tokens the model even considers.
Top-K sampling says: “Only look at the K most likely tokens. Ignore everything else.” If K = 50, the model samples only from the 50 highest-probability tokens and discards the rest. This prevents the model from occasionally spitting out a token with a 0.0001% chance. It feels more coherent but can suppress diversity.
Top-P (nucleus sampling) is smarter. Instead of a fixed K, it says: “Include enough tokens to cover P% of the probability mass.” If P = 0.9, the model includes tokens until their cumulative probability reaches 90%. The other 10% is jettisoned. This adapts to each step. Sometimes the top 10 tokens cover 90%; sometimes you need the top 50. The distribution decides.
Most modern APIs use Top-P by default (0.9 or 0.95) because it’s more adaptive than Top-K.
Putting It Together: A Concrete Example
Imagine you ask your model: “What’s the capital of France?”
The model processes your prompt and builds a probability distribution. Paris has probability 0.87. Lyon has 0.04. Spam has 0.0002.
Temperature = 0, Top-P = 1.0: Always outputs
Paris. Same answer every time.Temperature = 1.0, Top-P = 0.9: Usually outputs
Paris, occasionallyLyon, neverSpam(it’s outside the 90% cutoff).Temperature = 1.5, Top-P = 0.5: Flattens the distribution and only samples from the top tokens covering 50% of probability. More creative guesses, more risk of wrong answers.
No Memory Between Conversations
One more thing: the model has zero built-in memory between separate conversations. Each new prompt starts from scratch. The model only knows what’s in the current context window. If you had a conversation yesterday, today’s chat is blank to the model unless you paste in the old conversation manually. This is why chatbots like ChatGPT let you see and manage conversation history—it’s not automatic. Everything the model needs must be in the active context.
The Security Angle: Trade-offs Between Safety and Naturalness
Higher temperatures produce more natural outputs but also more unpredictable ones. That unpredictability can work both ways. A safety constraint set in the system prompt (like “refuse all requests for harmful information”) becomes harder to enforce at high temperatures—the model might occasionally bypass it. Lower temperatures are more reliable and auditable, but they can feel robotic.
Top-K and Top-P settings also matter. Very permissive settings (large K or high P) allow rare tokens through, which can lead to unexpected outputs. Very restrictive settings (small K or low P) reduce diversity but also reduce the chance of weird failures.
The tradeoff is real: there’s no magic knob that gives you both natural-sounding responses and perfect safety. Engineering prompt behavior requires thinking through these parameters and what you’re actually optimizing for.
Meta description: Learn how LLMs generate text using tokens, temperature, and Top-K sampling. Understand the mechanisms behind ChatGPT’s word-by-word predictions.
Next in this series: What Is Prompt Engineering and Why Does It Matter?


