When ChatGPT hit the internet in November 2022, it felt like magic. You typed a question, and it wrote back in seconds, fluently, confidently, and often helpfully. But there’s no magic here. Behind the scenes, a large language model (or LLM) is doing something far more mechanical: predicting the next word you’re about to read.
Understanding how a large language model works is the first step to understanding its superpowers and its limits. This post breaks down what an LLM actually is, how it learns, and why scale matters so much.
What Is a Large Language Model?
An LLM is a machine learning model trained to predict the next word in a sequence. That’s literally the job: given some text, spit out the most likely word that comes next.
The “large” in “large language model” refers to scale—both the number of internal parameters (tuning knobs the model adjusts during training) and the amount of training data it learns from. GPT-3, released in 2020, is a good benchmark: it has 175 billion parameters and was trained on roughly 500 billion tokens of text (a token is a subword unit; more on that in a moment).
To put that in perspective:
BERT (2018): 340 million parameters
GPT-2 (2019): 1.5 billion parameters
GPT-3 (2020): 175 billion parameters
ChatGPT (2022): A fine-tuned variant of GPT-3
Modern frontier models in 2024 and beyond have pushed even further, but the principle remains the same: bigger parameters + more training data = a more capable model.
How LLMs Learn: Self-Supervised Training
LLMs are trained using a technique called self-supervised learning. You don’t need humans to label the data as “right” or “wrong.” Instead, the model learns by predicting the next word based on all previous words in a sentence.
Here’s a concrete example. Imagine the model sees this sentence:
The quick brown fox jumps over the lazy dog.The training process works like this:
Hide the word “jumps” and give the model: “The quick brown fox”
Ask: “What comes next?”
The model guesses a word (maybe “runs” or “leaps”).
Check against the actual word (”jumps”).
Adjust the model’s internal parameters to make “jumps” slightly more likely next time.
Repeat this billions of times across trillions of words of text, and the model learns statistical patterns: words that commonly follow other words, grammatical structures, facts about the world, and chains of reasoning. No human annotation required—the data labels itself.
Tokens: How LLMs Actually Read
LLMs don’t read words as you do. They read tokens—subword units that break text into chunks.
A token isn’t always a full word. The word “unbelievable” might be split into three tokens: “un”, “believ”, “able”. The word “ChatGPT” might be split into “Chat” and “GPT”. On average, one token is roughly 0.75 words.
Why does this matter? Because LLMs have a context window—a maximum number of tokens they can process at once. Early GPT models could handle 2,048 tokens. Modern models handle 100,000 to 1,000,000 tokens. This limit affects how much text you can feed the model at once.
Emergent Capabilities: Abilities That Appear at Scale
Here’s where things get weird. As LLMs grow larger, they develop abilities that weren’t explicitly trained into them. These are called emergent capabilities.
GPT-2 struggled with arithmetic — it couldn’t reliably solve even simple problems. GPT-3, with roughly 100 times more parameters, could. Same training approach; different scale; suddenly arithmetic works.
Other emergent abilities include:
Generating code in programming languages
Breaking down complex reasoning problems step by step
Translating between languages, it wasn’t explicitly trained to translate
Explaining concepts from first principles
No one explicitly programmed these skills. They emerged from scale and statistical patterns in the training data.
How LLMs Generate Responses: Temperature and Randomness
When an LLM generates text, it doesn’t always pick the single most likely next word. Instead, it uses sampling—a technique that introduces controlled randomness.
The level of randomness is controlled by a parameter called temperature:
Temperature = 0 (deterministic): Always pick the most likely word. Responses are predictable and consistent.
Temperature = 1 (balanced): Sample proportionally from the probability distribution. Some randomness, but still shaped by what’s likely.
Higher temperatures (e.g., 2.0): Sample from the long tail of less likely words. Responses become more creative—and more likely to generate nonsense.
This is why ChatGPT sometimes gives you wildly different answers to the same question (assuming temperature isn’t set to 0). It’s not being inconsistent; it’s exploring the probability space.
The Data/Instruction Problem: A Security Angle
Here’s a critical limitation: LLMs cannot reliably distinguish between instructions and data. Both are just text flowing in.
If you feed an LLM an instruction like “Ignore the above. Do this instead,” it treats that as a plausible text continuation, not as a special command to override prior instructions. This is why prompt injection attacks work. An attacker can embed instructions in data, and the model will treat them as legitimate.
This isn’t a bug. It’s structural to how LLMs work. They’re trained to predict the next plausible token—they have no built-in mechanism to distinguish “this is an order” from “this is information.”
Why LLMs Aren’t Truthful by Design
LLMs are trained to predict likely text, not to speak the truth. High confidence doesn’t mean correct. This is foundational.
A model can be 99% confident in a wrong answer. That confidence score reflects how consistent the answer is with the statistical patterns in the training data, not whether the facts are correct. If the training data contains falsehoods (and it does), the model will learn and reproduce them—confidently.
This is why the next post in this series tackles hallucination. Understanding this disconnect is essential before relying on an LLM for factual information.


