If you've used ChatGPT, Claude, Gemini, or any modern AI assistant, you've interacted with a large language model. You've probably noticed they can write code, explain complex topics, debate philosophy, draft emails, and occasionally hallucinate things with convincing confidence.
What you might not know is why they can do this — what's actually happening inside these systems when they produce a response. The standard explanation ("it just predicts the next word") is technically accurate but deeply incomplete. Let's go deeper.
The Foundation: Everything Is a Prediction Problem
At its most fundamental level, a large language model is trained to predict what text comes next given some preceding text. That's it. Feed it "The capital of France is," and it should predict "Paris." Feed it a half-written email, and it should predict a plausible completion.
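The core idea can be sketched with a toy bigram predictor: count, in a miniature corpus, which word tends to follow which. The corpus and every value here are made up for illustration, and real models learn far richer statistics than adjacent-word counts, but the output is the same kind of object: a probability distribution over what comes next.

```python
from collections import Counter, defaultdict

# A miniature "training corpus" (purely illustrative).
corpus = "the capital of france is paris . the capital of italy is rome .".split()

# Count which word follows which: the simplest possible next-word model.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most probable next word given the previous word."""
    counts = following[word]
    total = sum(counts.values())
    probs = {w: c / total for w, c in counts.items()}  # counts -> probability distribution
    return max(probs, key=probs.get)

guess = predict_next("capital")
```

A real LLM replaces the count table with a neural network conditioned on thousands of preceding tokens, but it is still, at bottom, producing exactly this kind of distribution.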
This sounds too simple to explain GPT-4's apparent reasoning abilities. But the key insight — one that surprised even many researchers — is that when you train a sufficiently large model on sufficiently large amounts of text, prediction accuracy requires the model to learn an extraordinary amount about the world implicitly.
To reliably predict the next word in a medical textbook, the model needs to "understand" physiology. To predict the next line in a Python tutorial, it needs to understand code execution. To predict the next sentence in a persuasive essay, it needs to understand rhetoric and argumentation. You can't cheat your way to good predictions on diverse, complex text without developing something functionally resembling world knowledge.
What "Training" Actually Means
Modern LLMs are trained on massive corpora of text — books, websites, academic papers, code repositories, forums, and more. The training dataset for a model like GPT-4 likely encompasses trillions of tokens, representing a significant fraction of the publicly available text that existed on the internet up to its cutoff date.
Training involves running text through the model, comparing its predictions to the actual next word, measuring the error, and adjusting billions of internal numerical parameters to reduce that error. Repeat this, batch after batch, across hundreds of billions of text examples. The process is extraordinarily compute-intensive — training a frontier model can cost tens of millions of dollars in cloud computing resources.
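The predict-measure-adjust loop can be shown in miniature. This is a heavily simplified sketch, not how any real LLM is trained: it fits a single table of logits to one (context, next-token) example by gradient descent on cross-entropy loss, with all sizes and token ids invented.

```python
import numpy as np

vocab = 5                                  # toy vocabulary size
logits_table = np.zeros((vocab, vocab))    # parameters: one row of logits per previous token

def loss_and_grad(prev_id, next_id):
    """Cross-entropy loss for predicting next_id after prev_id, plus its gradient."""
    logits = logits_table[prev_id]
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # softmax: logits -> probability distribution
    loss = -np.log(p[next_id])             # low when next_id gets high probability
    grad = p.copy()
    grad[next_id] -= 1.0                   # d(loss) / d(logits)
    return loss, grad

prev_id, next_id = 1, 3                    # a single (context, next-token) training example
lr = 1.0
first_loss, _ = loss_and_grad(prev_id, next_id)
for _ in range(50):                        # the predict / measure / adjust loop
    _, grad = loss_and_grad(prev_id, next_id)
    logits_table[prev_id] -= lr * grad     # adjust parameters to reduce the error
final_loss, _ = loss_and_grad(prev_id, next_id)
```

The same loop, scaled to billions of parameters and run over an enormous corpus, is what "training" means in the paragraphs above.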
The model itself is a neural network — specifically, almost all modern LLMs are based on a design called the Transformer, introduced in a 2017 Google paper titled "Attention Is All You Need." The Transformer architecture's core innovation is a mechanism called self-attention, which allows the model to weigh how relevant each word in a sequence is to every other word when building its representation. This lets the model capture long-range dependencies — understanding that "it" in a sentence refers to a noun mentioned ten words earlier, for instance.
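Scaled dot-product self-attention, the mechanism just described, can be sketched in NumPy. The matrix sizes and random weights are arbitrary toy values; real Transformers add multiple heads, masking, and learned per-layer projections on top of this core.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors X.

    Returns the new representations and the attention weights, where row i
    says how relevant each word in the sequence is to word i.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every word scored against every word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights                      # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d = 6, 4                                    # 6 "words", 4-dimensional vectors
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Because every position scores every other position, a pronoun's representation can draw directly on a noun ten words back — the long-range dependency capture described above.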
Scale Changes Everything
The surprising empirical finding of the last several years is that model capability doesn't improve linearly as you add parameters and training data — it improves in jumps. Researchers call these "emergent abilities." Abilities that were essentially absent in smaller models appear somewhat suddenly as scale crosses certain thresholds.
Chain-of-thought reasoning — the ability to work through a problem step by step — appears to emerge at scale. Multilingual capability without explicit multilingual training emerges at scale. The ability to follow complex instructions improves dramatically with scale.
This was not predicted from first principles. It was observed empirically, and it means that we don't fully understand why LLMs can do what they do. That's not a comfortable admission, but it's an honest one.
Context Windows: The Model's Working Memory
Every LLM has a context window — the amount of text it can "see" at once when generating a response. The original GPT-3 had a context window of 2,048 tokens (roughly 1,500 words). Modern models support far more: GPT-4o handles 128,000 tokens, Claude 3.5 Sonnet handles 200,000, and Google's Gemini 1.5 Pro reached 1 million.
Everything in the context window — the system prompt, the conversation history, the document you pasted in — influences the model's output. Nothing outside the context window is accessible (without retrieval tools). This is why very long conversations can cause models to "forget" things said at the beginning once those messages scroll out of the window.
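The scroll-out behavior can be sketched with a hypothetical helper that keeps only the most recent messages that fit a token budget. Word count stands in for token count here; a real system would use the model's tokenizer.

```python
def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit in the context window.

    Word count is a crude stand-in for token count. Oldest messages are
    dropped first, which is why long chats "forget" their beginnings.
    """
    kept, used = [], 0
    for msg in reversed(messages):         # walk from newest to oldest
        cost = len(msg.split())
        if used + cost > max_tokens:
            break                          # everything older scrolls out of the window
        kept.append(msg)
        used += cost
    return list(reversed(kept))            # restore chronological order

history = ["my name is Ada", "nice to meet you Ada", "what is my name"]
trimmed = fit_to_window(history, max_tokens=9)
```

With a budget of 9 "tokens", the opening message is exactly what gets dropped — so the model answering "what is my name" never sees the name.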
The model has no persistent memory between separate conversations. Every new chat is a fresh context. The apparent "personality" and consistent behavior you observe is encoded in the model's weights, not remembered from your previous sessions.
Instruction Tuning and RLHF: Making Models Useful
Raw language model training produces a model that's good at completing text but not at following instructions or being helpful in a chat interface. A model trained purely on internet text might complete "How do I make a bomb?" with a helpful continuation, because the internet contains such text.
Two additional training phases transform a base model into an assistant:
Instruction tuning (Supervised Fine-Tuning, or SFT): The model is trained on a dataset of (instruction, good response) pairs. This teaches the model to adopt an assistant-like stance — answering questions, following directions, explaining things clearly.
Reinforcement Learning from Human Feedback (RLHF): Human raters compare pairs of model responses and indicate which is better. These preferences train a reward model, which then guides further training. RLHF is a major reason modern LLMs are more helpful, less offensive, and better at following nuanced instructions than their base model counterparts.
Constitutional AI (Anthropic's approach for Claude) and Direct Preference Optimization (DPO) are newer variants that achieve similar goals with different mechanics.
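The pairwise-comparison step at the heart of RLHF can be sketched as a Bradley-Terry style loss on reward-model scores. This is a simplified illustration, not any lab's actual implementation, and the numeric scores are invented.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss for training a reward model from human preferences.

    Low when the reward model scores the human-preferred response higher
    than the rejected one, high otherwise.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

agree = preference_loss(2.0, -1.0)      # reward model agrees with the rater: small loss
disagree = preference_loss(-1.0, 2.0)   # reward model disagrees: large loss, strong signal
```

Minimizing this loss over many human comparisons is what turns scattered "A is better than B" judgments into a single scoring function that can steer further training.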
What LLMs Cannot Do
Understanding the limits is as important as appreciating the capabilities:
They don't "know" things in the human sense. LLMs store statistical patterns over text. They have no grounded understanding of the physical world, no sensory experience, no mental model that corresponds to reality the way a human's does. Their "knowledge" is an emergent property of having been trained to predict text from sources that described the world.
They can't reason reliably about novel logical problems. Complex multi-step reasoning still trips up even frontier models. Chain-of-thought prompting improves this significantly, but fundamental reasoning limitations remain.
They hallucinate. Because the model is fundamentally a prediction system, it will sometimes predict confident-sounding text that is factually incorrect — especially on obscure topics, recent events past the training cutoff, or requests that require precise factual recall. Retrieval-augmented generation (RAG) is one engineering approach to mitigate this.
They have no real-time awareness. Unless connected to external tools or the web, an LLM's knowledge ends at its training cutoff. It doesn't know what happened yesterday.
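The RAG mitigation mentioned above can be sketched in miniature: retrieve the passages most relevant to the question and prepend them to the prompt so the model can ground its answer. Word overlap stands in for the embedding similarity a real system would use, and the documents and prompt template are invented for illustration.

```python
import re

def words(text):
    """Lowercased word set — a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query and keep the top k."""
    q = words(query)
    return sorted(documents, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(query, documents):
    """Prepend retrieved passages so the model answers from them, not from memory."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The Eiffel Tower is 330 metres tall.",
    "Photosynthesis converts light into chemical energy.",
    "The Eiffel Tower was completed in 1889.",
]
prompt = build_prompt("How tall is the Eiffel Tower?", docs)
```

Production systems swap the word-overlap ranking for a vector database of embeddings, but the shape is the same: retrieval supplies fresh or obscure facts that the model's frozen weights cannot.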
Why This Matters Beyond the Hype Cycle
LLMs are genuinely a significant technological development — not because they're intelligent in the philosophical sense, but because they've dramatically lowered the cost of generating and processing text at scale. That has real implications for software development, content creation, customer service, medical documentation, legal research, and education.
The more precisely you understand what these systems are — statistical pattern-matching engines trained on human text, capable of remarkable mimicry, with specific and predictable failure modes — the better equipped you are to use them well and to evaluate the extraordinary claims that surround them.
They are powerful tools. Not oracles. Not thinking beings. Tools — and increasingly, some of the most useful ones we have.
