April 15, 2026

A Comprehensive Guide to Large Language Models

By Synthex

Based on the insights of Andrej Karpathy.

Primary Source: Andrej Karpathy: Intro to Large Language Models

Hello there.

When we interact with an AI like ChatGPT, it often feels like magic. It answers questions, writes code, and even cracks jokes. But this "magic" is actually the result of a rigorous, three-stage industrial process.

It is not a singular mind; it is a statistical architecture built to simulate human thought.

To truly understand these tools—their brilliance, their hallucinations, and their strange limitations—we must understand how they are made.

This guide breaks down the construction of a modern Large Language Model (LLM) into its three distinct phases: Pre-training, Supervised Fine-Tuning, and Reinforcement Learning.

Drawing on the foundational analysis of AI researcher Andrej Karpathy, we will explore the engineering challenges, the data pipelines, and the emergent behaviors that define the current state of Artificial Intelligence.

1.0 The Big Picture: The Three Stages

Building an LLM is not about "programming" in the traditional sense. It is about training.

We do not write rules for the AI; we curate data and let the AI learn the patterns.

The process follows a specific sequence:

1. Pre-training: The massive, expensive phase where the model learns knowledge and how to speak. It becomes a "Base Model."
2. Supervised Fine-Tuning (SFT): The phase where the model learns behavior. It transitions from a document simulator to a conversational assistant.
3. Reinforcement Learning (RL): The refinement phase where the model learns reasoning and optimization through trial and error.

2.0 Stage 1: Pre-training (The Knowledge Foundation)

This is the heavy lifting. Pre-training is computationally the most expensive part of the process: a single run can take months on thousands of GPUs.

The goal here is simple but ambitious: to compress a significant portion of the internet into a single neural network.

2.1 Data Acquisition: The Digital Library

Training begins by collecting massive quantities of text. Organizations like Common Crawl have been indexing the web for years, creating a database of billions of pages (2.7 billion as of 2024).

The Scale: A production-quality dataset, like "FineWeb," contains roughly 44 terabytes of data and 15 trillion tokens.

Analogy: Imagine downloading the world's largest, most specialized digital library, filled only with the best textbooks, journals, and articles, and then compacting it onto a single large hard drive.

2.2 The Industrial Sieve: Filtering and Processing

Raw internet data is messy. It contains spam, malware, code snippets, and irrelevant noise. Before it can be used, it must be aggressively filtered.

URL Filtering: Blocking sites known for malware, spam, racist content, or adult content.
Text Extraction: Stripping away the HTML scaffolding (the code that makes websites look good) to isolate only the core text content.
Language Filtering: Keeping only documents primarily in the desired language (e.g., pages that a language classifier scores as at least 65% likely to be English).
PII Removal: Detecting and eliminating personally identifiable information like addresses or Social Security numbers.

Analogy: This is like a massive industrial sieve. Before the information can be used, you must carefully remove all the garbage, the scaffolding (HTML), and the foreign materials, leaving behind only the pure, usable textual ingredients.
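The four filtering steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the pipeline any real dataset uses: the blocklist, the stopword-based language score, the 0.2 threshold, and the two PII patterns are hypothetical stand-ins for the far larger lists and trained classifiers used in practice.

```python
import re
from html.parser import HTMLParser

# Hypothetical blocklist; real pipelines rely on large curated lists.
BLOCKED_DOMAINS = {"spam.example", "malware.example"}

# Crude stand-in for a language classifier: a few very common English words.
ENGLISH_STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "it", "that", "on"}

# Illustrative PII patterns only (US Social Security numbers, email addresses).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def url_allowed(url):
    """URL filtering: reject pages hosted on a blocked domain."""
    domain = url.split("//")[-1].split("/")[0].lower()
    return domain not in BLOCKED_DOMAINS


def extract_text(html):
    """Text extraction: drop the markup and normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def english_score(text):
    """Language filtering: share of words that are common English words."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in ENGLISH_STOPWORDS for w in words) / len(words)


def scrub_pii(text):
    """PII removal: mask anything that matches the illustrative patterns."""
    text = SSN_PATTERN.sub("[SSN]", text)
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def process_page(url, html, min_english=0.2):
    """Return cleaned text, or None if the page is filtered out."""
    if not url_allowed(url):
        return None
    text = extract_text(html)
    if english_score(text) < min_english:
        return None
    return scrub_pii(text)
```

Calling process_page on a page from a blocked domain, or on a page with almost no common English words, returns None. Everything else comes back as plain text with the matched identifiers masked.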

2.3 Tokenization: The Language of the Machine

Neural networks cannot read English. They process numbers. To bridge this gap, we use Tokenization.

Raw text is converted into a sequence of symbols drawn from a finite vocabulary. These symbols are called "tokens." Algorithms like Byte Pair Encoding (BPE) scan the text to find common patterns.

The Process: BPE starts from small units (individual bytes or characters) and repeatedly merges the most frequent adjacent pair into a new single token. Each merge adds one entry to the vocabulary and shortens the sequence.
The Vocabulary: Modern models like GPT-4 have a vocabulary of 100,277 unique tokens.

Analogy: If raw text is a giant roll of movie film, tokenization is like creating a specialized shorthand dictionary. Instead of writing out every separate letter, you combine common character sequences (such as "hello" or " world") into single symbols, making the "film strip" shorter and more efficient to process.
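To make the merging idea concrete, here is a minimal sketch of BPE training. It assumes we start from raw UTF-8 bytes (ids 0 to 255) and assign new ids from 256 upward. The function names are hypothetical, and production tokenizers add many details (pre-splitting rules, special tokens) that are left out here.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of token ids, or None."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Start from raw bytes and learn up to `num_merges` merge rules."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        new_id = 256 + step
        tokens = merge_pair(tokens, pair, new_id)
        merges[pair] = new_id
    return tokens, merges


text = "hello world hello world hello"
tokens, merges = train_bpe(text, num_merges=5)
print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Each merge trades a slightly larger vocabulary for a shorter sequence, which is the same trade-off the shorthand-dictionary analogy describes.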

2.4 Neural Network Training: The Soundboard

Once the data is tokenized, the Transformer neural network begins its training. Its sole objective is simple: Predict the next token.

The model looks at a sequence of tokens (the context) and outputs a probability distribution over what token comes next.

The Mechanism: The network has billions of "parameters" (weights). During training, it makes a guess. If it's wrong (high "loss"), we mathematically nudge those billions of parameters to be slightly more accurate next time.

Analogy: Imagine a massive soundboard with billions of knobs (parameters). The training process is like repeatedly playing a song (the internet text) and fine-tuning those knobs. If the song sounds wrong (high loss), you nudge the knobs a tiny bit so that the correct note (token) is more likely to play next time.
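The "guess, measure the loss, nudge" loop can be sketched in a few lines. This is a toy under a loud assumption: instead of billions of parameters inside a Transformer, the only "knobs" here are three raw scores (logits), one per word in a made-up vocabulary. The loss is the standard cross-entropy for next-token prediction, and the nudge is one step of plain gradient descent.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Loss is low when the correct token gets high probability."""
    return -math.log(probs[target])


def nudge(logits, target, learning_rate=0.5):
    """One gradient-descent step on the logits.

    For softmax followed by cross-entropy, the gradient of the loss with
    respect to logit i is p_i - 1 for the target and p_i for every other token.
    """
    probs = softmax(logits)
    return [
        logit - learning_rate * (p - (1.0 if i == target else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


vocab = ["mat", "dog", "moon"]   # toy vocabulary (hypothetical)
logits = [0.0, 0.0, 0.0]         # the model starts with no preference
target = vocab.index("mat")      # the training text says "mat" comes next

losses = []
for _ in range(5):
    losses.append(cross_entropy(softmax(logits), target))
    logits = nudge(logits, target)

print([round(loss, 3) for loss in losses])
```

The printed losses shrink on every step: each nudge makes the correct token a little more likely, which is all that "training" means at this level.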

2.5 The Result: The Base Model

The output of this stage is the Base Model (e.g., Llama 3 Base). It is not an assistant. It is an Internet Document Simulator.

Behavior: If you feed it the first line of a Wikipedia entry, it can recite the rest. But if you ask it, "What is 2+2?", it might just continue the sequence with philosophical rambling or ask, "What is 3+3?"

Analogy: The base model is like a gifted mimic who has read the entire internet. It acts like a very expensive, statistical autocomplete.
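A tiny word-level autocomplete shows why a base model may answer a question with another question. This sketch is a deliberate simplification: it only counts which word follows which (a bigram model) in a made-up three-line corpus, whereas a real base model is a Transformer conditioned on a long context. The corpus and function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigram(words):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def continue_text(counts, prompt, length, rng):
    """Extend the prompt by sampling each next word in proportion to its count."""
    output = list(prompt)
    for _ in range(length):
        options = counts.get(output[-1])
        if not options:
            break  # this word was never followed by anything in the corpus
        candidates = list(options.keys())
        weights = list(options.values())
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
    return output


# A made-up "internet" that contains questions but no answers.
corpus = "what is 2+2 ? what is 3+3 ? what is the answer ?".split()
counts = train_bigram(corpus)

rng = random.Random(0)
continuation = continue_text(counts, ["what", "is", "2+2", "?"], 3, rng)
print(" ".join(continuation))
```

In this corpus a question mark is always followed by "what", so the model never answers. It continues the document with the start of another question, because that is the pattern it has seen.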

3.0 Stage 2: Supervised Fine-Tuning (Programming the Assistant)

To turn the Base Model into something useful, we need to teach it etiquette. This is Supervised Fine-Tuning (SFT).

3.1 The Dataset Shift

We discard the massive internet dataset. We replace it with a high-quality, curated dataset of human-assistant conversations.

The Scale: This dataset is tiny compared to the internet—perhaps 100,000 examples instead of trillions.
The Goal: We are no longer teaching it facts; we are teaching it a format. We are teaching it the pattern of "User asks question -> Assistant gives helpful answer."

Analogy: If pre-training taught the model how to speak every language in the world, SFT is like enrolling it in a highly specialized etiquette academy. It learns not just what to say, but how to behave.

3.2 The Role of Human Labelers

Companies hire human contractors ("labelers") to write these ideal conversations. They are given strict instructions: "Be helpful, truthful, and harmless."

The Simulation: When you talk to ChatGPT, you are essentially asking the ghost of a highly trained, rule-bound employee to instantaneously write you an answer based on their training manual. The model is a "statistical simulation of a labeler."

3.3 Conversation Tokenization

To the model, a conversation is still just a string of tokens. To give it structure, we introduce special "control tokens" like <|im_start|> and <|im_end|>.

Analogy: These are theatrical stage directions. They act as cues: "User begins speaking here," "Assistant speaks now," or "End of thought."
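Here is a minimal sketch of how a conversation might be flattened into one string before tokenization. The two control tokens come from the text above. The exact layout (role name, then a newline, then the message) is an assumption for illustration, since each model family defines its own template.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_conversation(messages, add_generation_prompt=True):
    """Flatten a list of {"role", "content"} dicts into one string."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model's next tokens are its reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


rendered = render_conversation([{"role": "user", "content": "What is 2+2?"}])
print(rendered)
```

The rendered string ends with an open assistant turn. From the model's point of view, answering is just continuing this document until it emits the end-of-turn token.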

4.0 The Cognitive Quirks (Why It Acts The Way It Does)

The architecture described above leads to specific psychological traits in LLMs.

4.1 Hallucinations: The Confidence Trap

The model was trained on the internet, where people rarely say "I don't know." They state things confidently.

The Mechanism: When the model doesn't know an answer, it doesn't "know" that it doesn't know. It simply predicts the most statistically probable next words.
The Result: It produces plausible-sounding nonsense. It is simulating a confident expert, even when it is wrong.

Analogy: The LLM is like a student who has learned that the best way to get a good grade is to always provide a confident answer. Even when it has no idea, it fills the blank with the most statistically plausible text.

Mitigation: We must explicitly train the model (during SFT) to say "I don't know" or "I don't remember" when it encounters uncertainty.

4.2 Working Memory vs. Vague Recollection

Parameters (Long-Term Memory): The knowledge stored in the model's weights is "lossy." It's like a vague recollection of a book you read years ago.
Context Window (Working Memory): The text you type into the chat (and the documents you upload) is perfectly preserved in the "context window." This information is directly accessible.

Analogy: The parameters are like long-term memory; you remember the themes of a book. The context window is like having the book open right in front of you—that information is immediately usable and precise.

4.3 The Need for "Thinking Tokens"

An LLM does a roughly fixed amount of computation for every token it generates. It cannot "pause and think" before generating the first word.

The Implication: If you pose a complex math problem and the model tries to answer immediately, it will often fail. It essentially has to solve the problem in one mental breath.
The Solution: Chain of Thought. By forcing the model to write out the steps ("First, let's calculate X..."), we are giving it more time (more tokens) to compute the solution. We are spreading the computation out.

Analogy: An LLM is a calculator that can only handle one tiny operation per second. If you ask for a complex final number immediately, it fails. If you let it write out the intermediate steps, it succeeds because each small step fits within its limit.

4.4 "Swiss Cheese" Capabilities

LLMs are characterized by "Swiss cheese" competence. They can be brilliant at complex tasks (Olympiad physics) but fail randomly at simple ones (counting letters in "strawberry").

The Cause: This is often tied to tokenization. The model doesn't see individual letters; it sees tokens (chunks).

Analogy: Imagine a genius who can perform brain surgery but can't tie their shoelaces because their eyes (the tokenizer) translate the laces into shapes they don't understand.
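The gap between what we see and what the model sees can be shown directly. In this sketch the split of "strawberry" into three pieces and the ids attached to them are hypothetical. A real tokenizer may split the word differently, but the point holds for any split.

```python
word = "strawberry"

# Character view: what a human sees when asked to count letters.
letters = list(word)
r_count = letters.count("r")

# Token view: a hypothetical split with hypothetical ids, for illustration only.
hypothetical_vocab = {"str": 101, "aw": 102, "berry": 103}
token_pieces = ["str", "aw", "berry"]
token_ids = [hypothetical_vocab[piece] for piece in token_pieces]

print("characters:", letters, "-> r appears", r_count, "times")
print("model input:", token_ids)
```

Nothing in the list of ids says how many times "r" occurs. The model has to have memorized the spelling hidden behind each token, which is why letter-level questions can trip it up.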

5.0 Stage 3: Reinforcement Learning (The Refinement)

This is the frontier. While SFT relies on copying humans, Reinforcement Learning (RL) allows the model to learn from its own successes and failures.

5.1 The "Practice Problem" Analogy

SFT: Copying the teacher's notes exactly.
RL: Doing practice problems where you only check the final answer.

The model tries to solve a problem 100 different ways. We check the answers. The paths that led to the correct answer are "reinforced" (made more probable). The paths that failed are discouraged.
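The reinforce-and-discourage loop can be sketched with a toy policy. Everything here is an assumption made for illustration: the "model" is just three preference scores, the three solution paths for 17 * 24 are hypothetical, and the update is a simple REINFORCE-style rule with a batch-average baseline. Real systems update a neural network over whole token sequences.

```python
import math
import random


def softmax(prefs):
    """Convert preference scores into probabilities."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical "solution paths" for a multiplication problem.
STRATEGIES = {
    "round_and_guess": lambda a, b: round(a, -1) * round(b, -1),
    "add_by_mistake": lambda a, b: a + b,
    "split_into_steps": lambda a, b: a * (b // 10 * 10) + a * (b % 10),
}


def train(num_rounds=50, rollouts_per_round=100, learning_rate=1.0, seed=0):
    rng = random.Random(seed)
    names = list(STRATEGIES)
    prefs = [0.0] * len(names)
    a, b = 17, 24
    correct = a * b
    for _ in range(num_rounds):
        probs = softmax(prefs)
        # Try the problem many times, sampling a path on each attempt.
        picks = rng.choices(range(len(names)), weights=probs, k=rollouts_per_round)
        # Only the final answer is checked.
        rewards = [1.0 if STRATEGIES[names[i]](a, b) == correct else 0.0 for i in picks]
        baseline = sum(rewards) / len(rewards)
        for i, reward in zip(picks, rewards):
            advantage = reward - baseline  # positive if better than average
            for j in range(len(prefs)):
                indicator = 1.0 if j == i else 0.0
                step = learning_rate * advantage * (indicator - probs[j])
                prefs[j] += step / rollouts_per_round
    return dict(zip(names, softmax(prefs)))


final_policy = train()
for name, prob in final_policy.items():
    print(f"{name}: {prob:.3f}")
```

All three paths start out equally likely. Nobody tells the policy which path is right. The one that reaches 408 gains probability simply because its final answers keep checking out.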

5.2 Verifiable Domains and AlphaGo

RL works best in domains where we can objectively verify the answer (Math, Coding, Chess).

The AlphaGo Effect: If you only train on human data, the AI is capped at roughly the level of the humans it imitates. By using RL (playing against itself), AlphaGo discovered strategies such as "Move 37," a move that human experts considered extremely unlikely to be played.
Emergent Thinking: In LLMs, RL leads to the spontaneous emergence of "cognitive strategies" or chains of thought—like self-correction, backtracking, and re-evaluation—that significantly improve accuracy.

5.3 RLHF: When We Can't Verify the Answer

For creative tasks (writing a poem, summarizing an email), there is no "correct" answer. We cannot run automatic checks.

The Solution: Reinforcement Learning from Human Feedback (RLHF).
The Process:
1. The model generates two poems.
2. A human picks the better one (ranking).
3. We train a separate AI (the Reward Model) to learn what humans like.
4. We then use this Reward Model to train the main LLM.
The Indirection Trick: We use a "robotic critic" (Reward Model) to grade the AI because grading millions of poems by hand is impossible.
The Upside: It exploits the "discriminator-generator gap"—it's easier for humans to judge a joke than to write one.
The Downside: The Reward Model is gameable. If RL is run against it for too long, the main model finds "adversarial examples": nonsensical responses that trick the Reward Model into giving a high score.
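Step 3 of the process above, training the Reward Model from human rankings, can be sketched as follows. The assumptions are significant: the reward model here is a linear function of three hand-made features (hypothetical: rhymes, stays on topic, fraction of repeated words), and the loss is the commonly used pairwise form -log(sigmoid(r_preferred - r_rejected)). Real reward models are neural networks that read the full text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Linear reward model: a weighted sum of hand-made features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(comparisons, num_features, epochs=200, learning_rate=0.1):
    """Fit weights so preferred responses score higher than rejected ones."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Gradient descent on -log(sigmoid(margin)): the step shrinks
            # as the model becomes more confident in the human's choice.
            scale = 1.0 - sigmoid(margin)
            for k in range(num_features):
                weights[k] += learning_rate * scale * (preferred[k] - rejected[k])
    return weights


# Each pair is (features of the poem the human picked, features of the other).
# Feature order: [rhymes, on_topic, repeated_word_fraction].
comparisons = [
    ([1.0, 1.0, 0.0], [0.0, 1.0, 0.0]),  # the rhyming poem was preferred
    ([1.0, 1.0, 0.1], [1.0, 0.0, 0.1]),  # the on-topic poem was preferred
    ([0.0, 1.0, 0.0], [0.0, 1.0, 0.8]),  # the less repetitive poem was preferred
]
weights = train_reward_model(comparisons, num_features=3)
print([round(w, 2) for w in weights])
```

The learned weights reward rhyme and relevance and penalize repetition. They also hint at why a reward model is gameable: it can only score what it measures, so a response that maximizes these signals earns a high score whether or not a human would actually like it.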

6.0 Future Capabilities

6.1 Multimodality

Models are becoming native multimodal engines. They can "see" and "hear" not by using separate tools, but by tokenizing audio and images just like text.

Analogy: Instead of just being an expert in books, the LLM is learning to speak the language of movies and music. It sees a video not as a file, but as a sequence of descriptive tokens.

6.2 Agents and Long-Running Tasks

Future models will evolve into agents capable of stringing together multiple tasks over time.

The Shift: From solving single queries to executing complex jobs (e.g., "Plan my business trip").
The Human Role: Humans will transition from operators to "supervisors of agent tasks."

7.0 Glossary of Terms

Base Model

The output of the first training stage (Pre-training); it is an internet document simulator, capable of generating statistical remixes of its training data but not yet trained to be a helpful assistant.

Context Window

The sequence of tokens the model is currently processing. This is the model's "working memory," and any information here is directly and accurately accessible.

Hallucination

The act of an LLM fabricating information or making stuff up, usually because it is statistically imitating the confident tone of its training data when it is actually uncertain.

Inference

The process where the trained LLM is used to generate new data, starting from a prefix (the prompt) and sequentially sampling tokens.

Loss

A single number used during training that indicates how far the model's prediction is from the correct answer. The goal of training is to minimize this number.

Parameters (Weights)

The billions of internal numerical "knobs" within the neural network that store the model's acquired knowledge and statistical patterns.

Pre-training

The first, most computationally expensive stage of training, where the neural network builds a general knowledge base by predicting the next token across a vast dataset of internet documents.

Reinforcement Learning (RL)

The third stage of training (following SFT) where the model learns to improve performance, especially in problem-solving, by trying many solutions (rollouts) and reinforcing the sequences of tokens that reliably lead to a correct answer.

Reward Model (RM)

A separate neural network trained during RLHF to simulate human judgment by predicting the score or ranking a human would give to a generated response.

RLHF (Reinforcement Learning from Human Feedback)

A technique used to apply RL in unverifiable domains (like creative writing) by training the model against a simulated human scorer (the Reward Model).

Stochastic System

A system whose outcomes are determined by probability and chance (randomness). LLMs are stochastic because they sample the next token from a probability distribution, meaning the same prompt can yield different answers on different runs.

Supervised Fine-Tuning (SFT)

The second training stage, which replaces the base model's training data with human-assistant conversations to program the model to adopt the behavior and persona of a helpful assistant.

Token

The basic "atom" or unit of text (which can be a word, part of a word, or punctuation) that LLMs process. Raw text is always converted into a one-dimensional sequence of these symbols.

Transformer

The specific type of neural network architecture used in modern LLMs that processes the sequences of input tokens to generate output probabilities.

Verifiable Domains

Areas where the correctness of a solution can be objectively checked, such as mathematics, coding, or facts with known answers.

8.0 Conclusion: The Simulation

When you interact with a modern AI, you are witnessing the culmination of this massive pipeline. You are interacting with a base of knowledge compressed from a large portion of the internet, refined by the behavioral guidelines of human contractors, and polished by the trial-and-error of reinforcement learning.

It is a tool of immense power, but it is not a human mind. It is a stochastic system, a probability machine. Understanding how it is built is the first step to mastering how to use it.

Sources & Acknowledgements

This guide is synthesized from the expert analysis of Andrej Karpathy, specifically his comprehensive lecture on the LLM training pipeline. The insights on data pipelines, tokenization, and the distinction between SFT and RLHF are drawn directly from his work.
