How do LLMs generate text?
Every day, billions of prompts are sent to ChatGPT, Claude, and other large language models. Each prompt returns fluent, human-like text. However, most people don't realize that the model doesn't actually understand meaning and plan its response like a human would. At every step, it is simply performing mathematical operations to predict the next token.
And it all starts with the tokenizer...
Tokenizer
LLMs don't read words. They read tokens.
A token is just an integer. When you enter a prompt, your text is first broken into tokens, and each token is mapped to a number. Those numbers are then converted into numerical vectors (e.g. [-0.33, 0.15, 0.01, 0.32, 0.00]), called embeddings, which are what the model actually operates on.
But why does this step exist at all? Why can't an LLM just "see" words?
At the lowest level, every neural network layer is just linear algebra: a layer computes something like output = W · input + b, a matrix multiplication followed by a bias.
This computation requires all inputs to be numeric, and of a fixed size.
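As a rough sketch (the sizes and values here are illustrative, not from any real model), a single layer boils down to this:

```python
import numpy as np

# A toy fully connected layer: output = W @ x + b.
# The shapes are fixed when the layer is created, which is why
# inputs must be numeric vectors of a known, fixed size.
rng = np.random.default_rng(0)

d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))   # weight matrix
b = rng.normal(size=d_out)           # bias vector

x = np.array([-0.33, 0.15, 0.01, 0.32])  # a numeric, fixed-size input
y = W @ x + b                            # plain linear algebra

print(y.shape)  # (3,)
```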
Text, by contrast, is variable-length, ambiguous (words, spaces, punctuation), and totally open-ended. Tokenization bridges this gap: it turns any text into a sequence of integers, ensuring every possible input can be mapped onto fixed-length vectors.
The tokenizer does not "understand" meaning either. Instead, it follows a fixed set of rules designed to turn text into pieces that are efficient and reusable, splitting text into frequently occurring character patterns.
Before the LLM itself is trained, the tokenizer is trained on a massive text dataset. The process looks a bit like this:
- Start with very small units, often individual characters or bytes.
- Count how often adjacent units appear together.
- Merge the most frequent pair into a new token.
- Repeat until a fixed vocabulary size (number of tokens) is reached.
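The merge loop above can be sketched in a few lines of Python. This is a toy byte-pair-encoding trainer, not any production tokenizer, and the training text is made up:

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)          # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
        merges.append(a + b)
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("low lower lowest", 4)
print(merges)  # frequent character pairs become new tokens, e.g. "lo", then "low"
```

Run on real web-scale text instead of one phrase, this same loop is what turns "ing", "tion", and "the" into single tokens.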
Over time, common character sequences become single tokens.
For example:
- "ing" appears often → becomes a token
- "tion" appears often → becomes a token
- "pre" appears often → becomes a token
Common words like "the" become a single token. Uncommon or long words, such as "Multidisciplinary", are broken into multiple subword tokens.
Embeddings
At this point, the model still has nothing but integers. Turning those integers into something the network can meaningfully work with is the role of embeddings.
A token ID is just a number. For the model to "understand" what it means, every token is converted into a vector with hundreds or thousands of dimensions, called an embedding: a list of numbers representing the token's meaning.
The length of this vector is called the embedding dimension, a fixed choice made when the model is designed. GPT-3 uses 12,288 dimensions, while much smaller models like LLaMA-3.1 8B use 4,096.
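Conceptually, the embedding layer is just a lookup into one big matrix, indexed by token ID. The sizes below are tiny and purely illustrative:

```python
import numpy as np

vocab_size, embed_dim = 10, 5   # real models: tens of thousands of tokens, thousands of dims
rng = np.random.default_rng(0)

# One row per token ID; each row is that token's embedding vector.
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = [3, 7, 3]                      # a tokenized prompt
embeddings = embedding_matrix[token_ids]   # simple row lookup

print(embeddings.shape)  # (3, 5): one fixed-length vector per token
```

Note that the two occurrences of token 3 get identical rows here; it is the later transformer layers, not the lookup, that make them diverge based on context.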
So does size actually matter?
The embedding dimension is both the length of the embedding vector and, conceptually, the number of degrees of freedom the model has to represent how a token behaves. Lower-dimensional embeddings are more compressed and force many meanings to overlap, while higher-dimensional embeddings give the model more room to represent subtle distinctions. Larger embeddings allow the model to encode more semantic relationships at once, but increasing the size also raises memory and compute costs. Beyond a certain point, the returns begin to diminish.
Rather than behaving like a dictionary lookup, an embedding evolves as it passes through each layer, with context continuously refining it, similar to how a thought takes shape. During training, the model adjusts embeddings so that tokens used in similar contexts get similar vectors, and tokens that behave differently move farther apart.
Embedding Matrix
[Figure: the embedding matrix. Colours and numbers are tanh-normalised into [-1, 1] for visualisation; real embeddings are real-valued and not naturally bounded. At random initialisation, all rows look chaotic, with no visible pattern.]
After training on massive amounts of text, the embedding space develops structure. Relationships emerge naturally as a result of optimization, not explicit rules.
A famous king–queen analogy in word embeddings came from early Word2Vec research. It showed, for the first time, that embeddings capture relationships, not just similarity, and that meaning can be manipulated using linear algebra. Concepts such as gender, tense, and countries appear as consistent vector directions. These clusters are how the model "understands" that words are related, not because anyone explicitly told it, but because they appear in similar contexts.
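The analogy can be sketched with toy, hand-picked vectors. Real Word2Vec embeddings have hundreds of learned dimensions; these three are made up purely to illustrate the arithmetic:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: how aligned two vectors are."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-crafted 3-d vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
vectors = {
    "king":  np.array([0.9,  0.7, 0.1]),
    "queen": np.array([0.9, -0.7, 0.1]),
    "man":   np.array([0.1,  0.7, 0.3]),
    "woman": np.array([0.1, -0.7, 0.3]),
}

# king - man + woman: remove the "male" direction, add the "female" one.
result = vectors["king"] - vectors["man"] + vectors["woman"]
nearest = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(nearest)  # queen
```

In trained embeddings the same arithmetic works only approximately, but the principle is identical: consistent relationships show up as consistent vector directions.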
Transformers
At this stage, each token's embedding is still isolated. The embedding for "light" is the same whether the sentence is "Turn on the light" or "This weight is very light". These vectors are standalone and context-free; they do not know what is around them.
If the model treated each token independently, it would always represent "light" in the same way, making it impossible to choose the correct next word.
So the model must let tokens influence each other, updating each token's vector using information from the surrounding tokens.
This is the job of the transformer architecture, introduced in Attention Is All You Need. Transformers have no built-in sense of order: instead of processing text left-to-right like older sequential models, they process all tokens in parallel and compute relationships between them.
In self-attention, each token produces three vectors:
- a query (Q): what am I looking for?
- a key (K): what do I offer?
- a value (V): what information do I carry?
Each token compares its query against the keys of all other tokens. The strongest matches receive the most attention, and their values are combined to update the token's representation.
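A single attention head can be sketched in a few lines of numpy. The weights and sizes below are random placeholders, not a real model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q = X @ Wq                          # queries: what each token looks for
    K = X @ Wk                          # keys: what each token offers
    V = X @ Wv                          # values: what each token carries
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # all-pairs query-key similarity
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                  # each token: weighted mix of all values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                 # six tokens, 8-dim embeddings (toy sizes)
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8): same shape, but every row now mixes in context
```

Multi-head attention simply runs several copies of this function with different learned weight matrices and concatenates the results.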
In the sentence "The dog chased the cat because it was fast", self-attention lets the token "it" draw information from nearby tokens like "dog", "cat", and "fast", so the model can infer what "it" refers to based on context, rather than order.
This attention calculation runs multiple times in parallel using multi-head attention. Each head can focus on different types of relationships simultaneously, such as syntax, semantics, or long-range dependencies.
Long-range dependencies occur when a word needs to relate to another word far away in the sequence. Consider the sentence: "The book that the professor whom the students admired wrote was published last year."
The key relationship is between "book" (the subject) and "was published" (the verb). These two parts are far apart, with a lot of extra detail in between. The model needs to understand that "was published" refers to "the book", not "the professor" or "the students".
In older sequential models like RNNs, information about "the book" would fade as the sentence gets longer, making it hard to connect distant tokens. With attention, one head can focus directly on the relationship between "book" and "published", even though they're many tokens apart. Meanwhile, other heads might focus on syntax (who did what), local phrases ("the students admired"), or semantics (who is an author vs reader).
Each attention pass can only mix information once. For this reason, the same computation is repeated across many stacked layers, refining the representations step by step. The output of one layer becomes the input to the next.
Early layers tend to encode local word interactions and short-range syntax. Middle layers capture sentence structure and relationships between phrases. Later layers encode higher-level semantics and more abstract patterns useful for prediction. Each layer makes a small update, but stacked together, these updates produce rich and expressive representations. Depth gives the transformer its expressive power.
The Loop
At each step, the model produces a probability distribution over all possible next tokens, given the full context so far. One token is selected from this distribution and appended to the input.
After that, the exact same computation runs again. Nothing new is introduced; each iteration differs only because the context is one token longer.
The loop repeats until the model stops. From the outside, this looks like the model is writing sentences, but really it's just performing next-token prediction over and over again.
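The whole loop can be caricatured in a few lines. The `model` function here is a stand-in for the entire transformer stack described above (it just returns deterministic fake logits); the structure of the loop itself is faithful:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 100
EOS = 0  # a designated end-of-sequence token

def model(context):
    """Stand-in for the transformer: returns one logit per vocabulary token."""
    toy = np.random.default_rng(sum(context))  # fake, but deterministic
    return toy.normal(size=vocab_size)

def generate(prompt_ids, max_new_tokens=20, temperature=1.0):
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(context)                    # one full forward pass
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                       # softmax → probability distribution
        next_id = rng.choice(vocab_size, p=probs)  # sample one token from it
        context.append(next_id)                    # the context grows by one token
        if next_id == EOS:                         # the model decides to stop
            break
    return context

print(generate([5, 17, 42]))
```

Every generated token comes from re-running the same forward pass on a context that is one token longer; nothing else changes between iterations.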
When you zoom out, nothing magical is happening. An LLM repeatedly predicts the next token based on everything it has seen so far. The power comes from scale: with enough data, parameters, and compute, this simple loop produces representations rich enough to support language, reasoning, and creativity.