Why do LLMs need GPUs?
In the last few years, the world's largest tech companies have rebuilt their infrastructure around GPUs. They are ordering chips years in advance, constructing new data centres, and rewriting their software stacks to run almost entirely on GPU clusters.
This did not happen because CPUs suddenly became bad. It happened because modern workloads, especially large language models, demand a kind of computation that CPUs were never designed to handle efficiently.
CPU vs GPU
Before GPUs, CPUs were designed to do everything. Their goal was flexibility: running operating systems, handling branching logic, making decisions, and executing many different kinds of instructions.
They were optimized for low-latency, sequential work, typically executing one instruction stream per core. For decades, faster computers meant increasing CPU clock speeds, improving instruction pipelines, and adding smarter cache hierarchies. Clock speeds rose steadily from a few megahertz in the 1980s to around 3–4 GHz by the early 2000s.
But around the mid-2000s, this approach hit a physical wall. Higher clock speeds caused power consumption and heat dissipation to grow too quickly, a problem often summarised as the power wall and Dennard scaling breakdown. By roughly 2005, clock speeds largely stopped increasing, and CPUs could no longer get faster simply by running at higher frequencies.
Instead, CPUs evolved by adding more cores. Single-core processors gave way to dual-core and quad-core CPUs, and modern desktop CPUs now commonly have 8–32 cores. However, each core remained complex and optimised for control-heavy workloads, meaning CPUs could only exploit limited parallelism compared to more specialised hardware.
In 1999, NVIDIA released the GeForce 256, widely recognised as the first GPU (a term NVIDIA themselves coined). It differed from other graphics cards at the time because it was the first chip to integrate hardware Transform and Lighting (T&L). Before this, the CPU handled geometry maths and lighting calculations, things like matrix–vector multiplication, vector addition, and dot products, while the graphics card mostly focused on drawing pixels.
With the release of the GeForce 256, this geometry and lighting work could now be performed directly on the graphics hardware, shifting a large amount of mathematical computation away from the CPU. This marked a fundamental shift in how workloads were divided between the CPU and the GPU, with the GPU taking responsibility for large, highly parallel mathematical operations.
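The geometry work that hardware T&L took over is exactly this kind of small, uniform linear algebra: the same multiply-and-add pattern applied to every vertex. A minimal sketch in plain Python (the rotation matrix and light direction are made-up illustrative values):

```python
# A single vertex transform of the kind hardware T&L performs:
# multiply a position vector by a transformation matrix, then
# take a dot product with a light direction for simple shading.

def mat_vec(m, v):
    # Matrix-vector multiply: one dot product per output row.
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Example transform: a 90-degree rotation about the z-axis.
rotate_z = [
    [0, -1, 0],
    [1,  0, 0],
    [0,  0, 1],
]

vertex = [1, 0, 0]
transformed = mat_vec(rotate_z, vertex)   # -> [0, 1, 0]

light_dir = [0, 1, 0]
brightness = dot(transformed, light_dir)  # -> 1
```

The key point is not the maths itself but that every vertex in a scene needs the same independent computation, which is what makes it so parallelisable.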
Watch the Mythbusters give the best explanation of the difference between a CPU and GPU: Video
What computation do LLMs actually perform?
Now that we have a solid understanding of CPUs and GPUs, let's look at what a large language model actually does to produce text.
An LLM is a neural network trained to predict the next piece of text given everything it has seen so far. When you prompt an LLM, the model does not reason in words or sentences, and it does not understand concepts like meaning or intent in the way humans do. Instead, everything you write is converted into numbers, and the model applies the same set of mathematical operations to those numbers over and over again.
When you prompt an LLM, your words are transformed into a high-dimensional vector space where semantic relationships become mathematical relationships.
Text is not processed as text. For natural language processing tasks, text must first be converted into numbers.
The tokeniser splits text into tokens, which may be whole words, sub-words, spaces or symbols. Each token maps to an integer ID, and these IDs are specific to a given model.
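A toy illustration of that mapping (real tokenisers use learned sub-word vocabularies such as BPE with tens of thousands of entries; the tiny vocabulary here is invented):

```python
# Hypothetical toy vocabulary: each token string maps to an integer ID.
vocab = {"the": 0, "cat": 1, "sat": 2, " ": 3, "s": 4, "at": 5}

def tokenise(text, vocab):
    # Greedy longest-match tokenisation, a simplified stand-in
    # for the sub-word algorithms real models use.
    tokens = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

print(tokenise("the cat sat", vocab))  # -> [0, 3, 1, 3, 2]
```

Because the IDs are tied to this particular vocabulary, the same text produces different IDs under a different model's tokeniser.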
Inside the model is a learned embedding matrix. This embedding table maps each token ID to a vector of floating-point numbers. It has the shape:
[vocab_size × embedding_dim]

Each row corresponds to a token, and each row is a learned vector.
You can think of the embedding table as a lookup table that turns discrete token IDs into continuous numerical representations. These embeddings start as static vectors, but as they pass through the network, they become dynamic and contextual. Self-attention layers repeatedly update them based on the surrounding tokens.
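The lookup itself is nothing more than row indexing. A minimal sketch with made-up numbers and tiny dimensions (real tables are on the order of 50,000 × 4,096):

```python
# Hypothetical embedding table of shape [vocab_size x embedding_dim],
# here 4 tokens x 3 dimensions, with invented values.
embedding_table = [
    [0.1, 0.2, 0.3],   # token ID 0
    [0.4, 0.5, 0.6],   # token ID 1
    [0.7, 0.8, 0.9],   # token ID 2
    [1.0, 1.1, 1.2],   # token ID 3
]

def embed(token_ids, table):
    # The "lookup" is literally row indexing: token ID -> row vector.
    return [table[tid] for tid in token_ids]

print(embed([2, 0], embedding_table))
# -> [[0.7, 0.8, 0.9], [0.1, 0.2, 0.3]]
```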
At this point, everything inside the model is numerical.
Once text has been converted into vectors, the entire model reduces to matrix multiplication, addition, and simple nonlinear functions applied at massive scale.
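As a sketch, a single feed-forward layer contains all three ingredients: a matrix multiplication, an addition, and a nonlinearity. The sizes and weights below are toy values chosen for illustration:

```python
def matmul(a, b):
    # Multiply an [n x k] matrix by a [k x m] matrix.
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(x):
    # A simple nonlinearity, applied element-wise.
    return [[max(0.0, v) for v in row] for row in x]

# Toy layer: 2 token vectors of dimension 3, projected to dimension 2.
hidden = [[1.0, -2.0, 0.5],
          [0.0,  1.0, 1.0]]
weights = [[ 1.0, 0.0],
           [ 0.0, 1.0],
           [-1.0, 1.0]]
bias = [0.5, -0.5]

out = matmul(hidden, weights)                              # matrix multiply
out = [[v + b for v, b in zip(row, bias)] for row in out]  # addition
out = relu(out)                                            # nonlinearity
print(out)  # -> [[1.0, 0.0], [0.0, 1.5]]
```

A real model stacks thousands of such layers over much larger matrices, which is why the workload is dominated by matrix multiplication and maps so well onto GPU hardware.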
During training, the model is shown enormous amounts of text and adjusts its internal numbers so that its predictions improve over time. After enough training, the model becomes extremely good at continuing text in a way that appears coherent, informative, and sometimes even creative.