
Every time you use a chatbot, ask an AI to summarize a document, or see an image generated from a text prompt, you are interacting with the same underlying idea: a transformer. First described in a 2017 research paper by eight researchers, the majority based at Google, the transformer is not just another incremental improvement in machine learning. It is the architectural leap that made modern AI possible, and understanding how it works is the key to understanding AI itself.
This article is a complete guide to transformer neural networks, written for anyone curious enough to want to understand what is actually happening inside the systems reshaping the world. No prior machine learning knowledge required.
To appreciate what transformers changed, it helps to understand what came before them. For most of the 2010s, the dominant tools for processing language and sequences were Recurrent Neural Networks (RNNs) and their more capable variant, Long Short-Term Memory networks (LSTMs), introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997.[1] These models read text the way a person might read a sentence out loud: one word at a time, left to right, carrying a kind of "memory" of what came before.
The problem was fundamental: that memory was fragile. By the time an RNN had read a long paragraph, the signal from the opening sentence had faded and distorted, a phenomenon researchers call the vanishing gradient problem. LSTMs were designed to mitigate this, using internal "gates" to decide what to remember and what to forget, but they still processed text sequentially. That meant they were slow to train, and they still struggled with very long texts.
Then, in 2014, Dzmitry Bahdanau and colleagues introduced the attention mechanism, originally as an add-on to sequence-to-sequence models for machine translation.[2] The idea was elegant: rather than forcing the model to compress an entire input into a single memory vector, let it look back at all parts of the input when producing each part of the output. Translation quality improved dramatically.
Three years later, a team at Google asked a provocative question: what if you removed the RNN entirely and built a model out of attention alone? The result was the paper "Attention Is All You Need," published in 2017, which introduced the transformer.[3] The answer reshaped artificial intelligence.
Key Milestones in Transformer History

| Year | Milestone |
|---|---|
| 1997 | Hochreiter & Schmidhuber introduce Long Short-Term Memory (LSTM) networks |
| 2014 | Bahdanau et al. introduce the attention mechanism for sequence-to-sequence models |
| 2017 | Vaswani et al. publish "Attention Is All You Need," introducing the transformer |
| 2018 | Google releases BERT (encoder-only); OpenAI releases GPT-1 (decoder-only) |
| 2020 | OpenAI releases GPT-3 (175B parameters); Dosovitskiy et al. introduce the Vision Transformer (ViT); Kaplan et al. publish scaling laws |
| 2021 | DeepMind's AlphaFold 2 achieves breakthrough protein structure prediction using the attention-based Evoformer |
| 2022 | OpenAI releases ChatGPT, bringing transformer-based AI to mass public use |
| 2023 | GPT-4 released; Gu & Dao propose Mamba as a linear-time alternative to attention |
| 2024-2026 | Frontier models (Claude, Gemini, LLaMA, GPT-4 family) dominate; hybrid and MoE architectures gain traction |
Before going further, a quick grounding. A neural network is a computational system loosely inspired by the human brain. It consists of layers of simple mathematical units called neurons (or nodes), each of which receives inputs, applies a mathematical transformation, and passes the result forward to the next layer. When a network is trained, it adjusts the strength of the connections between neurons, called weights, so that its outputs better match the correct answers. Do this with enough data and enough compute, and the network learns surprisingly sophisticated patterns.
A transformer is a particular kind of neural network with a specific internal structure. Its key innovation is the way it decides which parts of its input to pay attention to at any given moment.
Computers do not understand words. Before a transformer can do anything with language, text must be converted into numbers. This happens in two steps.
First, the text is broken into small units called tokens. A token is usually a word or a fragment of a word: for example, the word "unhappiness" might be split into "un," "happi," and "ness." Tokenization is a pragmatic compromise: whole words produce too large a vocabulary to be manageable, while individual letters produce sequences so long that patterns become hard to learn.
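As a rough illustration of how subword splitting works, here is a minimal greedy longest-match tokenizer in Python. The vocabulary is a hypothetical hand-picked set chosen to reproduce the "unhappiness" example; real systems learn their subword vocabularies from data with algorithms such as byte-pair encoding (BPE):

```python
# Toy greedy longest-match tokenizer. VOCAB is hypothetical, chosen only
# to illustrate subword splitting; real vocabularies are learned (e.g. BPE).
VOCAB = {"un", "happi", "ness", "happy", "the", "cat"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining prefix that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary match: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
```

Real tokenizers differ in detail (they handle whitespace, punctuation, and bytes), but the core idea is the same: cover the text with pieces drawn from a fixed, learned vocabulary.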
Each token is then mapped to a vector - a list of numbers typically several hundred or even thousands of values long - called an embedding. Think of an embedding as a coordinate in a vast, high-dimensional space, where similar words occupy nearby positions. The word "king" and the word "queen" would sit close to each other; "king" and "bicycle" would be far apart. These positions are learned during training and carry real semantic meaning.
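To make the geometric intuition concrete, here is a toy sketch with hypothetical 4-dimensional vectors (real embeddings are hundreds or thousands of dimensions long and learned during training). Cosine similarity measures how close two vectors point in that space:

```python
import math

# Hypothetical 4-dimensional embeddings, invented for illustration.
emb = {
    "king":    [0.9, 0.8, 0.1, 0.0],
    "queen":   [0.8, 0.9, 0.1, 0.1],
    "bicycle": [0.0, 0.1, 0.9, 0.8],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(emb["king"], emb["queen"]))    # high: nearby in the space
print(cosine(emb["king"], emb["bicycle"]))  # low: far apart
```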
Here is where a subtle problem arises. Unlike RNNs, which process tokens one by one and therefore inherently know that token 3 came after token 2, a transformer processes all tokens simultaneously. This is what makes it fast, but it also means the model has no built-in sense of word order. The sentence "the dog bit the man" would look identical to "the man bit the dog" without an additional signal.
The solution is positional encoding: a mathematical pattern added to each token's embedding that encodes its position in the sequence. The original transformer used sine and cosine functions of different frequencies to generate these position signals, ensuring each position gets a unique fingerprint.[3] More recent architectures use learned positional encodings or more sophisticated schemes like RoPE (Rotary Position Embedding), but the principle is the same: inject position information into the token representation before processing begins.
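The original sinusoidal scheme can be sketched in a few lines of Python. This follows the sine/cosine formulation from the 2017 paper; production implementations are vectorized, but the values are the same:

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):      # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(6, 8)
# Each position gets a unique fingerprint; position 0 is [0, 1, 0, 1, ...].
print(pe[0][:4])  # [0.0, 1.0, 0.0, 1.0]
```

These vectors are simply added to the token embeddings, so the same word at two different positions enters the network with two slightly different representations.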
The transformer's defining feature is self-attention, a mechanism that allows every token in a sequence to directly consider every other token when computing its representation. This is the breakthrough that made transformers so powerful.
Consider the sentence: "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to - the trophy or the suitcase? A human reader instantly resolves this from context. Self-attention is the mechanism that enables a transformer to make the same kind of inference.
For each token, the transformer creates three vectors from its embedding:
Query (Q): "What am I looking for?" A representation of what this token needs from the rest of the sequence.
Key (K): "What do I offer?" A representation of what each token has to contribute.
Value (V): "What is my actual content?" The information that will be passed forward if this token is deemed relevant.
To compute attention, each token's Query is compared against every other token's Key using a dot product (a standard mathematical measure of similarity). The resulting scores are divided by the square root of the key dimension to keep them in a stable range, then normalized by a softmax into a set of weights that sum to one. Each token then collects a weighted sum of all tokens' Values, where the weights reflect those attention scores.[3] The result is a new, context-enriched representation of each token that reflects its relationships with every other token in the sequence.
In plain terms: every word gets to ask every other word "are you relevant to me right now?" and updates its own meaning accordingly. The word "it" in the trophy sentence will assign high attention weight to "trophy" because the surrounding context makes that connection strong.
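Put together, the whole computation is a handful of matrix operations. The sketch below uses NumPy with random weight matrices purely for illustration; in a real model, X comes from the embedding layer and the Q/K/V projections are learned during training:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over token embeddings X
    (shape: seq_len x d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # query/key/value per token
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # every Query vs. every Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                         # weighted sum of Values

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(5, d_model))              # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one context-enriched vector per token
```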
One round of self-attention captures one type of relationship. But language is rich with overlapping relationships: grammatical, semantic, co-referential, and more. A transformer addresses this with multi-head attention, running several self-attention operations in parallel, each with its own set of Q, K, and V weight matrices.
Each "head" learns to focus on a different kind of relationship. One head might track which nouns are subjects of which verbs. Another might track pronoun co-references. A third might detect negation patterns. The outputs of all heads are concatenated and projected back to the model's working dimension, giving the model a rich, multi-perspective representation of every token.[3] The original transformer used 8 attention heads; modern large models use many more.
The original transformer was designed for translation and was built in two halves:
The encoder reads the input sequence (e.g., an English sentence) and builds a rich contextual representation of it. Each encoder layer applies self-attention followed by a feed-forward network, with several such layers stacked on top of each other.
The decoder generates the output sequence (e.g., the French translation) one token at a time. It uses a variant called cross-attention to look at the encoder's representations while generating each token, and masked self-attention so it can only attend to tokens it has already produced.
The key is that each half solves a fundamentally different problem. The encoder's job is comprehension: it reads the entire input at once, and because every token can attend to every other token in both directions simultaneously, it builds a rich, context-saturated representation of what the input means. That representation is ideal for tasks where the goal is to understand or classify text rather than produce it. A model doing sentiment analysis, for instance, does not need to generate anything - it just needs to read a sentence and output a label.
The decoder's job is generation: it produces output one token at a time, and to prevent it from "cheating" by looking at future tokens it has not generated yet, its self-attention is masked so each token can only attend to previous positions. This left-to-right, autoregressive structure makes the decoder a natural text generator but a poor standalone text understander, since it never sees the full input context at once. Tasks like translation and summarization benefit from both halves working together - the encoder to understand the source, the decoder to generate the target - but when the task is purely one or the other, the unused half is simply unnecessary weight.
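The masking trick itself is simple to demonstrate: positions a token must not see are set to negative infinity before the softmax, so they receive exactly zero weight. In this toy example all raw scores are zero, so each row spreads its attention uniformly over itself and earlier positions:

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))               # stand-in attention scores
future = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s mark future positions
masked = np.where(future == 1, -np.inf, scores)     # block attention to the future
weights = np.exp(masked)                            # exp(-inf) = 0
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)  # row i attends only to positions 0..i
```

Row 0 attends entirely to itself, row 1 splits attention 50/50 over the first two positions, and so on; the upper triangle is exactly zero.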
Not all modern transformers use both halves. BERT (Bidirectional Encoder Representations from Transformers), introduced by Google in 2018, uses only the encoder, making it excellent at understanding and classifying text.[4] GPT (Generative Pre-trained Transformer), introduced by OpenAI in 2018, uses only the decoder, making it excellent at generating text.[5] This architectural fork produced two distinct families of models that persist to this day.
One of the most striking discoveries of the transformer era is that performance improves predictably as you scale three things: the number of model parameters, the volume of training data, and the amount of compute used for training. This was formalized in a 2020 paper by Jared Kaplan and colleagues at OpenAI, showing that loss decreases as a smooth power law across all three dimensions.[6]
This finding had enormous practical consequences. It meant that building a better AI was, to a significant degree, an engineering and resource problem, not just a research one. It gave the industry a roadmap and accelerated the arms race in model scale that produced GPT-3 (175 billion parameters), GPT-4, Claude, Gemini, and the current generation of frontier models.
The transformer's reach extends far beyond text. In 2020, Alexey Dosovitskiy and colleagues at Google demonstrated that a pure transformer could match or outperform convolutional neural networks on image classification tasks when pre-trained on sufficiently large datasets. The approach divided an image into a grid of patches and treated each patch as a token, yielding a model called the Vision Transformer (ViT).[7] This unlocked a wave of vision and multimodal applications.
Today, transformers process images, audio, video, protein sequences, and code, often within the same model. The architecture proved to be a universal pattern-learner, not merely a language tool.
The table below summarizes how transformers compare to the two main architectures they largely displaced: Recurrent Neural Networks (RNNs/LSTMs) and Convolutional Neural Networks (CNNs).
Transformers vs. RNNs vs. CNNs

| Feature | Transformer | RNN / LSTM | CNN |
|---|---|---|---|
| Processes input | All at once (parallel) | One step at a time (sequential) | In local patches (parallel) |
| Long-range dependencies | Excellent (direct attention) | Poor to moderate (vanishing gradient) | Limited (fixed receptive field) |
| Training speed | Fast (highly parallelizable) | Slow (sequential bottleneck) | Fast (parallelizable) |
| Memory usage | High (scales with sequence length²) | Low to moderate | Moderate |
| Positional awareness | Requires positional encoding | Built-in (sequential order) | Partial (local structure only) |
| Best suited for | Language, vision, multimodal tasks | Time series, older NLP tasks | Image recognition, local pattern detection |
| Scalability | Scales very well with data & compute | Diminishing returns at scale | Scales well for vision tasks |
| Interpretability | Attention maps offer some insight | Difficult to interpret | Moderate (feature maps) |
Parallelism: Because all tokens are processed simultaneously rather than one at a time, transformers can be trained on modern GPU and TPU hardware extremely efficiently. This made it feasible to train on internet-scale datasets.
Long-range context: Self-attention connects any two tokens in a sequence directly, regardless of how far apart they are. This eliminates the vanishing gradient problem that plagued RNNs.
Flexibility: The same core architecture has proven effective for text, images, audio, video, code, and scientific data - a generality unusual in AI research.
Transferability: A transformer pre-trained on a large corpus can be fine-tuned on a smaller, specialized dataset with excellent results. This pre-train/fine-tune paradigm democratized access to powerful AI capabilities.
Scalability: Performance improves reliably with scale, giving researchers and engineers a clear north star.
Quadratic memory scaling: Self-attention requires comparing every token to every other token, meaning memory and compute costs grow with the square of the sequence length. A sequence twice as long requires four times the resources, spawning an entire field of research into "efficient attention."
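The arithmetic behind the quadratic cost is easy to verify: the score matrix has one entry per pair of tokens, so doubling the sequence length quadruples the number of pairwise comparisons:

```python
# Each attention layer builds a seq_len x seq_len score matrix, so the
# pairwise-comparison count grows with the square of sequence length.
def attention_pairs(seq_len: int) -> int:
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_pairs(n):>12,} pairwise scores")

# Doubling the sequence quadruples the work:
assert attention_pairs(2_000) == 4 * attention_pairs(1_000)
```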
Data hunger: Transformers need enormous amounts of training data, raising real concerns about data sourcing, copyright, and environmental cost.
Compute intensity: Training frontier models requires thousands of specialized chips running for weeks or months, consuming vast electricity and costing hundreds of millions of dollars. This concentrates AI capability in a handful of well-resourced organizations.
Opacity: Despite the interpretive utility of attention maps, we still do not fully understand what large transformer models learn or how they arrive at specific outputs - a serious concern for high-stakes applications.
Hallucination: Transformers generate text by predicting likely next tokens based on statistical patterns. They do not verify facts, leading to confident and sometimes entirely fabricated outputs.
Static knowledge: A trained transformer's knowledge is frozen at its training cutoff. It does not learn from new information unless retrained or augmented with retrieval systems.
As of 2026, transformers are not merely the leading architecture: they are the architecture. Every major AI lab's frontier model - OpenAI's GPT-4 family, Anthropic's Claude, Google's Gemini, Meta's LLaMA - is built on transformer foundations, with varying modifications to address efficiency, context length, and multimodality. Transformer-based models have achieved strong performance across a growing range of professional benchmarks. GPT-4, for instance, passed a simulated bar exam at approximately the top 10% of test takers, according to OpenAI's own technical report,[8] while attention-based architectures, including AlphaFold's specialized Evoformer, have achieved superhuman accuracy on protein structure prediction.[9]
At the same time, researchers are actively exploring architectures that address the transformer's known weaknesses. State space models like Mamba use selective state spaces to achieve linear-time sequence modeling, offering a potential alternative to quadratic attention.[10] Mixture-of-experts (MoE) architectures selectively activate only a subset of model parameters for each input, improving efficiency at scale. And hybrid architectures that blend attention with other mechanisms are gaining traction in frontier labs.
Whether transformers will eventually be superseded by something fundamentally different, or simply continue to evolve, remains open. What is clear is that the core insight of the original 2017 paper - that attention, applied richly and in parallel across an entire sequence, is an extraordinarily powerful computational primitive - has proven more generative than even its authors likely anticipated.
The transformer's authors titled their paper "Attention Is All You Need." Nine years on, that title reads less like a hypothesis and more like a prophecy.
[1] Hochreiter & Schmidhuber (1997) - Long Short-Term Memory, Neural Computation
[2] Bahdanau et al. (2014) - Neural Machine Translation by Jointly Learning to Align and Translate, arXiv:1409.0473
[3] Vaswani et al. (2017) - Attention Is All You Need, arXiv:1706.03762
[4] Devlin et al. (2018) - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805
[5] Radford et al. (2018) - Improving Language Understanding by Generative Pre-Training (GPT-1), OpenAI
[6] Kaplan et al. (2020) - Scaling Laws for Neural Language Models, arXiv:2001.08361
[7] Dosovitskiy et al. (2020) - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT), arXiv:2010.11929
[8] OpenAI (2023) - GPT-4 Technical Report, arXiv:2303.08774
[9] Jumper et al. (2021) - Highly Accurate Protein Structure Prediction with AlphaFold, Nature
[10] Gu & Dao (2023) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv:2312.00752