From DeepMind to OpenAI to Anthropic: The Landmark Papers That Changed the History of AI

From the 2017 Transformer to the 2024 Claude internal analysis, we take a sweeping look at 10 papers that shaped the skeleton of modern AI — read through the lens of a Silicon Valley AI researcher. Structured in three acts — Google/DeepMind's "Architecture and Reinforcement Learning," OpenAI's "Scale and Emergence," and Anthropic's "Safety and Interpretability" — each paper is explained as concretely and plainly as possible with examples, followed by an overview of the broader arc and a look ahead. As of June 2026, the companies led by the researchers who wrote these papers — headed by Anthropic (valuation approximately $965 billion) and OpenAI (approximately $852 billion) —

Introduction: Reading Contemporary AI as a "Three-Act Story" Drawn by 10 Papers

Working in AI research in Silicon Valley, one occasionally experiences a peculiar sensation: that nearly all the technology we now take for granted can be traced back to just around ten papers. Chatbots, protein structure prediction, the program that surpassed humanity at Go, the reasoning models that "think before they answer" — all of it is built atop a small number of decisive ideas. The ten papers examined here are precisely those gems.

These papers become far easier to grasp as a coherent story of modern AI when read in three acts. The first act belongs to Google and DeepMind. Google's 2017 paper "Attention Is All You Need" gave birth to the Transformer architecture that underlies every generative AI system today. That same year, DeepMind introduced AlphaGo Zero, which taught itself Go without any human game records, and in 2021 published AlphaFold, which solved the fifty-year grand challenge of protein structure prediction. The themes here are new architectures, "self-improvement" through reinforcement learning, and application to science.

The second act belongs to OpenAI. OpenAI took the naïve — yet at the time almost universally disbelieved — hypothesis that "bigger means smarter," formalized it as a law in their 2020 scaling laws paper, and demonstrated it empirically with GPT-3 that same year. It was here that the world came to know a curious phenomenon called in-context learning: the ability to perform new tasks simply by being shown a handful of examples. Then in 2024, OpenAI unveiled o1, a reasoning model that "thinks before it answers," extending the axis of scaling from training time to inference time.

The third act belongs to Anthropic. Anthropic was founded in 2021 by researchers who left OpenAI under the banner of "understand and make models safe before scaling up their capabilities." They advanced mechanistic interpretability, which dissects the internals of the Transformer as circuits; Constitutional AI, which uses the AI's own feedback to make models harmless; many-shot jailbreaking research, which exposed the risks that appear when in-context learning is scaled to hundreds of examples (the capability side of that scaling was charted by Google DeepMind's many-shot learning paper); and Scaling Monosemanticity, which extracts human-interpretable "features" from production Claude models. The story of capability folds back into a story of understanding and control.

The aim of this essay is not a mere parade of paper summaries. It is to stitch together, from an insider's perspective, how these ten papers form a chain — how they cite one another, what movements of people and clashes of ideas they produced within the Silicon Valley research community. The attentive reader will notice two threads running through all three acts. One is reinforcement learning — a thread connecting AlphaGo Zero's self-play, to Constitutional AI's RLAIF, to o1's reasoning training. The other is in-context learning — discovered in GPT-3, its mechanism unpacked through Transformer circuits, extended by many-shot learning, and made visible through Monosemanticity. With that, let us raise the curtain on the first act.

Attention Is All You Need (2017, Google) — The Foundation on Which All Generative AI Stands

I want to start with the most cited paper in modern AI. "Attention Is All You Need," published in 2017 by eight researchers working at Google Brain and Google Research, discarded the recurrent neural networks (RNNs) that had dominated machine translation and other tasks, and introduced a new architecture called the Transformer that processes text using only attention mechanisms. The title claims, in effect, that attention is everything. At the time it seemed like a provocative joke, but today the claim has largely come true.

Consider a concrete example. For a machine to understand the sentence "He fished at the bank," it needs to determine whether "bank" refers to a financial institution or a riverbank, and to do so by looking at the earlier word "fished." Traditional RNNs read words sequentially from left to right, one at a time, making it difficult to capture relationships between distant words, and their sequential nature prevented parallel computation. The Transformer's self-attention mechanism allows every word in a sentence to "survey" all other words simultaneously, directly computing weights for how much attention each word should pay to every other. The word "bank" attends strongly to "fished," and its representation shifts toward "riverbank": that is the intuition. This is done simultaneously from multiple perspectives (multi-head attention), and word-order information is added separately via "positional encoding."
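The attention weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes hand-picked two-dimensional embeddings (hypothetical values), uses the embeddings directly as queries, keys, and values, and omits the learned projections, multiple heads, and positional encoding of a real Transformer.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a list of token vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings: dimension 0 = "water-related", dimension 1 = "other"
tokens = ["He", "fished", "at", "the", "bank"]
vectors = [[0.0, 1.0], [2.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.5, 0.5]]

outputs, weights = self_attention(vectors, vectors, vectors)
bank_weights = dict(zip(tokens, weights[tokens.index("bank")]))
most_attended = max(bank_weights, key=bank_weights.get)
```

With these made-up vectors, "bank" places its largest weight on "fished," which is the intuition behind disambiguating it as a riverbank. Every row of weights is computed independently of the others, which is why the whole sentence can be processed in parallel.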

This design had two revolutionary implications. First, because the entire sentence can be processed in parallel all at once, it makes full use of GPU capabilities. The paper's large model, trained on just 8 NVIDIA P100 GPUs for only 3.5 days, achieved a BLEU score of 28.4 on the WMT 2014 English-German translation benchmark and 41.8 on English-French — state-of-the-art results at the time, achieved with far less computation. Second, this parallelism is precisely what physically enabled the later scaling strategy of "just make it bigger." Without the Transformer, neither GPT-3 nor Claude could exist.

What is particularly interesting from an inside-Silicon-Valley perspective is what happened to the eight authors of this paper after its publication. Every one of them left Google and went on to become founders and researchers at the core of the modern AI industry. Noam Shazeer co-founded the conversational AI company Character.AI (later returning to Google to help lead Gemini); Aidan Gomez became CEO of Cohere; Ashish Vaswani and Niki Parmar co-founded Essential AI; Llion Jones co-founded Sakana AI; Jakob Uszkoreit co-founded Inceptive, which designs mRNA; Illia Polosukhin moved to the blockchain project NEAR Protocol; and Łukasz Kaiser joined OpenAI. The author list of a single paper became, quite literally, a "family tree" of AI startups in the 2020s. It is also worth noting that Google Brain, the organization behind this paper, merged with DeepMind in April 2023, and the two now operate as a single entity called "Google DeepMind." The work by DeepMind discussed in the next chapter is thus, in hindsight, a story that ended up under the same roof.

Mastering the game of Go without human knowledge (2017, DeepMind) — A "genius from a blank slate" that imitates humans in no way whatsoever

In October 2017, DeepMind published "Mastering the Game of Go without Human Knowledge" in the journal *Nature*. The paper introduced AlphaGo Zero, the successor to the original AlphaGo that had defeated world top player Lee Sedol the previous year — but with one decisive difference. Where the original AlphaGo had trained on a vast library of professional human games, AlphaGo Zero was given only the rules of Go and became strong solely through self-play, without using any human game data whatsoever.

To appreciate just how extraordinary this is, consider an analogy. Imagine a person who, taught by no one, having seen not a single game record, is handed only a board, stones, and a rulebook. They lock themselves in a room, play against themselves for a few days, then emerge to defeat, 100 games to none, the champion who had just beaten one of the greatest players in history. That is essentially what AlphaGo Zero did. Starting from a blank slate, placing stones at random, it used only the experience generated through self-play as its teacher, rewriting itself incrementally from within. According to the paper, within just three days of beginning training it defeated the version that had beaten Lee Sedol (AlphaGo Lee) by 100 games to 0, and after forty days it reached an estimated Elo rating of 5,185, eclipsing every prior version.

The technical heart of the achievement lies in a masterful fusion of reinforcement learning and search. AlphaGo Zero uses a single neural network to predict both the probability distribution over next moves and the win rate from any given position. During each game it performs lookahead via Monte Carlo Tree Search (MCTS), then uses the results of that search as a "superior teacher" to train the network. As the network grows stronger, the search grows sharper; sharper search produces better training data — and this self-reinforcing loop gave rise to superhuman strength without any external scaffold of human knowledge. Notably, AlphaGo Zero independently rediscovered *joseki* (the patterns humans had refined over centuries) and went further still, inventing new joseki that humans had never found.

From a Silicon Valley perspective, the paper's true reach extends far beyond Go. It is a proof of principle: *given a well-defined reward, self-play reinforcement learning alone can surpass human performance*. DeepMind generalized this approach into AlphaZero — which mastered Go, chess, and shogi with the same algorithm — and then into MuZero, which learns without even being given the rules of the game. The spirit of "transcendence through self-improvement" recurs, in various forms, throughout the second half of this essay. Anthropic's Constitutional AI, in which an AI generates its own feedback to reduce harmful outputs, and OpenAI's o1, which generates chains of reasoning and refines them through reward signals, both carry AlphaGo Zero's DNA. Reinforcement learning is the first thread running through everything that follows.

Highly accurate protein structure prediction with AlphaFold (2021, DeepMind) — "Biology's 50-year grand challenge" solved by AI

Another milestone demonstrated by DeepMind is the paper "Highly accurate protein structure prediction with AlphaFold," published in *Nature* in 2021. Unlike a game such as Go, this achievement carries an entirely different historical significance — AI had solved a 50-year-old unsolved problem in biology itself. The weight of that accomplishment is reflected in the 2024 Nobel Prize in Chemistry awarded to DeepMind's Demis Hassabis and John Jumper (half the prize went to David Baker of the University of Washington for computational design of novel proteins).

What made the problem so difficult in the first place? A protein is a "string" of 20 types of amino acids linked in a chain, but that string folds almost instantaneously inside a cell into a complex three-dimensional structure — and that shape directly determines its function. Enzymes, antibodies, muscles: shape gives rise to function. Yet predicting the final three-dimensional structure from an amino acid sequence — the "protein folding problem" — involves astronomically large combinations of possibilities, and since the Nobel Prize acknowledged this challenge in 1972, it had been considered the greatest unsolved problem in biology for half a century. With conventional methods such as X-ray crystallography, determining a single structure could take months to years and enormous expense.

The breakthrough of AlphaFold2 lies in a novel neural network architecture called Evoformer. It takes two types of information — a collection of sequences of similar proteins accumulated through evolution (multiple sequence alignment, MSA) and a table of pairwise distance relationships between amino acids — and refines them through repeated passes using an attention mechanism (the same Transformer ideas from the previous chapter apply here), ultimately outputting three-dimensional coordinates in one shot. The key innovation was a geometric trick: correcting the relationship between two amino acids using the consistency of a "triangle" passing through a third amino acid. At CASP14, the 2020 world championship for protein structure prediction, AlphaFold2 achieved a median GDT score of 92.4 — virtually indistinguishable from experimentally determined structures on a 100-point scale — crushing all other competitors and earning the verdict that "the problem has essentially been solved."

What sets this paper apart from ordinary technical achievements is the sheer scale of its subsequent social impact. DeepMind released the predicted structures freely to the world; the AlphaFold Protein Structure Database now contains approximately 200 million structures covering nearly all known proteins and is used by more than two million researchers in over 190 countries. The underlying "assumptions" across every corner of the life sciences — drug discovery, enzyme design, antibiotic resistance, malaria research — have been transformed. As a researcher in Silicon Valley, what I want to emphasize most strongly is that AlphaFold delivered the clearest possible demonstration that "AI is not merely a toy that manipulates language, but a tool capable of solving the unsolved hard problems of natural science." The fact that Hassabis used AlphaFold as a springboard to found the drug discovery company Isomorphic Labs, and in 2024 advanced to AlphaFold 3 — which predicts not just proteins but complexes involving DNA, RNA, and small molecules as well — speaks to the extraordinary breadth of that vision.

Scaling Laws for Neural Language Models (2020, OpenAI) — Turning "bigger means smarter" into a law

Now we move to Act Two: the story of OpenAI. In January 2020, Jared Kaplan and colleagues at OpenAI published a paper titled "Scaling Laws for Neural Language Models" — ostensibly unremarkable, yet one that would decisively shape the strategy of modern AI. In a single sentence, its claim was this: "The intelligence of a language model (measured by the smallness of its prediction error) improves continuously according to a surprisingly clean 'power law' with respect to model size, data volume, and compute."

What makes this discovery so remarkable? Research and development is ordinarily a gamble: you don't know what will happen until you try. Yet Kaplan and colleagues trained more than 200 models across a wide range of sizes, plotted their performance, and found that the points fell nearly on a straight line, with some trends holding across more than seven orders of magnitude (a straight line on a log-log graph being the signature of a power law). In other words, from experiments on small models, one could predict in advance the performance of a much larger model not yet built. Like a weather forecast, one could estimate: "Invest this much compute, and the model will become this much smarter." This also became a tool for business decisions that justified enormous investments.
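The "straight line on a log-log graph" idea can be made concrete with a small sketch. It assumes synthetic data generated from a made-up law, loss = 10 * N^(-0.08); the constants are illustrative only and are not taken from the paper. The fit is an ordinary least-squares line through the logarithms of the points.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the straight line in log-log space is the power-law exponent
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
             / sum((x - mean_x) ** 2 for x in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Synthetic "small model" results from a hypothetical law (not real measurements)
sizes = [1e5, 1e6, 1e7, 1e8]
losses = [10.0 * n ** -0.08 for n in sizes]

a, b = fit_power_law(sizes, losses)

# Extrapolate to a model 100x larger than any that was "trained"
predicted_loss_at_1e10 = a * 1e10 ** b
```

The fit recovers the exponent from the small models alone, and the same two numbers then give a forecast for a model that has not been built. Real measurements are noisy and the law eventually bends, so such an extrapolation is an estimate rather than a guarantee.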

The concrete implications were equally striking. The paper suggested that, to make the most efficient use of a given compute budget, resources should be directed toward making the model larger rather than increasing data volume (the optimal parameter count should scale as roughly the 0.73 power of compute, while data should scale as the 0.27 power). It further argued that "larger models learn more from less data — they are more sample-efficient." This message of "when in doubt, scale up" directly encouraged the bet on GPT-3, then the largest model ever built. GPT-3, discussed in the next chapter, was the first grand empirical test of these scaling laws.

There is, however, an honest postscript that intellectual integrity demands be added. In 2022, Hoffmann and colleagues at DeepMind published research known as "Chinchilla," arguing that Kaplan et al.'s optimal allocation was skewed. For a given compute budget, they contended, the most efficient approach is to scale parameters and data in roughly equal proportion (each scaling as approximately the 0.5 power of compute), meaning that GPT-3 and the giant models of that era were "too large and trained on too little data." Indeed, the 70-billion-parameter Chinchilla outperformed Gopher, which was four times larger at 280 billion parameters. Later analyses traced the discrepancy primarily to Kaplan et al.'s practice of counting parameters while excluding embedding layers, as well as to differences in learning rate settings. Scaling laws are not a monolithic truth but a body of knowledge refined through successive corrections, and it is precisely that process of self-correction that I believe attests to the health of this field.
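A short worked example shows how different the two recommendations are in practice. It uses only the exponents quoted above (0.73 and 0.27 for Kaplan et al., roughly 0.5 and 0.5 for Chinchilla); the 100x compute multiplier and the function name are arbitrary choices for illustration.

```python
def scale_up(compute_multiplier, param_exponent, data_exponent):
    """Growth factors for parameters and data when compute grows by compute_multiplier."""
    return compute_multiplier ** param_exponent, compute_multiplier ** data_exponent

# Suppose the compute budget grows 100x
kaplan_params, kaplan_data = scale_up(100, 0.73, 0.27)
chinchilla_params, chinchilla_data = scale_up(100, 0.5, 0.5)
```

Under the Kaplan-style exponents, 100x more compute goes into a model about 29 times larger trained on only about 3.5 times more data. Under the Chinchilla-style exponents, the same budget yields a model 10 times larger trained on 10 times more data. In both cases the two factors multiply back to 100, since the exponents sum to one.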

Language Models are Few-Shot Learners (2020, OpenAI) — The Giant That Learns "Just by Being Shown a Few Examples"

The theory of scaling laws was demonstrated to the world in a jaw-dropping fashion by the 2020 paper introducing GPT-3: "Language Models are Few-Shot Learners." Awarded Best Paper at NeurIPS 2020, this research showed that a massive language model with what was then an astronomical 175 billion parameters (ten times larger than any previous non-sparse model) could acquire capabilities no one had anticipated.

That capability is the second throughline of this article: in-context learning. Let me explain with an example. In conventional machine learning, if you want a model to translate, you have to put it through additional training (fine-tuning) on translation data. GPT-3 was different. Simply by writing a few examples in the prompt, such as "sea otter → loutre de mer, cheese → fromage," and then ending with "dog →", the model would complete it with "chien" without any additional training. Without updating a single weight, it merely read the given context and figured out on the spot: "Ah, this is an English-to-French translation task." The paper evaluated this systematically across three tiers: "zero-shot," where no examples are shown; "one-shot," where a single example is shown; and "few-shot," where 10 to 100 examples are shown.
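The three tiers differ only in how the prompt text is assembled, which a small sketch makes plain. The helper name `build_prompt` and the line-per-example layout are assumptions for illustration; the paper's exact prompt templates vary by task.

```python
def build_prompt(examples, query, arrow=" → "):
    """Join demonstration pairs and an unanswered query into one prompt string."""
    lines = [f"{source}{arrow}{target}" for source, target in examples]
    # The final line is left open for the model to complete
    lines.append(f"{query}{arrow}".rstrip())
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt([], "dog")        # no examples
one_shot = build_prompt(demos[:1], "dog")  # a single example
few_shot = build_prompt(demos, "dog")      # several examples
```

Nothing about the model changes between the three calls. The only difference is how many worked examples precede the final open line, which is exactly what "learning without updating a single weight" means.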

The tricks GPT-3 demonstrated were wide-ranging. Beyond translation, question answering, and fill-in-the-blank, it handled tasks requiring on-the-fly reasoning: solving word anagrams, using newly coined words in sentences, performing three-digit addition. Nobody had explicitly "taught it arithmetic," yet through reading vast amounts of text, it had internalized the regularities of arithmetic on its own. This phenomenon, in which scaling up causes abilities that were never trained to suddenly appear, and which was later dubbed emergence, was the greatest shock GPT-3 delivered to the research community.

Looking back from a Silicon Valley perspective, GPT-3 was also a paper that dissolved the boundary between "research" and "product." The idea of offering one general-purpose model through an API led directly to ChatGPT, and with the release of ChatGPT at the end of 2022, generative AI became a mainstream societal phenomenon. At the same time, GPT-3 left two open questions for the latter half of this article. First: "Why does in-context learning occur, and what is its internal mechanism?" This is answered in subsequent chapters by Anthropic's interpretability research. Second: "What happens when you scale the 'handful' of few-shot examples to 'hundreds'?" This leads into the chapter on many-shot learning. GPT-3 was an answer, and at the same time, a vast treasure trove of questions.

Learning to Reason with LLMs (2024, OpenAI) — "Think Before Answering" Opened a New Axis of Scaling

As the third and final entry of Act Two, I want to discuss the technical report "Learning to Reason with LLMs," which introduced the reasoning model o1 announced by OpenAI in September 2024. This report added an entirely new dimension to the prevailing scaling wisdom of "make the model bigger, increase training compute, and it gets smarter." The new dimension: "the longer you let the model think before answering (increasing inference-time compute), the smarter it gets."

Consider an intuitive example. When a human tackles a difficult math problem, the success rate is dramatically different between answering reflexively on the spot versus spending ten minutes working through intermediate steps on paper. Conventional language models were, in effect, answering every question reflexively. What o1 did was allow the model to develop a long internal "chain of thought" before producing an answer — forming hypotheses, checking calculations, noticing mistakes, and changing course. Moreover, rather than training this reasoning style by having the model imitate human-written examples, large-scale reinforcement learning was used. The model was given problems to solve, rewarded for sound reasoning paths, and left to discover on its own how to think "productively." Note here as well the lineage of "self-improving reinforcement learning" that traces back to AlphaGo Zero.

The results were dramatic. On AIME 2024, a qualifying exam for the USA Mathematical Olympiad, the previous-generation GPT-4o solved an average of only 12% of problems (1.8 out of 15), whereas o1 achieved 74% with a single attempt, 83% with a majority vote across 64 attempts, and 93% when sampling 1,000 times and re-ranking with a learned scoring function. On the competitive programming platform Codeforces, it ranked in the top 11% (89th percentile), and it matched expert performance on PhD-level science questions. The most important graph in the report showed a log-linear relationship: exponentially increasing thinking time (inference-time compute) yields roughly linearly increasing accuracy. Here, for the first time, it was clearly demonstrated that a model can be made smarter along two independent axes: training-time compute and inference-time compute.
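The majority-vote figure is the simplest of these inference-time techniques to picture. The sketch below assumes eight hypothetical sampled answers to one problem, with "204" standing in for the correct answer; it shows only the voting step, not the learned scoring function behind the 93% figure.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among independently sampled attempts."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from 8 independent samples on one problem
samples = ["204", "204", "197", "204", "210", "204", "197", "204"]

consensus = majority_vote(samples)

# Chance that a single randomly chosen attempt is right, if "204" is correct
single_attempt_accuracy = samples.count("204") / len(samples)
```

In this toy case, any single attempt is right only 62.5% of the time, yet the vote returns the right answer because the wrong attempts disagree with one another. Spending more compute at inference time, here by sampling more attempts, buys accuracy without touching the model's weights.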

As a researcher, I want to emphasize two points about the significance of this paper. First, at a time when the industry was increasingly anxious that "training data is running out and scaling may be hitting a ceiling" in the post-Chinchilla era, o1 opened up an entirely new avenue for growth in the form of "inference-time compute." This reshaped both the logic of fundraising and the demand for semiconductors. Second, the lineage of o1 was inherited by subsequent reasoning models such as o3, and as of 2026, flagship models from every major lab are designed with "thinking" as a baseline assumption. Anthropic's Claude Opus 4.8, discussed later, and OpenAI's GPT-5.5 both inhabit this world of "inference-time scaling." What the second act of OpenAI sketched out was a richer map of scaling — one where scale does not move in a single direction, but along multiple axes.

A Mathematical Framework for Transformer Circuits (2021, Anthropic) — Deciphering the Black Box as "Circuits"

Here begins Act Three, the story of Anthropic. Anthropic was founded in 2021 by researchers who had led GPT-3 and the scaling laws at OpenAI — including siblings Dario Amodei and Daniela Amodei, and Jared Kaplan, lead author of the scaling laws paper — united by the conviction that models must first be understood and made safe before capabilities are blindly scaled up. The purest expression of that philosophy is "A Mathematical Framework for Transformer Circuits," published in December 2021.

Let me explain the paper's core concern with an analogy. A large language model is a mass of hundreds of billions of numbers: you feed in an input, an output comes out, but no one knows what is happening inside — it is a massive black box. What lead author Nelson Elhage and his colleagues set out to do was to reverse-engineer that black box into human-interpretable "circuits" — much like disassembling a compiled program back into readable source code. This field is called mechanistic interpretability, and Anthropic became its standard-bearer.

Rather than tackling real large-scale models, the paper began by thoroughly dissecting extremely small toy models — attention-only models with zero, one, and two layers. The conceptual lens it introduced is elegant. Inside a Transformer there is a shared communication channel called the "residual stream," from which each attention head reads information and writes back its results — functioning, in effect, as an internal bulletin board. Each individual attention head's behavior, the paper showed, can be decomposed into two circuits: one that decides which tokens to attend to (the QK circuit), and one that decides what to read from those tokens and write back (the OV circuit). The black box began to look like a combination of interpretable components.

The paper's most important discovery is the "induction head." This is a circuit that first appears in two-layer models, and it operates like copy-and-paste: "if the model just saw a pattern 'A followed by B,' then when A appears again, predict B." It sounds unassuming, but this is a leading candidate for the true mechanism behind the in-context learning that GPT-3 demonstrated in the previous act. Indeed, in follow-up research in 2022, Anthropic showed that the moment induction heads form inside a model coincides precisely with the moment in-context learning ability emerges. In this way, the present chapter delivers on a planted thread: the strange phenomenon that OpenAI "discovered" in Act Two receives a "mechanistic explanation" from Anthropic in Act Three. This paper marks the turning point where the story of capability folds back into a story of understanding.
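The copy-and-paste behavior of an induction head can be written down as an ordinary function. This is a sketch of the input-output rule only, under the assumption of exact token matching; in a real model the same behavior arises from attention heads in two layers working together, not from an explicit loop.

```python
def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current token. Returns None if there is none."""
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the current one
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# "A followed by B" was seen earlier, and A appears again at the end
context = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
prediction = induction_predict(context)
```

Here the function predicts "Dursley," because that is what followed "Mr" the last time it appeared. The rule needs no knowledge of what the tokens mean, which is why it generalizes to patterns the model has never seen in training.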

Constitutional AI: Harmlessness from AI Feedback (2022, Anthropic) — The Invention of a "Constitution" That Lets AI Train AI

Anthropic's second landmark work is "Constitutional AI: Harmlessness from AI Feedback," published in December 2022. This is the foundational training methodology behind Anthropic's subsequent product Claude, and it represented an important shift — both practical and philosophical — demonstrating that "making AI safe does not require humans to continuously label harmful outputs one by one."

Let me explain the background. The standard safety technique used in ChatGPT and similar systems is Reinforcement Learning from Human Feedback (RLHF), in which humans manually perform tens of thousands of harmful/harmless judgments. However, this is costly, raises ethical concerns about exposing human workers to large volumes of harmful content, and the criteria for what counts as harmful remained opaque. Anthropic's question was this: could the standards be given in advance as an explicit "constitution", and could the training itself be delegated to the AI?

The approach consists of two stages. In the first stage (supervised learning), the model is deliberately prompted with harmful questions to elicit problematic responses; the model itself is then asked to self-critique ("this response is problematic in light of principle X in the constitution") and rewrite the answer. The model is then fine-tuned on these revised, harmless responses. In the second stage (reinforcement learning), the model generates pairs of responses, and the AI itself judges which response better conforms to the constitution, producing preference data that is used as a reward signal for further training. Because the reward is generated from AI feedback rather than human labels, this method is called RLAIF (Reinforcement Learning from AI Feedback). The paper's constitution consists of roughly 16 natural-language principles covering perspectives including legality, harmfulness, fairness, and tone; the constitution Anthropic later published for Claude also draws on sources such as the Universal Declaration of Human Rights.
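The data flow of the two stages can be sketched with stand-in functions. Everything here is hypothetical: `toy_model` and `toy_judge` are hard-coded stubs in place of a real language model, the prompt wording is invented, and the actual fine-tuning and reinforcement learning steps are left out. The point is only to show what kind of training record each stage produces.

```python
def critique_and_revise(model, prompt, principle):
    """Stage 1: draft a response, critique it against a principle, rewrite it."""
    draft = model(prompt)
    critique = model(f"Critique this response using the principle '{principle}':\n{draft}")
    revision = model(f"Rewrite the response to address this critique:\n{critique}\n---\n{draft}")
    # The (prompt, revised response) pair becomes a supervised fine-tuning example
    return {"prompt": prompt, "response": revision}

def ai_preference(judge, prompt, response_a, response_b, principle):
    """Stage 2: an AI judge picks the response that better fits the principle."""
    choice = judge(prompt, response_a, response_b, principle)
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    # The preference pair is the raw material for the reward signal
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Toy stand-ins so the sketch runs without any real model
def toy_model(text):
    if text.startswith("Rewrite"):
        return "I can't help with that, but here is why and a safer alternative."
    if text.startswith("Critique"):
        return "The draft gives harmful instructions."
    return "Sure, here is how to do it."

def toy_judge(prompt, response_a, response_b, principle):
    return "A" if "can't help" in response_a else "B"

question = "How do I pick a lock?"
principle = "avoid harmful advice"
sft_example = critique_and_revise(toy_model, question, principle)
pair = ai_preference(toy_judge, question, "Sure, here is how to do it.",
                     sft_example["response"], principle)
```

Note that no human label appears anywhere in the flow. Humans contribute only the principle text, and the model supplies the critiques, revisions, and preference judgments.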

What makes this paper remarkable is that it offers a new solution to the trade-off between safety and helpfulness. With conventional methods, increasing harmlessness tends to push models into over-refusal — declining everything with "I cannot answer that question." Models trained with Constitutional AI, rather than simply going silent in response to harmful requests, became "harmless but not evasive" assistants that engage in dialogue explaining why they cannot comply. From a researcher's perspective, the spirit of "self-improvement" pioneered since AlphaGo Zero is at work here as well — the model critiques its own outputs, revises them, and trains itself on its own preferences. Anthropic later extended this methodology into "Collective Constitutional AI," experimenting with incorporating the views of ordinary citizens into the constitution, venturing into the governance question of who decides AI values and how.

Many-Shot In-Context Learning (2024, DeepMind) and Many-shot Jailbreaking (2024, Anthropic) — The Light and Shadow of In-Context Learning

This chapter covers "many-shot learning," which pushed in-context learning to a new scale in 2024.

Let us first establish the phenomenon itself. The few-shot learning demonstrated by GPT-3 in Act Two involved placing "10 to 100" examples in the prompt. By 2024, however, the context windows (the input length that can be processed at once) offered by the major labs had grown explosively, enabling the handling of hundreds of thousands of tokens. Google DeepMind then conducted a straightforward experiment: what would happen if the number of examples were increased to hundreds or even thousands? The results showed that performance continued to improve significantly across a wide range of tasks, including translation, summarization, and reasoning. Furthermore, to address the problem of human-prepared examples running out, they demonstrated that "Reinforced ICL," which uses the model's own generated chain-of-thought as examples, as well as "Unsupervised ICL," which presents large numbers of problems without even providing answers, could also be effective. Without relying on fine-tuning, simply feeding a large number of examples into the context allows the model to adapt to new tasks.

So what is Anthropic's "many-shot jailbreaking"? This is the dangerous flip side of the same principle. Anthropic's researchers discovered that by stuffing hundreds of rounds of fabricated dialogues, in which "dangerous questions are answered politely," into the prompt of a model that had ostensibly been safety-trained, the model would be pulled along by that context and comply even with harmful requests it should have refused. What is alarming is that the effectiveness increases as a power law with respect to the number of examples, which is precisely the scaling behavior that in-context learning shows in general. Moreover, this attack worked not only on Anthropic's own Claude, but also on models from other developers, including OpenAI and Meta. It is a weighty lesson from safety research: a "convenient feature" in the form of a long context window becomes a new attack surface in its own right.

Reading these two papers side by side reveals the essence of modern AI. In-context learning was discovered with GPT-3 (Act Two), its mechanism was elucidated through Transformer circuits (the induction heads of this act), and with many-shot learning it was confirmed to be "a power-law phenomenon that grows more powerful the more the scale is increased." Just as scaling laws govern the "training" of models, power laws govern "in-context learning" as well. And that same force can be used for both the expansion of capability (DeepMind) and the subversion of safety (Anthropic). It is precisely this dual nature that led Anthropic, an organization that keeps both capability and safety simultaneously in view, to go so far as to publicly disclose the attack method as a warning to the industry.

Scaling Monosemanticity (2024, Anthropic) — Extracting "Units of Meaning" from Production Claude

The closing work of Act Three, and the tenth paper in this series, is "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," published by Anthropic in May 2024. It is a landmark study in which the ambitions of mechanistic interpretability — first planted in the Transformer Circuits chapter — finally came to fruition on Claude 3 Sonnet, a real, large-scale model running in production.

At the heart of the problem lies a troublesome property called superposition. Individual neurons in a neural network do not cleanly correspond to a single concept like "dog" or "sadness" as one might hope. Because the network packs more concepts into its activations than it has neurons, a single neuron responds to many unrelated concepts at once, and this polysemantic condition had been the greatest obstacle to decoding these models. In a 2023 precursor study, "Towards Monosemanticity," Anthropic had shown, using a small model, that a technique called a sparse autoencoder (SAE) could untangle the intertwined activity of neurons into features, each corresponding to a single meaning. The question this paper posed was: "Does this technique scale from toy models to genuinely large ones?"
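To give a feel for what a sparse autoencoder computes, here is a minimal sketch in its generic textbook form: a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The weights are set by hand rather than trained, the dimensions (three features packed into a two-dimensional activation) are a toy assumption that mimics superposition, and the class and parameter names are hypothetical. The loss used in Anthropic's papers differs in its details.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, vector)) for row in matrix]

class SparseAutoencoder:
    """Generic SAE: features = ReLU(W_enc x + b_enc), x_hat = W_dec features + b_dec."""

    def __init__(self, w_enc, b_enc, w_dec, b_dec):
        self.w_enc = w_enc  # shape: n_features x d_model
        self.b_enc = b_enc
        self.w_dec = w_dec  # shape: d_model x n_features
        self.b_dec = b_dec

    def encode(self, x):
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        out = matvec(self.w_dec, features)
        return [o + b for o, b in zip(out, self.b_dec)]

    def loss(self, x, l1_coeff):
        features = self.encode(x)
        x_hat = self.decode(features)
        reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity_penalty = sum(features)  # features are non-negative, so this is the L1 norm
        return reconstruction_error + l1_coeff * sparsity_penalty

# Hand-set toy dictionary (an assumption): three feature directions in a 2-D space.
s = 1.0 / math.sqrt(2.0)
directions = [[1.0, 0.0], [0.0, 1.0], [-s, -s]]
w_enc = directions                                # one encoder row per feature
w_dec = [list(col) for col in zip(*directions)]   # decoder is the transpose
sae = SparseAutoencoder(w_enc, [0.0, 0.0, 0.0], w_dec, [0.0, 0.0])

activation = [2.0, 0.0]
features = sae.encode(activation)
reconstruction = sae.decode(features)
total_loss = sae.loss(activation, l1_coeff=0.1)
print(features, reconstruction, total_loss)
```

In this toy case only one of the three features is active for the input, which is the sparsity the penalty term is meant to encourage. A real SAE learns its dictionary by minimizing this kind of loss over a very large number of recorded activations.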

The answer was yes. Using the principles of dictionary learning, Anthropic succeeded in extracting millions of monosemantic features from the activations of a middle layer of Claude 3 Sonnet. These features proved remarkably abstract, cutting across languages and modalities. For instance, the feature corresponding to "Golden Gate Bridge" responds to that concept whether expressed in English or Japanese, in text or in photographs of the bridge. More importantly, these features did not merely allow observation of the model's internal state: artificially amplifying their activation made it possible to steer its behavior. When the research team clamped the "Golden Gate Bridge feature" far above its normal range, Claude began describing itself as the bridge regardless of what it was asked, weaving every topic back to that structure. A demo was briefly made public under the name "Golden Gate Claude" and drew considerable attention.
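The steering idea can be sketched in a few lines. Assume a feature has a known decoder direction in activation space and a known current activation value. Clamping it to a target value then amounts to adding (target - current) times that direction to the activation vector, leaving everything else untouched. The vectors, the values, and the `clamp_feature` helper below are all hypothetical illustrations, not the procedure or numbers from the paper.

```python
def clamp_feature(activation, current_value, direction, target_value):
    """Shift an activation so one feature's contribution matches target_value.

    Only the component along `direction` changes; the rest of the
    activation vector is left as it was.
    """
    delta = target_value - current_value
    return [a + delta * d for a, d in zip(activation, direction)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical numbers: a 3-D activation and a unit-length feature direction.
activation = [1.0, 0.5, -1.0]
bridge_direction = [0.6, 0.8, 0.0]
current_value = 0.5   # how strongly the feature fires right now
target_value = 5.0    # clamp far above that to exaggerate the concept

steered = clamp_feature(activation, current_value, bridge_direction, target_value)
shift = [s - a for s, a in zip(steered, activation)]
print(steered)
print(dot(shift, bridge_direction))  # change along the feature direction
```

Because the direction has unit length, the printed change along it equals the gap between target and current value, while the third coordinate, which the direction does not touch, stays the same.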

What the researchers consider most significant is the discovery of features directly tied to safety. Anthropic identified features corresponding to exactly the behaviors one would want to monitor: deception, sycophancy (flattery), bias, the synthesis of dangerous materials, and code vulnerabilities. If the internal state of a model "attempting to lie" can be captured as a feature and manipulated, AI safety could advance from "censoring outputs after the fact" to "directly reading and controlling internal intent." The paper is candid about its limitations, however. A feature labeled "Golden Gate Bridge" also fires weakly in contexts that have little or nothing to do with the bridge, and the label fits reliably only where activation is strong. Giving a feature a human-readable name therefore carries the trap of an illusory sense of understanding. Even so, this paper showed that the dream articulated in the Transformer Circuits chapter, "reading the black box as a circuit," can become reality even in state-of-the-art models. Act Three completed the story of capability as a story of understanding and control.

Review of the Overall Flow and Perspectives on What Lies Ahead

Now that I have finished reading all ten papers, let me step back and survey the whole. The three-act story was not a loose collection of independent discoveries, but a single broad river whose currents quote, critique, and carry forward one another. In Act One, Google laid the foundation with the Transformer, and DeepMind demonstrated two fundamental principles: that "self-play reinforcement learning can surpass humans" (AlphaGo Zero), and that "AI can solve hard problems in the natural sciences" (AlphaFold). In Act Two, OpenAI took that foundation, formalized the law that "scale begets intelligence" (the scaling laws), proved it empirically (GPT-3), and opened a new axis of scale — "thinking at inference time" (o1). In Act Three, Anthropic confronted the enormous power that Act Two had unleashed and built a framework of understanding and control: "read what is happening inside as circuits (Transformer Circuits, Monosemanticity), align behavior through the AI's own feedback (Constitutional AI), and look squarely at the dual nature of that power (many-shot)."

Two threads running through this river were resolved with remarkable elegance. Reinforcement learning flowed continuously, shape-shifting from AlphaGo Zero's self-play, to Constitutional AI's RLAIF, to the reinforcement learning that trained o1 to reason, and it installed at the core of modern AI the idea that "a model evaluates its own outputs in order to improve itself." In-context learning was discovered in GPT-3, its mechanism explained by induction heads, extended as a power law in the many-shot work, and visualized as features in Monosemanticity, racing through the ideal scientific cycle of discovery, explanation, extension, and observation in just a few years. And throughout, attention-based architectures descended from the Transformer remained the foundation for nearly everything that followed: not only text, but proteins (AlphaFold's Evoformer) as well. "Attention Is All You Need" turned out to be very nearly literally true.

Watching from inside Silicon Valley, what strikes me most forcefully is that this was not only a "history of papers" but equally a "history of people in motion." The eight authors of the Transformer paper left Google, and their paths became the family tree of the entire industry. Researchers who led the scaling laws and GPT-3 work left OpenAI and founded Anthropic. Those who chase capability and those who interrogate safety came from the same lab, cite each other's papers, and yet plant different flags, and it is precisely this tension that has driven the field's evolution. That tension is now reflected, without distortion, in the capital markets. In May 2026, Anthropic raised $65 billion in a Series H round, reaching a valuation of approximately $965 billion. That put it ahead of its long-time rival OpenAI (most recent raise approximately $122 billion, valuation approximately $852 billion) for the first time and made it the world's most valuable AI startup, with reports indicating it has begun preparations for an IPO. The intellectual quest that began with ten papers is now moving capital on a scale comparable to the GDP of a nation.

So where does the road lead from here? I want to offer three observations. First, the race for "understanding" to catch up with "capability" will intensify in earnest. The interpretability frontier opened by Monosemanticity has illuminated only a tiny fraction of what happens inside these models. Yet the deeper AI penetrates consequential societal decisions, the more valuable it becomes to explain from the inside *why* a given answer was produced — and to detect and control dangerous internal states. Whether the exponential curve of understanding can be made to run alongside the exponential curve of capability is the central question of the next five years. Second, the axes of scaling will continue to multiply. After training-time and inference-time scaling, the next battlefield is the "temporal axis of action" — agents autonomously trying and failing over extended periods. Indeed, Claude Opus 4.8, which appeared in May 2026, is equipped with the ability to run up to 1,000 sub-agents in parallel and is competing with GPT-5.5 on long-horizon task completion. Beyond the "time to think" that o1 opened lies the "time to keep acting."

Third, and most importantly, I want to emphasize that what these ten papers demonstrated is not a "destination" but a "methodology." The courage to place enormous bets on clean power laws, the tenacity to refuse to accept the black box and read it as circuits, the discipline to interrogate safety with the same intensity as capability — even as individual techniques grow obsolete, this methodology will go on producing the next ten papers, and the hundred after that. What was passed from DeepMind to OpenAI to Anthropic was not a specific architecture or a set of equations, but an *orientation*: confronting the deepest mysteries of nature and intelligence head-on, with computation as the tool. The next landmark paper destined to change the history of AI is being written somewhere in a lab right now. Trace its headwaters, and you will surely find your way back to these ten.