By a senior engineer who watched the hype, the failure modes, and the real architectural shifts.


The story of AI in production starts long before 2022 — decades of neural network research, and the deep learning boom of the 2010s, when cheap GPUs and massive labeled datasets finally made those networks practical at scale. But the real turning point came in 2018 with BERT (Bidirectional Encoder Representations from Transformers, Google's breakthrough that could read text and understand context but couldn't write new text). BERT made transfer learning practical: take a model Google had already pre-trained on billions of words, fine‑tune it on your own labeled data, deploy it as a microservice. For the first time, you could ship useful AI without owning a research team.

What "without a research team" actually meant: if you wanted production NLP before BERT, your roster looked like this:

  • 2–5 ML PhDs who could design neural network architectures (CNNs, RNNs, LSTMs) deeply enough to invent new ones.
  • People who could read fresh arXiv papers and decide what was worth trying.
  • Engineers writing custom training loops and loss functions in early TensorFlow / Theano / early PyTorch (no high‑level abstractions yet).
  • Data labelers producing hundreds of thousands of labeled examples per task.
  • ML infra engineers running multi‑week training jobs on GPU clusters.

Total cost: easily $1–3M per year in salaries plus compute. Only Google, Meta, Microsoft, IBM, Amazon, and a handful of others had this. Smaller companies basically couldn't ship modern NLP: they fell back on keyword matching, regex, hand‑coded rules, or just didn't try.

Google did the expensive part once: pre‑trained BERT on billions of words from Wikipedia and BookCorpus, and released the weights publicly. After that you downloaded BERT, added a tiny task‑specific layer on top (a "classification head"), fine‑tuned on your own much smaller labeled dataset — maybe a few thousand examples instead of half a million — and deployed the result as a microservice. Fine‑tuning ran in a few hours on a single GPU instead of weeks on a cluster. You didn't need to design the architecture or understand the math behind why it worked — you just needed to call the library.

| Sentiment classifier for customer reviews | Pre‑2018 (no BERT) | Post‑2018 (with BERT) |
| --- | --- | --- |
| Team needed | 3–5 ML researchers + engineers | 1 backend engineer |
| Labeled data | 200K–500K reviews | ~5,000 reviews |
| Compute | GPU cluster, weeks of training | Single GPU, hours of fine‑tuning |
| Time to first deploy | 3–6 months | 1–2 weeks |
| Cost (incl. salaries) | $500K–$1M+ | $5K–$20K |

That is roughly 50× to 100× cheaper and about 10× faster. It was the moment ML stopped being a FAANG privilege and became something a single backend engineer at a mid‑size startup could ship in a week or two.

To understand why BERT was such a leap, you need to know what the Transformer actually changed. Before 2017, neural networks for text read one word at a time — models called RNNs and LSTMs that struggled with long sentences and couldn't use GPUs efficiently. The 2017 paper Attention Is All You Need introduced the Transformer: a model that processes entire sequences in parallel and uses self‑attention to weigh how every word relates to every other word in the input. That unlocked context, long‑range dependencies, and nuance at a level older models couldn't touch. Every major model in this post — BERT, GPT‑3, GPT‑4, Claude, LLaMA — is built on this architecture.
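To make "weigh how every word relates to every other word" concrete, here is a minimal sketch of scaled dot‑product self‑attention in plain Python. It is a simplification under stated assumptions: a real Transformer first multiplies each token vector by learned query, key, and value matrices and runs many heads in parallel, while this sketch uses the raw token vectors for all three roles. The function names are illustrative.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy "token" vectors; learned projections are omitted (assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each output row mixes information from the whole sequence, and the rows are independent of one another, which is why the computation parallelizes well on GPUs.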

Two years later came GPT‑3 (Generative Pre‑trained Transformer 3), OpenAI's 2020 release. It had 175 billion parameters (the learned weights a neural network tunes during training, a rough proxy for how much pattern knowledge the model can absorb) and was the first model that could generate remarkably fluent and coherent text at scale. It hinted at something bigger, but was too brittle for production: you had to format few‑shot examples in every prompt, and the slightest change (a missing colon, an extra line break) could change the output unpredictably.

For most of 2022, putting AI into production still meant BERT‑style classification — and praying your real‑world data didn't drift from your training set. These models were workhorses for sentiment analysis, spam detection, and intent classification — but they had no generative ability. Generative AI remained a research curiosity.

Over the next four years, that changed completely. Here's the progression as I lived it, one year at a time.


2022 – The Year the Terminal Started Talking Back

Two breakthroughs in 2022 changed everything — both built on the same GPT‑3 foundation.

First, GitHub Copilot reached general availability in June. Copilot was powered by Codex, a version of GPT‑3 fine‑tuned on billions of lines of public code.

Raw GPT‑3 needed fragile few‑shot prompts to do anything useful. Because Codex had already seen so much code during fine‑tuning, it could complete a function call or suggest a loop with no example prompts at all. For this one specialized domain, the brittleness of few‑shot prompting vanished.

The raw Codex model had no idea about your cursor position, your open files, or your recent edits. GitHub's R&D team built the context assembly system — when you paused typing, the extension looked at the code before and after your cursor, other open files, and even the filename to assemble a rich prompt. They designed the inline ghost text suggestions that felt like a natural extension of the developer. That context awareness was the real innovation. Copilot wasn't generating code in a vacuum; it was reasoning about your current abstraction.

Then November brought ChatGPT. It wasn't just a better model – it was a new interaction paradigm. Reinforcement Learning from Human Feedback (RLHF) turned a raw next‑token predictor into something that followed instructions, admitted mistakes, and — crucially — stayed on task across multiple turns. The interface changed everything: a chat window replaced the prompt engineering notebook.

What RLHF actually changed: GPT‑3 was trained to predict the next word. That's it. It had no notion of "helpful" or "harmful." RLHF added further training on top: humans ranked the model's outputs from best to worst, a reward model learned to predict those rankings, and the language model was then tuned with reinforcement learning to produce outputs the reward model scored highly. The result was a model that didn't just complete text; it tried to be useful.
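As a rough illustration of how a reward model can learn from rankings, the sketch below shows the pairwise preference loss commonly used for this purpose. The post does not say which loss was used, so treat the formula as an assumption; the function name and the example scores are hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected)), written as log(1 + e^(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

# The loss is small when the reward model scores the human-preferred answer higher...
good_ranking = preference_loss(2.0, 0.0)
# ...and large when it scores the rejected answer higher.
bad_ranking = preference_loss(0.0, 2.0)
```

Minimizing this loss over many human comparisons pushes the reward model to agree with the rankers; the language model is then optimized against that learned score.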

But RLHF had a side effect no one anticipated at the time. By training the model to sound helpful and confident, it also became confident when it was wrong. The industry would later call this hallucination — the model stating something false with complete certainty, no hesitation, no disclaimer. ChatGPT could fabricate citations, invent API methods that didn't exist, and present made‑up statistics as fact. It didn't "know" it was lying — it was just generating the most plausible next words. This wasn't a bug that could be patched. It was baked into how language models worked. The hallucination problem would haunt every generation of AI applications that followed.

2022 was a wake‑up call. Suddenly, the tools we used every day — terminals, editors, logs — could have an AI layer. But no one knew how to build that yet. We had two working examples (Copilot and ChatGPT), but both were built by OpenAI and GitHub — not by us. The path forward was unclear.


2023 – The Open Source Explosion and the RAG Paradigm

That uncertainty didn't last long. In early 2023, the floodgates opened — and it was a direct response to the gaps left by 2022.

Gap #1: Access. OpenAI's models were powerful but proprietary, expensive per token, and required sending your data to a third party. That changed when Meta released LLaMA (Large Language Model Meta AI) in February 2023. Within weeks, the open source community had fine‑tuned LLaMA into instruction‑following models that ran on a single GPU. Everyone quickly learned the difference between a base model (raw text predictor that just rambles) and an instruct model (fine‑tuned to actually follow instructions). You could host your own LLM without paying OpenAI per token.

| | Base Model | Instruct Model |
| --- | --- | --- |
| Training | Predict next token on raw text | Additional RLHF / fine‑tuning phase |
| Behavior | Completes text (often rambles) | Follows instructions, stays on task |
| Example | LLaMA base | LLaMA‑Chat, Alpaca, Vicuna |
| Use case | Research, further fine‑tuning | Production applications |

Gap #2: Knowledge. Both ChatGPT and Copilot were brilliant, but they only knew what they were trained on. If you wanted an LLM to answer questions about your internal documents, you had two bad options: fine‑tuning (expensive, slow, and outdated as soon as you changed a page) or cramming everything into the prompt (impossible once your documents exceeded a few thousand words).

That's where RAG — Retrieval‑Augmented Generation — came in.

RAG solved the problem by decoupling knowledge from reasoning. The idea was simple: first, search your documents for the most relevant ones. Then, paste those documents into the prompt. Finally, ask the LLM to answer based only on that context.

This small shift had huge implications:

  • Updating knowledge became a database change, not a retraining effort. Example: Your company's vacation policy changes from 10 days to 15 days. With fine‑tuning, you'd have to retrain the entire model (costly, slow). With RAG, you simply delete the old document from your vector database and add the new one. Next time an employee asks "How many vacation days do I get?" the search finds the new document, and the LLM answers correctly. No retraining needed.

  • You could cite sources – the LLM's answers became verifiable. Example: A user asks "What's the return policy for electronics?" RAG retrieves a document titled "Electronics Return Policy – 2024" and includes it in the prompt. The LLM answers: "You can return electronics within 30 days." Now you can also show the user: "Source: Electronics Return Policy, section 2." If the answer is wrong, you can check the source and fix it – the LLM is no longer a black box.

  • Search engines for internal documents suddenly became a must‑have. Example: Before RAG, most companies didn't have a good way to search their internal wikis, Slack archives, or ticket systems. After RAG, they realized: if the search step returns irrelevant documents, the LLM will give bad answers. So companies had to build or buy vector search engines (like Pinecone, Weaviate, Qdrant, Chroma) – specialized internal search tools that find documents by meaning, not just keywords.
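The search, paste, ask sequence described above can be sketched in a few lines. This is a toy under stated assumptions: word overlap stands in for real vector search, the documents are made up for illustration, and the final LLM call is left out so the sketch stops at the assembled prompt. All function names are hypothetical.

```python
import re

def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=2):
    """Rank documents by shared words with the query (a stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d["text"])), reverse=True)
    return ranked[:top_k]

def build_prompt(query, retrieved):
    """Paste the retrieved documents into the prompt and restrict the model to them."""
    context = "\n\n".join(f"[{d['title']}]\n{d['text']}" for d in retrieved)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Illustrative documents; updating knowledge means editing this list, not retraining.
docs = [
    {"title": "Vacation Policy", "text": "Employees get 15 vacation days per year."},
    {"title": "Electronics Return Policy", "text": "Electronics can be returned within 30 days."},
    {"title": "Office Hours", "text": "The office is open from 9 to 5."},
]
question = "How many vacation days do I get?"
prompt = build_prompt(question, retrieve(question, docs, top_k=1))
```

Because each retrieved document carries its title into the prompt, the application can show that title back to the user as the source of the answer.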

Once developers realized that RAG needed a fast, meaning‑based search engine, dozens of startups and open‑source projects rushed to build vector databases. Under the hood, most of them used an index structure called HNSW (Hierarchical Navigable Small World): a layered graph that lets a query hop toward its nearest neighbors instead of scanning every vector, so searching millions of vectors takes only milliseconds. Exact search compares the query against every stored vector, which is fine for thousands of vectors but too slow for millions under interactive latency budgets. HNSW is approximate, trading a small amount of recall for that speed, and it was the key to large‑scale retrieval.
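For contrast with an index like HNSW, here is what exact search looks like: a linear scan that scores the query against every stored vector. The sketch uses cosine similarity, which is an assumption (the post does not name a distance metric), and the function names are illustrative. The work per query grows in direct proportion to the number of vectors, which is the cost an approximate index avoids.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_search(query, vectors, top_k=3):
    """Brute-force nearest neighbors: one similarity computation per stored vector."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]

stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = exact_search([1.0, 0.0], stored)
```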

So where did RAG end up by late 2023? Two things happened.

First, RAG became the default architecture for any LLM application that needed private or up‑to‑date knowledge. Chatbot over company docs? RAG. Support assistant over past tickets? RAG. Search engine over internal wikis? RAG.

Second, the industry learned a hard lesson: RAG was easy to prototype and hard to productionize. Most teams built a demo in a weekend using two tools:

LangChain: The "Glue" for LLM Apps

LangChain is an open-source framework that acts like "glue," connecting different parts of an AI application. It provides standard, reusable pieces (called "abstractions") that let you easily link LLMs with other components, like vector databases or web search APIs. Think of it as a set of Lego blocks for building LLM apps.

Building a RAG system involves a specific chain of steps: taking a user's query, searching a vector database for relevant documents, and then having an LLM generate an answer based on those documents. LangChain provided the pre-built code to handle this whole sequence, allowing developers to focus on the unique parts of their application rather than reinventing the basic mechanics.

Chroma: The Lightweight Vector Database

Chroma (often called ChromaDB) is an open-source, lightweight vector database that was purpose-built for AI applications. It is designed to be incredibly easy to set up and use, especially for local development. Its main job is to store and search "embeddings" (the numerical representations of your documents).

When building a demo over a weekend, you want tools that are simple and quick to get running. Chroma's lightweight nature and seamless integration with LangChain made it the perfect choice for the storage and retrieval step of a prototype RAG pipeline.

These two tools represented the classic "fast and easy" path to building a RAG prototype. But when teams tried to scale those prototypes into production systems, they hit three walls:

| Wall | Problem | Why It Hurt |
| --- | --- | --- |
| Latency | Vector DB + LLM + reranking | 2–5 second response times |
| Cost | Millions of vectors + per‑call LLM fees | Bills scaled faster than usage |
| Quality | Bad chunking → bad retrieval → hallucinations | The LLM confidently made things up |

A note on hallucinations: Remember the hallucination problem from 2022? RAG reduced it by grounding answers in real documents, but it didn't eliminate it. If the retrieval step returned the wrong documents, the LLM would hallucinate from those wrong documents — now with citations that looked trustworthy. This problem never fully went away.

By December 2023, the landscape had settled: RAG was a proven pattern, but not a solved problem. The vector databases had consolidated (Pinecone for managed, Chroma for local, Weaviate for hybrid search). LangChain was widely used but increasingly criticized as over‑abstracted. And every serious team realized that retrieval quality — how you chunk documents (splitting them into small, meaningful pieces), how you embed them (turning text into meaning‑capturing numbers), and whether you rerank (a second pass to fix bad search results) — mattered more than which LLM you chose.

The wake‑up call of 2022 had found its answer: open source instruct models + RAG + vector databases became the stack. But the hangover of 2023 was that making it fast, cheap, and reliable was still an art, not a science.


2024 – The Year AI Got Boring (and That Was Great)

At the end of 2023, the industry had a working stack: open source instruct models + RAG + vector databases. But making it fast, cheap, and reliable was still an art. Prototypes worked; production systems struggled. 2024 was when AI stopped being magic and started being engineering. The wild experimentation gave way to standardization, cost optimization, and practical integration.

Three fundamental shifts happened.

The cost collapse

Remember the three production walls from 2023: latency, cost, and quality? Cost was the first to crumble. GPT‑3.5‑Turbo became 90% cheaper. Open‑weight models closed the gap as well: Mixtral matched GPT‑3.5 on most benchmarks and Llama 3 came within reach of GPT‑4 on many tasks, at roughly one‑tenth of the cost to run.

The real game changer was quantization. Normally, a model stores its weights as 16‑bit or 32‑bit floating‑point numbers. Quantization rounds those numbers to lower precision (8‑bit or 4‑bit). Think of it like a high‑resolution photo versus a compressed thumbnail – you lose some detail, but the file size shrinks dramatically. With quantization, a model that once needed an expensive data‑center GPU could run on a consumer GPU or, for smaller models, a regular CPU.
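A minimal sketch of the rounding idea, assuming the simplest scheme (symmetric 8‑bit quantization with one scale for the whole tensor). Production quantizers typically work per channel or per block and use more elaborate 4‑bit formats; the weights below are made‑up values.

```python
def quantize_int8(weights):
    """Map floats onto integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The "lost detail": how far the restored values drift from the originals.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of two or four, and the rounding error per weight is bounded by half the scale.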

The economics flipped overnight. AI went from a scary experimental cost (finance asking "why did we spend $10K on API calls?") to a normal, predictable operating expense – like paying for cloud servers or database licenses.

Function calling became a standard

Before function calling, an LLM could only talk. You asked "What's the weather in Berlin?" and it would say "I would look that up for you" – but it couldn't actually do it. You had to write extra code to guess what the model meant and then call the weather API yourself. It was messy and unreliable. OpenAI shipped function calling in mid‑2023; by 2024 every major provider supported some version of it.

Function calling fixed this by giving the model a menu of actions. You told it: "Here are the tools you have – for example, get_weather(city). If the user asks about weather, just tell me the city." The model then responded with a structured JSON request like { "tool": "get_weather", "city": "Berlin" } – no extra words, no guessing. Your code ran that request automatically.

That's it. The model stopped describing what it would do and started telling you exactly what to execute. You still had to validate the arguments before running anything, but the guesswork was gone.
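The application side of this can be sketched as a small dispatcher. The JSON shape below mirrors the simplified example in this post, not any provider's actual wire format, and `get_weather` is a stub rather than a real API call; both are assumptions for illustration.

```python
import json

def get_weather(city):
    # Stub: a real implementation would call a weather API here.
    return f"(stub) forecast for {city}"

# The "menu of actions" the model is allowed to request.
TOOLS = {"get_weather": get_weather}

def run_tool_call(raw_model_output):
    """Parse the model's structured request and call the matching function."""
    request = json.loads(raw_model_output)
    name = request.pop("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    # Remaining keys are passed as keyword arguments to the tool.
    return TOOLS[name](**request)

result = run_tool_call('{"tool": "get_weather", "city": "Berlin"}')
```

Rejecting unknown tool names up front is the minimum validation; a production dispatcher would also check argument types against a schema.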

AI moved into the CI/CD pipeline

Teams started using LLMs for practical tasks:

  • Summarize what a pull request actually does.
  • Auto‑label and route issues.
  • Suggest fixes for flaky tests.

The key insight: AI didn't need to be perfect. A 70% accurate suggestion that saved an engineer ten minutes was a huge win. Unlike 2023's obsession with 95% accuracy, "good enough" was now acceptable.

Meanwhile, RAG matured

Teams attacked the quality wall with three improvements:

  • Reranking – a second model reordered search results to push relevant ones higher.
  • Hybrid search – combined keyword matching (exact words) with vector search (meaning).
  • Query rewriting – the LLM cleaned up messy user questions before searching.

A typical pipeline in 2024 looked like:

User query → rewrite → hybrid search (top‑20) → rerank (top‑5) → LLM → answer with citations

The best RAG systems stopped trying to be smart. They became fast, accurate document retrievers with a chat wrapper.
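One piece of that pipeline worth seeing in code is the hybrid search step, where a keyword ranking and a vector ranking have to be merged into one list. The post does not say how teams merged them; reciprocal rank fusion (RRF) is one widely used option and is shown here as an assumption. The document IDs are made up, and k=60 is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # exact-word matches, best first
vector_hits = ["doc_b", "doc_d", "doc_a"]    # meaning-based matches, best first
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that rank well in both lists rise to the top, which is the behavior hybrid search is after. The fused list would then go to the reranker.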

So where did 2024 leave us? A single engineer could ship something useful in an afternoon with just an API key and a vector database. Everyone realized that clean data mattered more than the model – garbage in, garbage out.

The art of 2023 became the boring reality of 2024: standard patterns, predictable costs, CI pipelines that just worked. And boring was great.


2025 – The Agent Era Begins

At the end of 2024, AI was reliable and cheap, but still turn‑based: you asked, it answered, then stopped. You couldn't say "fix the staging server" and let it work.

2025 changed that. Agents ran in a loop: observe → plan → act → repeat until done. A RAG system answered one question. An agent pursued a goal.
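The loop itself is small. The sketch below assumes the planner is a callable that stands in for an LLM call and returns either a tool request or a "finish" action; that interface, the tool name, and the scripted planner are all hypothetical.

```python
def run_agent(goal, plan_step, tools, max_steps=10):
    """Observe -> plan -> act loop. `plan_step` stands in for the LLM call."""
    history = []
    for _ in range(max_steps):
        action = plan_step(goal, history)                      # plan from goal + observations
        if action["tool"] == "finish":
            return action["answer"], history
        observation = tools[action["tool"]](**action["args"])  # act
        history.append({"action": action, "observation": observation})  # observe
    raise RuntimeError("step limit reached before the goal was met")

def scripted_planner(goal, history):
    # Hypothetical planner: check CPU once, then report what was observed.
    if not history:
        return {"tool": "check_cpu", "args": {"host": "staging"}}
    return {"tool": "finish", "answer": f"CPU at {history[-1]['observation']}%"}

answer, steps = run_agent(
    "report staging CPU", scripted_planner, {"check_cpu": lambda host: 42}
)
```

The `max_steps` cap matters: without it, a planner that never emits "finish" loops forever.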

Making agents work required three things.

1. Hybrid memory. Agents needed to remember what they tried earlier (episodic memory), track current plans (working memory), and still access long‑term facts (vector DB). Layered memory solved this.

2. Planning algorithms. LLMs normally commit to the first path they see. New techniques like Tree‑of‑Thoughts let agents explore multiple options before deciding – but at a cost of more tokens. Teams used cheap models for planning, expensive ones for execution.

3. Tool safety and standards. An agent with 50 tools can delete production data. Every tool needed authentication, rate limits, and human approval for destructive actions. But there was another problem: every agent framework had its own way of registering tools, a fragmented mess. The fix was MCP (Model Context Protocol), an open standard for tool discovery and invocation that Anthropic had introduced in late 2024 and that the rest of the industry adopted through 2025. Think of it as USB‑C for AI agents: any model that speaks MCP can use any tool that exposes an MCP interface, with no custom integration code for each pairing.

Tool safety gave you control. MCP gave you interoperability.
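A minimal sketch of the control side, assuming a wrapper that every tool passes through before it runs. It covers two of the three safeguards named above (rate limits and human approval for destructive actions) and omits authentication. The class and parameter names are hypothetical, and the approval callback denies by default.

```python
import time

class GuardedTool:
    """Wraps a tool with a per-minute rate limit and an approval gate."""

    def __init__(self, func, destructive=False, max_calls_per_minute=10, approve=None):
        self.func = func
        self.destructive = destructive
        self.max_calls = max_calls_per_minute
        # Deny destructive calls unless an approval callback says otherwise.
        self.approve = approve or (lambda name, kwargs: False)
        self.calls = []

    def __call__(self, **kwargs):
        now = time.monotonic()
        # Keep only the timestamps from the last 60 seconds.
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max_calls:
            raise RuntimeError("rate limit exceeded")
        if self.destructive and not self.approve(self.func.__name__, kwargs):
            raise PermissionError("human approval required")
        self.calls.append(now)
        return self.func(**kwargs)

def drop_table(name):
    return f"dropped {name}"

unapproved = GuardedTool(drop_table, destructive=True)
approved = GuardedTool(drop_table, destructive=True, approve=lambda name, kwargs: True)
```

In a real deployment the approval callback would block on a human decision (a Slack button, a ticket) rather than return immediately.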

What about observability? Teams quickly learned that traditional logging wasn't enough. When an agent did something wrong, you needed to know why – which thought led to which action, which document was retrieved, which tool call was made. Observability (tracing every step of the agent's reasoning) was essential, but no standard tool existed yet. Most teams hacked together custom logs. It would take another year to become a proper stack layer.

Separately, multi‑agent systems emerged as a different pattern. Instead of one agent doing everything, teams built separate agents for writing code, reviewing it, and running tests. When one failed, the others kept working – reducing the blast radius of any single mistake.

By late 2025, agents were triaging alerts and proposing infrastructure changes. But without budget controls, teams got $500 overnight bills. Guardrails became mandatory.
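A budget control can be as simple as a counter that refuses the next model call once a spending limit would be crossed. The sketch below is one way to do it; the class name, the limit, and the per‑token price are made‑up values, not real provider pricing.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the limit."""

class SpendGuard:
    def __init__(self, limit_usd, usd_per_1k_tokens):
        self.limit_usd = limit_usd
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self.spent_usd = 0.0

    def charge(self, tokens):
        """Record the cost of a call, or refuse it if it would exceed the limit."""
        cost = tokens / 1000 * self.usd_per_1k_tokens
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"call costing ${cost:.2f} would exceed the ${self.limit_usd:.2f} limit"
            )
        self.spent_usd += cost
        return self.spent_usd

# Hypothetical numbers: a $1.00 cap at $0.01 per 1K tokens.
guard = SpendGuard(limit_usd=1.00, usd_per_1k_tokens=0.01)
guard.charge(50_000)
```

Calling `charge` before each model call, not after, is the design choice that prevents the overnight surprise: the agent stops before the money is spent.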


2026 – The Agentic Stack and What Hasn't Changed

By the end of 2025, agents were doing real work. But every team that built one had to solve the same problems from scratch:

  • Orchestration – How to manage the agent's loop (when to stop, retry, or hand off)?
  • Memory – How to remember what happened five steps ago without overflowing the context window?
  • Tool registry – How to control what tools the agent can call, who pays for it, and how to audit it?
  • Observability – How to trace why the agent made a decision when something goes wrong?

2026 was the year these problems crystallized into a standard stack. Not everything was solved, but the industry now had a shared language and clear patterns.

Here are the four layers, what they solved, and what still hurts.

Layer 1: Orchestration — "What do we do next?"

An agent ran in a loop: observe, plan, act, repeat. Something had to manage that loop — decide when to stop, when to call a tool, when to hand off to another agent. This was orchestration.

By 2026, two dominant patterns emerged, each solving a different problem. LangGraph (from the LangChain ecosystem) became the standard for managing a single agent's internal cycle — perfect for tasks requiring loops, retries, and persistent memory. CrewAI, a separate framework, specialized in orchestrating teams of agents, where a researcher, writer, and reviewer collaborate like a well-oiled crew.

However, many teams still built their own orchestration from scratch. Off‑the‑shelf frameworks never quite fit their specific needs, whether it was custom authentication or existing task queues.

Layer 2: Memory — "What have we done so far?"

Agents needed to remember past steps, but context windows filled up fast. Memory compaction solved this by summarizing long histories into short notes – like turning meeting minutes into bullet points. The agent could then recall "already tried rebooting the cache" without storing every word.
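A sketch of compaction, assuming the history is a list of text entries and that `summarize` is a callable standing in for an LLM summarization call. The `naive_summary` stand‑in and the sample steps are hypothetical.

```python
def compact_history(history, summarize, keep_recent=3):
    """Replace all but the most recent steps with a single summary note."""
    cut = len(history) - keep_recent
    if cut <= 0:
        return list(history)
    old, recent = history[:cut], history[cut:]
    return [f"Summary of earlier steps: {summarize(old)}"] + recent

def naive_summary(steps):
    # Stand-in for an LLM call: keep the first three words of each step.
    return "; ".join(" ".join(step.split()[:3]) for step in steps)

history = [
    "rebooted the cache service on staging",
    "checked CPU on staging after reboot",
    "restarted nginx",
    "tailed error logs",
    "pinged on-call",
]
compacted = compact_history(history, naive_summary, keep_recent=3)
```

Keeping the most recent steps verbatim and summarizing only the older ones is a common compromise: the agent retains full detail where it is most likely to need it.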

Layer 3: Tool Registry — "What can the agent actually do?"

In 2025, MCP (Model Context Protocol) gave agents a standard way to call tools – the plumbing. But two problems remained: discovery (how does an agent know what tools exist?) and governance (who can call what, how often, and who pays?).

The tool registry solved both. It became a live catalog where agents could dynamically discover available tools – like an app store for MCP‑compatible functions. An agent could ask the registry, "What tools can I use?" and get back a list with descriptions and parameters. No more hardcoding server addresses.

On top of discovery, the registry added governance controls for each tool: authentication (who can call it?), rate limits (how often?), cost accounting (who pays?), and audit logs. Every tool call was billed back to a team. Without this, a runaway agent could still delete data or burn through budget – discovery alone wasn't enough.

By 2026, public registries (like the MCP Registry) hosted thousands of tools, and enterprises built internal ones. The combination of MCP for communication and a registry for discovery + governance became the standard.
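The discovery and governance roles described above can be sketched as a single in‑memory catalog. This is not an implementation of MCP or of any real registry's API; the class names, fields, tool names, and the flat per‑call cost are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ToolEntry:
    name: str
    description: str
    parameters: dict
    allowed_teams: set
    cost_usd: float = 0.0

@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)
    spend_by_team: dict = field(default_factory=dict)

    def register(self, entry):
        self.tools[entry.name] = entry

    def discover(self, team):
        """Discovery: list the tools this team may call, with descriptions and parameters."""
        return [
            {"name": e.name, "description": e.description, "parameters": e.parameters}
            for e in self.tools.values()
            if team in e.allowed_teams
        ]

    def authorize(self, team, tool_name):
        """Governance: check access, write an audit record, and bill the calling team."""
        entry = self.tools[tool_name]
        allowed = team in entry.allowed_teams
        self.audit_log.append((team, tool_name, allowed))
        if not allowed:
            raise PermissionError(f"{team} may not call {tool_name}")
        self.spend_by_team[team] = self.spend_by_team.get(team, 0.0) + entry.cost_usd
        return entry

registry = ToolRegistry()
registry.register(ToolEntry("get_weather", "Look up a forecast", {"city": "string"}, {"support", "sre"}, 0.01))
registry.register(ToolEntry("restart_service", "Restart a service", {"name": "string"}, {"sre"}, 0.05))
visible_to_support = [tool["name"] for tool in registry.discover("support")]
```

Denied calls are written to the audit log too, so the log shows what an agent attempted as well as what it was allowed to do.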

Layer 4: Observability — "Why did the agent do that?"

Traditional monitoring tracked speed and errors – not enough for agents. You needed to log every thought, every document retrieved, every tool call, and every decision. Without this, agents were black boxes. When an agent deleted the wrong file, you had to replay its steps to find out why – bad retrieval? faulty plan? hallucinated tool call? This remained the hardest unsolved layer.
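A minimal sketch of step‑level tracing, assuming each agent run gets one trace and every thought, retrieval, and tool call is appended as a structured event. The event kinds and field names are hypothetical, and no particular tracing product is implied.

```python
import json
import time
import uuid

class Trace:
    """Collects one structured event per agent step so a run can be replayed later."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.events = []

    def record(self, kind, **details):
        self.events.append({
            "trace_id": self.trace_id,
            "step": len(self.events) + 1,
            "time": time.time(),
            "kind": kind,          # e.g. "thought", "retrieval", "tool_call"
            "details": details,
        })

    def replay(self):
        """Return the ordered sequence of steps for debugging."""
        return [(event["step"], event["kind"]) for event in self.events]

    def to_jsonl(self):
        """Serialize as one JSON object per line, ready to ship to a log store."""
        return "\n".join(json.dumps(event) for event in self.events)

trace = Trace()
trace.record("thought", text="disk is full, look for large temp files")
trace.record("retrieval", document="runbook-disk-cleanup")
trace.record("tool_call", tool="delete_file", path="/tmp/cache.bin")
```

With the thought, the retrieved document, and the tool call stored side by side, the three failure modes named above (bad retrieval, faulty plan, hallucinated tool call) can be told apart after the fact.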

The key shift: from prompt engineering to goal specification

Before 2026 (prompt engineering): You told the AI exactly what to do, step by step.

"Step 1: Check the staging server's CPU usage. Step 2: If CPU is above 80%, restart the cache service. Step 3: Wait 30 seconds. Step 4: Check CPU again. Step 5: If still high, ping the on‑call engineer on Slack."

That worked, but you had to think of every possible scenario upfront. It was brittle and verbose.

In 2026 (goal specification): You told the AI the outcome, not the steps.

"Keep the staging environment healthy. Budget $5 per day. If something breaks, try to fix it automatically. Ping the on‑call engineer on Slack after two failed attempts."

The agent figured out the steps itself – which tools to call, in what order, and how to recover from failures.

That was powerful, but it came with two costs: planning algorithms were still expensive (exploring multiple reasoning paths cost tokens), and guardrails were still manual (you had to set budget limits, retry caps, and approval gates yourself).
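One way to picture a goal specification is as structured data with the guardrails written down explicitly, instead of a step‑by‑step script. The sketch below encodes the staging example from this section. The class, its fields, and the decision rule are illustrative assumptions; in a real system the agent would choose the repair steps itself and this object would only bound them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GoalSpec:
    objective: str
    daily_budget_usd: float
    max_auto_fix_attempts: int
    escalation_channel: str

    def next_move(self, failed_attempts, spent_today_usd):
        """Apply the manually set guardrails before the agent is allowed to act."""
        if spent_today_usd >= self.daily_budget_usd:
            return "stop"
        if failed_attempts >= self.max_auto_fix_attempts:
            return "escalate"
        return "attempt_fix"

spec = GoalSpec(
    objective="Keep the staging environment healthy",
    daily_budget_usd=5.0,
    max_auto_fix_attempts=2,
    escalation_channel="on-call engineer on Slack",
)
```

Every value in `spec` is something a human had to choose, which is the sense in which guardrails were still manual.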


Where We Stand at the End of 2026

Here's the honest truth about what works and what still hurts – in plain English.

| What Works (You can rely on this) | What Still Hurts (No good solution yet) |
| --- | --- |
| Classifying text into categories (spam, sentiment, etc.) | Figuring out why an agent did something wrong – logging everything is expensive and messy |
| Searching your company documents with RAG (retrieve + answer) | Agents planning more than a few steps ahead – too slow and uses too many tokens |
| Calling tools like weather APIs or databases (function calling + MCP) | Telling an agent a goal and letting it figure out the steps without you setting every safety rule manually |
| Running open source models on a regular computer (no expensive GPU needed) | Measuring if an agent's answer is "good" – no standard way yet |
| AI costs being predictable (not a scary surprise bill) | Summarizing long agent histories without losing important details |

The simple takeaway:

You can build useful AI features today without being a researcher. The basic pieces work.

But the hard problems – debugging agents, planning efficiently, setting goals safely, and evaluating quality – are still messy. That's where the next five years of work will go.

AI didn't change what good engineering looks like. It just raised the cost of ignoring it.


The 2026 Model Landscape

By the end of 2026, "which model should I use?" depends on the task — not the brand. The frontier shifted every three to six months across this period, and any specific rank-ordering will be wrong by next quarter. Here's where the major players sat in early 2026.

OpenAI — The Default Workhorse

GPT‑5.4 (March 2026) is OpenAI's flagship, with four variants: Thinking (the default reasoning model for Plus/Team/Pro), Pro (enterprise, deeper reasoning), and the lighter mini and nano released two weeks later. Standard context is 272K tokens, with experimental 1M‑token support in Codex.

By early 2026, OpenAI had been shipping computer-use agents for over a year — first Operator (January 2025), then ChatGPT Agent (July 2025), then ChatGPT Atlas (October 2025), an entire browser with the agent baked in. GPT‑5.4 also introduced Thinking mode (transparent reasoning previews) and tool search (loading tools on demand to save tokens).

Position: Still the default for most production workloads. The base tier is reasonably priced, the Pro tier is brutally expensive, and the agentic browser story is the deepest end-to-end product on the market.


Anthropic — The Reasoning Specialist

Claude Opus 4.6 and Sonnet 4.6 (March 2026) brought 1M‑token context to general availability — and crucially, at standard per-token pricing. The previous 2× input / 1.5× output premium above 200K tokens is gone. A 900K‑token request now costs the same per token as a 9K one, which made entire codebases, long contracts, and extended agent sessions actually affordable for the first time.

| Model | What it's for |
| --- | --- |
| Opus 4.6 | The most capable. Long‑horizon agentic tasks across thousands of steps, hardest reasoning, hardest code. Hits 78.3% on MRCR v2 at 1M tokens, the highest recall of any frontier model as of March 2026. |
| Sonnet 4.6 | The workhorse. Balances quality, cost, and speed. The model most teams default to for tool use and as the sub‑agent in multi‑agent systems. |
| Haiku 4.5 | Fast, cheap, high‑volume. Use for classification, simple tool calls, and the mundane stuff at the bottom of the funnel. |

Anthropic also still leads on agentic coding (Claude Code) and the Computer Use API they shipped first back in October 2024 — which set the precedent OpenAI later followed with Operator and Atlas.

Position: The premium choice for complex, multi‑step agentic workflows. More expensive at the top than OpenAI's base tier, but justifies it on reasoning depth and tool‑use reliability. The flat‑pricing 1M context is a genuine inflection in what's affordable.


Google — The Long‑Context King

Gemini 3.1 Pro (February 2026) is the current flagship: 1M‑token context window, 65K output tokens, 114 tokens/second output speed, and aggressive per-token pricing well below GPT‑5.4's base tier. Native multimodal training across text, images, audio, video, and code remains its differentiator. Tight integration with Google Workspace, Search, and Vertex AI gives Google a distribution moat no other lab has.

Position: The pricing-and-distribution leader. If you're already on GCP or in the Workspace ecosystem, the integration story is hard to beat. The context window is enormous, multimodal is best-in-class, and the per-token cost undercuts the other frontier labs.


Meta — The Open‑Weight Champion

LLaMA 4 (April 2025) was Meta's first mixture-of-experts (MoE) generation. Three variants:

| Variant | Context | Parameters | Notes |
| --- | --- | --- | --- |
| Scout | 10M tokens | 109B total / 17B active, 16 experts | Industry‑leading context window. Fits an entire book series, a full code repo, or months of logs. Runs on a single H100 in INT4 quantization. |
| Maverick | 1M tokens | 400B total / 17B active, 128 experts | The flagship production model. Best-in-class multimodal; beats GPT‑4o and Gemini 2.0 on coding, reasoning, and image benchmarks. |
| Behemoth | | | Still training. Meta's bid for open‑weight frontier capability. |

Maverick costs an order of magnitude less than the closed‑weight frontier models on per-token pricing through providers like Replicate and Together.

Position: If you need self‑hosting, fine‑tuning, or zero data leakage, LLaMA 4 is no longer a compromise — it's a real choice. Scout's 10M context is a technical marvel, though the cost of actually processing 10 million tokens at a time is non‑trivial.


DeepSeek — The Reasoning Disruptor

DeepSeek‑R1 (January 2025) was the moment open-source reasoning caught up with the frontier. R1 used reinforcement learning with verifiable rewards (optimized with the GRPO algorithm, Group Relative Policy Optimization) to teach a base model to think step by step, and the resulting model approached OpenAI's o‑series and Gemini 2.5 Pro on math, coding, and general logic benchmarks. At a fraction of the cost. Open‑weight.

By late 2025 / early 2026, DeepSeek had moved to a hybrid architecture: V3.1 (August 2025) merged the strengths of V3 and R1 into a single 671B‑parameter MoE model with 37B active params, supporting 128K context. V3.2 followed for general daily tasks, and R2 became the dedicated reasoning specialist.

Position: The model that proved frontier reasoning could be built and shipped openly. By 2026, DeepSeek is the default open‑weight choice when reasoning quality matters more than raw parameter count or context length.


Alibaba Qwen — The Open‑Weight Multimodal Player

Qwen‑3.5 (February 2026) is Alibaba's current flagship. The Qwen3 family that preceded it covers dense models from 0.6B to 32B parameters plus MoE variants up to 235B total / 22B active, trained on 36 trillion tokens (double Qwen 2.5). The hybrid thinking mode lets the model switch between deep reasoning and quick responses based on task complexity. Qwen3‑Omni accepts text, images, audio, and video as input and responds in text and speech. All released under Apache 2.0.

Position: The strongest open‑weight multimodal player by 2026, and the model that most often tops open‑weight leaderboards. If you need open weights AND multimodal AND a permissive license, this is the answer.


The Rest of the Pack

Model | Position
Mistral / Mixtral | The European open‑weight player. Strong MoE work, competitive pricing. Solid mid‑tier choice for teams that want a non‑US, non‑Chinese option.
Cohere Command R+ | Enterprise-focused and retrieval-optimized. Niche but genuinely good for RAG‑heavy workflows where the model needs to be tightly bound to citation behavior.

A note on Microsoft Copilot, GitHub Copilot, and Cursor: these are products built on top of the foundation models above, not foundation models themselves. They're worth using, but evaluating them is a different question — you're picking a UX and an integration, not a model. The model under the hood is usually GPT, Claude, or both, depending on the task.


What This Means for Engineers

Need | Recommendation
General-purpose workhorse | GPT‑5.4 base tier or Claude Sonnet 4.6. Pick one and build a fallback to the other.
Deep reasoning, multi‑step planning | Claude Opus 4.6, OpenAI o‑series, or DeepSeek R2, in that order.
Massive context (whole codebases, long videos) | LLaMA 4 Scout (10M) for self‑hosted, Gemini 3.1 Pro (1M) for managed.
Cost‑sensitive, self‑hosted | LLaMA 4 Maverick or Qwen‑3.5. Both are cheap per token once hosted, though Maverick's roughly 400B total parameters call for a multi‑GPU server rather than commodity hardware.
Agentic browser / computer use | ChatGPT Atlas (deepest end‑to‑end product) or Claude with Computer Use API (most flexible).
High‑volume, low‑latency classification | Gemini Flash, Claude Haiku 4.5, or any small open‑weight model.
Open‑weight reasoning | DeepSeek R2 or Qwen3‑Thinking.

The honest takeaway: by 2026, the model is rarely the bottleneck. Bad data, bad retrieval, missing guardrails, and weak evaluation kill more AI projects than picking the "wrong" model. Pick a frontier model with an SLA, a couple of open‑weight backups for sovereignty and cost ceilings, and a benchmark that reflects your actual use case. Iterate from there.
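
The "frontier model plus backups" advice comes down to a small piece of plumbing: try providers in order and return the first success. Below is a minimal sketch, assuming each provider is wrapped as a plain function from prompt to text. The stubs stand in for real SDK calls, and every name here is hypothetical.

```python
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch the SDK's specific error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stubs standing in for real SDK calls.
def primary_stub(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def backup_stub(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = complete_with_fallback(
    "Summarize the incident report.",
    [("primary", primary_stub), ("backup", backup_stub)],
)
print(used, "->", answer)
```

Returning the provider name alongside the answer makes it easy to log how often the fallback fires, which is usually the first signal that the primary's SLA is slipping.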


The Journey So Far

Here's how we got from "classification only" to where we stand today:

Year | What Happened | What It Meant
2022 | Copilot and ChatGPT arrived | AI started helping us code and chat – but we couldn't build it ourselves
2023 | Open source models (LLaMA) and RAG | We could finally build our own AI apps using vector databases
2024 | Costs crashed and function calling arrived | AI became a predictable, cheap tool – just another part of the stack
2025 | Agents with memory and MCP | AI stopped waiting for instructions – it started pursuing goals
2026 | The four‑layer agentic stack | We finally had standard patterns for building production agents

Four years from the classification‑only days of 2022, the dust settled. You no longer needed a research team. The stack worked.

What didn't change? The old engineering disciplines – idempotency, human approvals, cost monitoring, evaluation – only became more valuable. AI didn't eliminate them; it exposed them.
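
Idempotency is a good example of an old discipline that agents make urgent: an agent that retries a step must not repeat its side effects. The sketch below derives a key from the task, tool, and arguments and replays the stored result on a repeat call. The class, the refund tool, and the in-memory store are assumptions for illustration; a real system would need a durable store.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentToolRunner:
    """Runs each (task, tool, args) combination at most once and replays the result."""

    def __init__(self) -> None:
        self._completed: dict[str, Any] = {}

    @staticmethod
    def _key(task_id: str, tool: str, args: dict[str, Any]) -> str:
        payload = json.dumps({"task": task_id, "tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, task_id: str, tool: str, args: dict[str, Any], action: Callable[..., Any]) -> Any:
        key = self._key(task_id, tool, args)
        if key in self._completed:
            return self._completed[key]  # retry: skip the side effect
        result = action(**args)
        self._completed[key] = result
        return result


sent: list[str] = []


def send_refund(order_id: str, amount_usd: float) -> str:
    """Hypothetical side-effecting tool."""
    sent.append(order_id)
    return f"refunded {amount_usd:.2f} for {order_id}"


runner = IdempotentToolRunner()
call = ("task-17", "send_refund", {"order_id": "A100", "amount_usd": 25.0})
first = runner.run(*call, action=send_refund)
second = runner.run(*call, action=send_refund)  # simulated agent retry
print(first == second, len(sent))
```

Including the task identifier in the key keeps a legitimate second refund in a different task from being swallowed as a duplicate.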

What comes next? Not smarter models. Three things:

  • Better observability — standard tools for tracing why an agent made each decision, not custom logs hacked together per team.
  • Cheaper planning — agents that explore multiple options without burning $50 in tokens per task.
  • Reliable evaluation — a way to measure if an agent's answer is actually good, not just "feels right."
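
For the evaluation point, the smallest useful starting place is a fixed set of cases with a mechanical pass/fail check, run on every change. The sketch below assumes the agent is a plain function from prompt to text and uses a required-phrase check as the scoring rule. The cases, the stub agent, and the scoring rule are all hypothetical, and phrase matching is far cruder than the evaluation the list above asks for.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    required_phrases: tuple[str, ...]


def passes(answer: str, case: EvalCase) -> bool:
    """A case passes when every required phrase appears in the answer (case-insensitive)."""
    text = answer.lower()
    return all(phrase.lower() in text for phrase in case.required_phrases)


def pass_rate(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    results = [passes(agent(case.prompt), case) for case in cases]
    return sum(results) / len(results)


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent call."""
    canned = {
        "What is our refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is available on every plan.",
    }
    return canned.get(prompt, "")


cases = [
    EvalCase("What is our refund window?", ("30 days",)),
    EvalCase("Which plan includes SSO?", ("enterprise",)),
]
print(f"pass rate: {pass_rate(stub_agent, cases):.0%}")
```

Even a crude harness like this turns "feels right" into a number that can be tracked across model swaps and prompt changes.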

The research problem is mostly solved. The engineering problem is just getting started.


Further Reading

The papers, posts, and tools that defined the four-year arc above. Skim the abstracts; the diagrams in the originals usually do more for intuition than any summary.

Foundational papers

Tools and protocols

  • Model Context Protocol — The open standard for tool discovery and invocation that became "USB‑C for AI agents" in 2025.
  • LangChain — The LLM framework that defined 2023's developer experience.
  • LangGraph — The orchestration framework for single‑agent loops.
  • CrewAI — The orchestration framework for multi‑agent collaboration.
  • Chroma — The lightweight vector database that powered most weekend RAG prototypes.

Visual explainers

  • The Illustrated Transformer by Jay Alammar — The canonical visual walkthrough of self‑attention. The diagrams will teach you more in 15 minutes than the paper does in an hour.

Product launches worth re‑reading

2025–2026 launches

  • DeepSeek‑R1 (January 2025) — The reinforcement‑learning paper that proved open‑source reasoning could match the frontier.
  • LLaMA 4 (April 2025) — Meta's mixture‑of‑experts generation, including Scout's industry‑leading 10M context window.
  • Gemini 3.1 Pro (February 2026) — Google's current flagship: 1M context, aggressive per‑token pricing, native multimodal.
  • Qwen 3.5 (February 2026) — Alibaba's open‑weight multimodal flagship under Apache 2.0, with hybrid thinking mode.
  • GPT‑5.4 (March 2026) — OpenAI's current flagship with Thinking, Pro, mini, and nano variants.
  • Claude 4.6 family (March 2026) — Anthropic's 1M‑context release at flat per‑token pricing, removing the 200K premium tier.