
Concrete and Steel: The Convergence That Commoditises AI

By Sami Mandeel · March 2026

Two days ago, Alibaba released a 9-billion-parameter AI model that performed competitively with models more than thirteen times its size on a graduate-level reasoning benchmark.

That shouldn't be happening.

For the past four years, the rule of AI has been simple: bigger models win. More parameters, more compute, more data. That's how you build the best model. And it's why the best AI has lived exclusively in the cloud, on hardware most people will never own.

But something has changed. And Qwen's result isn't an isolated case. Across multiple benchmarks over the past year, smaller open-weight models have begun closing the gap with much larger proprietary systems.

Technology tends to move through the same cycle. At first, capability comes from scale: bigger machines, more power, more resources. Over time, engineers discover more efficient architectures, better algorithms, and smarter ways to allocate compute. What once required a data centre eventually runs on a desktop, then a laptop, then a phone. We've seen it with compute, with storage, with graphics, and with machine learning itself.


Concrete and Steel

Think about how we build skyscrapers. Concrete is strong in compression. Steel is strong in tension. Neither one alone can support a tall structure. But combine them into reinforced concrete and you get something stronger than either material alone.

AI models are starting to work the same way.

The model that competed with a giant 13x its size, Alibaba's Qwen 3.5-9B, doesn't use one monolithic neural network. It uses an architecture called Mixture of Experts (MoE): hundreds of specialist sub-models, each tuned to different types of reasoning, with a routing layer that picks the right experts for each task. The rest stay dormant.

Only a fraction of the total parameters fire on any given prompt, so you get much of the capability of a larger model at a small fraction of the inference cost.
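The routing idea is simple enough to sketch in a few lines. This is a toy illustration of top-k expert routing in general, not Qwen's actual implementation; the sizes, the router weights, and the `moe_forward` function are all made up for clarity.

```python
# Toy sketch of Mixture-of-Experts routing (illustrative only, not
# Qwen's implementation). A router scores every expert for each token,
# only the top-k experts run, and the rest stay dormant.
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 64, 8, 2   # toy sizes; real models use far more experts
router_w = rng.normal(size=(D, N_EXPERTS))                     # routing layer
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]  # expert weights

def moe_forward(x):
    """Route a single token vector x through its top-k experts."""
    scores = x @ router_w                  # one routing score per expert
    top = np.argsort(scores)[-TOP_K:]      # indices of the k best experts
    gate = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax weights
    # Weighted sum over only the selected experts; the others never run.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

y = moe_forward(rng.normal(size=D))
print(f"activated {TOP_K / N_EXPERTS:.0%} of experts for this token")
```

With 2 of 8 experts active, each token pays for a quarter of the parameters while the model as a whole keeps the full set of specialisations on hand.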

MoE is only part of the story. Better training data, distillation from frontier models, improved post-training techniques, and architecture refinements are all compounding at once. The result is that a model running on your laptop is approaching the capability of models people currently pay $20 a month to access.

Two curves are converging. And most people are only watching one of them.


The Convergence

Open-source models are catching proprietary ones. Fast.

In 2023, the gap between the best open-source model and the best proprietary model on common academic benchmarks like MMLU was 17.5 percentage points. By early 2025, that gap had collapsed to 0.3 points. The lag time tells the same story: open-source models used to take 27 weeks to match a new proprietary release. Now it's 13 weeks. Epoch AI puts the average at roughly 3 months, and shrinking.

In February 2026, four major coding models launched in six days: two proprietary, two open-source. The benchmark gap between the best and worst? 2.6 percentage points.

Consumer hardware is catching up too.

Apple's M5 chip, released late 2025, delivers 4x the AI compute of the M4. It runs a 14-billion-parameter model with time-to-first-token under 10 seconds. A 30-billion-parameter MoE model? Under 3 seconds. Quantisation breakthroughs mean a model that would have needed 32GB of memory now fits in 6.6GB.
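The memory figures follow from simple arithmetic. As a back-of-envelope sketch (ignoring KV cache, activations, and runtime overhead), weight memory is just parameter count times bits per weight; the `weight_memory_gb` helper below is mine, not from any library.

```python
# Back-of-envelope weight-memory arithmetic (assumption: memory ≈
# params × bits_per_weight / 8; ignores KV cache and runtime overhead).
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(14, 16))    # 28.0 GB at fp16, in the 32GB ballpark
print(weight_memory_gb(14, 3.8))   # ≈ 6.65 GB, close to the 6.6GB figure
print(weight_memory_gb(30, 4.8))   # ≈ 18 GB, a 30B model at Q4-ish precision
```

The same formula explains why a 30-billion-parameter MoE model at roughly Q4 precision lands near 18GB, small enough for a 24GB machine.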

The likely future isn't “everyone runs local models” or “everyone uses cloud APIs.” It's a hybrid: a capable local model handling 90% of your daily tasks (fast, private, free) with occasional calls to the cloud for the 10% that genuinely requires frontier-scale capability.

Think of it like photography. Professional cameras didn't disappear when smartphone cameras got good enough. But 95% of photos are now taken on phones. The expensive gear still exists for specialists, but the default shifted.

AI is heading the same way.


When the Lines Cross

Let's put real numbers on it.

Anthropic's Claude Opus 4.6, one of the best cloud models available today, scores 91% on the MMLU benchmark. Qwen 3.5-9B, running locally on a Mac Mini with 16GB of RAM, scores 88%. That's a 3-point gap. And the local model uses just 6.6GB of memory.

To close that gap by just scaling up the same MoE architecture, you'd need roughly a 30-billion-parameter MoE model. At Q4 quantisation, that's about 18GB. It would fit comfortably on a machine with 24GB of unified memory.

Apple is expected to ship an M5 Mac Mini with 24GB base RAM sometime in late 2026 or early 2027. Alibaba, Meta, and Mistral are all shipping MoE models on roughly six-month cycles. A 30B MoE model already exists: Qwen 3's 30B-A3B, released in 2025, ran with only 3 billion active parameters and outperformed the 32-billion dense QwQ model.

The architecture exists. The hardware is coming. By 2027, a base-spec Mac Mini could plausibly run a local model that approaches today's cloud frontier for many common tasks.

[Chart: MMLU score of the best model runnable on a base Mac Mini, 2021-2028. Series: dense models, MoE models; reference line: Claude Opus 4.6.]

The chart tracks the best model runnable on a base Mac Mini since 2021, and the punchline lands in 2026: the 2025 entry, a 14B dense model, scored 84.8%; the 2026 entry, a 9B MoE model, scores 88%. The model got smaller but the score went up. That's the concrete-and-steel inflection, and the dashed line shows why the trend is about to cross Opus.

Not on every task. Not for every use case. But for the vast majority of what people actually use AI for (writing, coding, analysis, summarisation, brainstorming), the gap is already narrow.

When intelligence itself becomes free, the business model of selling access to intelligence starts to collapse.

We've seen this pattern before. Compute used to be scarce and expensive. Now it's a commodity. Storage used to be scarce. Now it's effectively free. When a foundational technology becomes abundant, the value shifts up the stack.


From Chatbots to Agents

This doesn't mean OpenAI and Google disappear. It means their role changes. If raw intelligence becomes cheap, the real question becomes: where does the value move next?

The shift isn't from one chatbot to a better chatbot. It's from LLMs that chat to agents that do. Claude doesn't just answer questions. It writes code, automates your desktop, and picks up mid-task on your phone. OpenAI is building an operating system for AI workflows. Google is embedding agents into products 2 billion people already use.

That agent layer, not the model underneath, is what you're actually paying for. And it's much harder to replicate locally than raw intelligence. A 9B model on your Mac Mini can match Opus on a benchmark, but it can't browse the web, execute code in a sandbox, connect to your Slack, or chain tasks together across devices. Not yet, anyway.

The likely split: local models handle the private, latency-sensitive tasks (drafting emails, summarising documents, analysing your own data) while cloud agents handle the heavy orchestration that requires web access, multi-service integrations, and compute that exceeds what sits on your desk.
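The split above is essentially a routing decision, and it can be sketched as one. Everything here is hypothetical: the `Task` fields and the escalation criteria are illustrative stand-ins for whatever signals a real dispatcher would use.

```python
# Hypothetical sketch of the hybrid local/cloud split described above:
# keep private, latency-sensitive work local and escalate only when a
# task needs web access, integrations, or frontier-scale reasoning.
# The Task fields and thresholds are illustrative, not a real product.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    needs_web: bool = False
    needs_integrations: bool = False    # e.g. Slack, sandboxed code execution
    frontier_reasoning: bool = False    # beyond what the local model handles

def route(task: Task) -> str:
    """Decide which backend should handle the task."""
    if task.needs_web or task.needs_integrations or task.frontier_reasoning:
        return "cloud-agent"    # the orchestration-heavy minority of tasks
    return "local-model"        # private, fast, free: the everyday majority

print(route(Task("summarise this document")))                     # local-model
print(route(Task("research flights and book one", needs_web=True)))  # cloud-agent
```

The design choice worth noting: the default is local, and the cloud is an explicit escalation, which is the opposite of today's cloud-first defaults.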

When intelligence becomes cheap and abundant, the competitive advantage moves up the stack, from models to systems. The underlying commodity, raw language model intelligence, is being rapidly democratised. The companies that survive this shift won't be the ones with the biggest model. They'll be the ones that build the most useful agent ecosystems on top, because the models themselves are becoming infrastructure.


What This Means for You

For the first time in the history of AI, the most powerful model you use in a typical day may soon be running on your own machine.

As models become commodities, the real skill isn't access to AI. It's knowing how to use it well. When everyone has the same powerful model running locally, the gap between someone who prompts well and someone who doesn't gets wider, not narrower.

The model won't be the bottleneck much longer. You will. In a world of commodity intelligence, the human in the loop becomes the only non-commodity left. That's why I built The Prompt Engine, a hands-on course that teaches you to extract maximum value from any model, whether it's in the cloud or on your laptop.


Sources

Epoch AI estimates the open-weight to closed-weight lag at approximately 3 months as of early 2026. MMLU gap data from Red Hat's State of Open Source AI 2025 report. Apple M5 benchmarks from Apple's MLX research. Qwen 3.5 benchmarks from Alibaba's March 2, 2026 release. Claude Opus 4.6 benchmarks from Anthropic, February 2026. Hardware specs from apple.com. All MMLU scores are 5-shot from published model cards. 2027-2028 values are extrapolations.

Written by Sami Mandeel · Creator of The Prompt Engine