AI

Transformer Rewrites: How CODA Is Reshaping LLM Economics

Friday, May 22, 2026 · 3 min read

A new paper from the systems side of AI is quietly solving one of infrastructure's thorniest problems: how to make transformers actually run fast on real hardware.

CODA rewrites transformer blocks as GEMM-epilogue programs: the matrix multiplications (GEMMs) remain the core of each kernel, and the surrounding work is expressed as lightweight post-processing applied in the GEMM's epilogue, before results are written back to memory (in typical GEMM epilogues, things like bias adds, activations, and residual additions). This matters because it narrows the gap between theoretical peak compute and what your GPUs actually achieve in practice. Most transformer implementations leave performance on the table through fragmented operations, extra memory movement, and suboptimal kernel fusion. CODA targets that waste.
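To make the "GEMM plus epilogue" idea concrete, here is a minimal pure-Python sketch. It is not CODA's implementation; the function names (`unfused_block`, `gemm_with_epilogue`, `fused_block`) and the choice of bias, GELU, and residual as the epilogue are assumptions made for illustration. The unfused version makes four passes over the output matrix, while the fused version applies the same math to each accumulator once, just before the single store.

```python
import math


def gelu(x):
    # tanh approximation of GELU, used here only as a sample activation
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matmul(a, b):
    """Plain GEMM on lists of lists: returns a @ b."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def unfused_block(x, w, bias, residual):
    """Hypothetical unfused path: every step re-reads and re-writes the full output."""
    y = matmul(x, w)                                              # pass 1: GEMM
    y = [[v + bias[j] for j, v in enumerate(row)] for row in y]   # pass 2: bias add
    y = [[gelu(v) for v in row] for row in y]                     # pass 3: activation
    return [[v + r for v, r in zip(row, res_row)]                 # pass 4: residual add
            for row, res_row in zip(y, residual)]


def gemm_with_epilogue(a, b, epilogue):
    """GEMM that calls epilogue(i, j, acc) on each accumulator before its one store."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]
            out[i][j] = epilogue(i, j, acc)  # post-processing happens "in register"
    return out


def fused_block(x, w, bias, residual):
    """Same math as unfused_block, expressed as one GEMM with an epilogue."""
    return gemm_with_epilogue(
        x, w, lambda i, j, acc: gelu(acc + bias[j]) + residual[i][j]
    )


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    w = [[0.5, -1.0], [0.25, 0.75]]
    bias = [0.1, -0.2]
    print(fused_block(x, w, bias, residual=x))
```

On real hardware the win comes from skipping the intermediate round trips to memory, which this sketch only mimics structurally; Python itself will not run faster in the fused form.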

For founders building LLM infrastructure or serving models at scale, this is a direct hit to unit economics. Inference cost and latency are the moats in this business. A 20-30% efficiency gain, the range that kernel-level optimization papers often report, translates into cheaper serving, faster response times, or both. That's the difference between a viable product and one that can't compete on price in a commoditizing market.
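A rough back-of-the-envelope sketch of what that range means for serving cost follows. It assumes "efficiency gain" means higher token throughput on the same hardware at the same hourly price; the GPU price and throughput figures are hypothetical placeholders, not measurements from CODA or any provider.

```python
def cost_per_million_tokens(gpu_dollars_per_hour, tokens_per_second):
    """Serving cost per 1M tokens for one GPU at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


def cost_reduction(throughput_gain):
    """Fractional drop in cost per token when throughput rises by
    throughput_gain (0.25 means +25%) at a fixed hardware cost."""
    return 1.0 - 1.0 / (1.0 + throughput_gain)


if __name__ == "__main__":
    price = 2.00       # hypothetical GPU price, dollars per hour
    baseline = 1000.0  # hypothetical baseline throughput, tokens per second
    for gain in (0.20, 0.25, 0.30):
        before = cost_per_million_tokens(price, baseline)
        after = cost_per_million_tokens(price, baseline * (1.0 + gain))
        print(f"+{gain:.0%} throughput: ${before:.4f} -> ${after:.4f} per 1M tokens "
              f"({cost_reduction(gain):.1%} cheaper)")
```

Note that the saving is smaller than the headline gain: under these assumptions, a 20-30% throughput improvement works out to roughly a 17-23% drop in cost per token, because cost scales with the inverse of throughput.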

The broader context: we're seeing the AI stack mature past the "throw more compute at it" phase. Early wins came from better algorithms and models. Now the frontier is systems-level optimization. Companies like Together, Mistral, and Modal aren't winning just on model quality—they're winning on inference efficiency. CODA is the kind of optimization that gets embedded into serving frameworks and eventually becomes table stakes. If you're not thinking about this layer, you're leaving margin on the table.

What's particularly interesting is that this isn't a one-off trick. GEMM operations are the bedrock of almost every accelerator. By treating inference as a GEMM problem with an epilogue, CODA's approach should, in principle, port across hardware: GPUs, TPUs, custom silicon. That makes it a largely architecture-agnostic optimization, one that can follow your workloads wherever they run.

The pattern emerging across this week's briefing is clear: the low-hanging fruit in raw model capability is mostly picked. The next frontier is efficiency, tooling, and verticalization. Multi-stream architectures (hit #1) unlock better batch utilization. Claude's coding advances (hit #2) point toward autonomous agents that need reliable, fast inference. Healthcare deployments like AdventHealth (hit #4) are showing that enterprise adoption depends on operational efficiency and integration, not just raw capability.

There's also a subtle shift in geographic strategy. Google DeepMind's Asia-Pacific accelerator (hit #3) signals that climate/sustainability AI is getting serious funding and that innovation hubs are decentralizing. If you're building in that space, there's now institutional capital flowing.

The question on world models (hit #5) is the long-term play: can we move beyond statistical pattern matching to causal reasoning? That's still at the research frontier, but it's the kind of architectural innovation that could unlock entirely new product categories. Not immediately relevant to most founders shipping today, but worth tracking.

Bottom line: if you're building anything in the AI stack—inference, tooling, applications—the next 18 months are about efficiency and deployment, not just capability. CODA is one piece of that puzzle, but it signals where the real returns are: making AI actually run at the cost and speed that customers expect.

Quick Hits

