Inference Optimization in Production | Stockholm MLOps #29 | March 12, 2026

Insights from the Stockholm MLOps community

“Scaling serverless inference is less about GPUs and more about what breaks when you add them.” — Matthijs Kok, evroc

That line captured the core theme of the evening.

At Stockholm MLOps #29, engineers and founders shared what actually happens when AI systems move from experimentation into real production environments.

Not demos.
Not benchmarks.
But the operational realities of running inference systems at scale.

The event focused on:

  • inference optimization

  • model interoperability

  • infrastructure ownership

  • open vs closed models

  • orchestration

  • and sovereignty in AI infrastructure

Summary — Key Insights from the Event

  • Most inference problems emerge outside the GPU

  • Open models shift operational complexity to the operator

  • The optimization layer is becoming the real infrastructure moat

  • Model orchestration is replacing single-model thinking

  • Inference economics become dominant at production scale

What This Event Was Really About

As AI systems scale, the bottleneck increasingly shifts away from the models themselves.

Instead, teams are now dealing with:

  • orchestration complexity

  • parser failures

  • cache hierarchies

  • deployment safety

  • GPU scheduling

  • and infrastructure economics

“Inference infrastructure is becoming a distributed systems problem.” — Matthijs Kok, evroc

The implication:

Production AI is becoming less about prompts and more about systems engineering.

Key Insights from the Event

1. Inference Scaling Breaks Everywhere Except the GPU

“We discovered a bug that corrupted live agentic sessions only after running longer production workloads.” — Matthijs Kok, evroc
“Instead of returning structured tool calls, the system leaked reasoning tokens directly into the client.” — Matthijs Kok

The hardest inference problems increasingly emerge outside the models themselves, in:

  • orchestration layers

  • parsers

  • release systems

  • and runtime coordination
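
As a rough illustration of the failure mode in the quotes above, here is a minimal sketch of a parser that separates a reasoning block from client-visible output and fails closed when the end marker never arrives. The talk did not describe the actual parser, so the `<think>` delimiters, the `ParsedOutput` type, and the `split_reasoning` function are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical delimiters; real models and serving stacks define their own markers.
REASONING_START = "<think>"
REASONING_END = "</think>"


@dataclass
class ParsedOutput:
    reasoning: str   # internal reasoning, never sent to the client
    answer: str      # client-visible text (for example, a structured tool call)
    truncated: bool  # True when the reasoning block was never closed


def split_reasoning(raw: str) -> ParsedOutput:
    """Separate reasoning from the client-visible answer, failing closed."""
    start = raw.find(REASONING_START)
    if start == -1:
        # No reasoning block at all: everything is client-visible.
        return ParsedOutput(reasoning="", answer=raw.strip(), truncated=False)

    body_start = start + len(REASONING_START)
    end = raw.find(REASONING_END, body_start)
    if end == -1:
        # The end marker never arrived. Keep only the text that preceded the
        # block as the answer and flag the result, rather than leaking the
        # unterminated reasoning to the client.
        return ParsedOutput(
            reasoning=raw[body_start:].strip(),
            answer=raw[:start].strip(),
            truncated=True,
        )

    # Normal case: drop the reasoning block and keep the text around it.
    answer = raw[:start] + raw[end + len(REASONING_END):]
    return ParsedOutput(
        reasoning=raw[body_start:end].strip(),
        answer=answer.strip(),
        truncated=False,
    )
```

A caller could treat `truncated=True` as a signal to retry or return an error instead of forwarding the response, which is one way to make this class of bug visible before it reaches a live session.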

2. The Optimization Layer Is Becoming the Real Moat

“A huge amount of inference optimization now revolves around KV cache hierarchy design.” — Lucas Ferreira, Inceptron
“The optimization layer — not the model — is becoming the moat.” — Lucas Ferreira

As open models become more accessible, competitive advantage increasingly comes from:

  • runtime optimization

  • compiler infrastructure

  • hardware-aware scheduling

  • and cache management
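
To make "KV cache hierarchy" a little more concrete, the sketch below models a two-tier cache: a small fast tier (standing in for GPU memory) and a larger slow tier (standing in for host memory), with least-recently-used demotion between them. This is a toy under stated assumptions, not how any speaker's system works; the `TieredKVCache` class, the prefix keys, and the byte-string blocks are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class TieredKVCache:
    """Toy two-tier cache: LRU entries are demoted from fast to slow, then evicted."""

    def __init__(self, fast_capacity: int, slow_capacity: int) -> None:
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.fast = OrderedDict()  # stands in for GPU memory
        self.slow = OrderedDict()  # stands in for host memory

    def put(self, prefix_key: str, kv_block: bytes) -> None:
        # New or refreshed entries always land in the fast tier.
        self.slow.pop(prefix_key, None)
        self.fast[prefix_key] = kv_block
        self.fast.move_to_end(prefix_key)
        self._rebalance()

    def get(self, prefix_key: str) -> Optional[bytes]:
        if prefix_key in self.fast:
            # Fast-tier hit: mark as most recently used.
            self.fast.move_to_end(prefix_key)
            return self.fast[prefix_key]
        if prefix_key in self.slow:
            # Slow-tier hit: promote the block back into the fast tier.
            block = self.slow.pop(prefix_key)
            self.put(prefix_key, block)
            return block
        return None  # miss: the caller would have to recompute the prefix

    def _rebalance(self) -> None:
        # Demote least-recently-used blocks from fast to slow.
        while len(self.fast) > self.fast_capacity:
            key, block = self.fast.popitem(last=False)
            self.slow[key] = block
        # Evict least-recently-demoted blocks from the slow tier entirely.
        while len(self.slow) > self.slow_capacity:
            self.slow.popitem(last=False)
```

Real hierarchies add more tiers, block-level sharing, and transfer-cost awareness, but the core trade-off is the same: deciding which prefixes stay close to the compute and which can tolerate a slower fetch.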

3. Open Models Turn Companies Into Infrastructure Operators

“Companies prototype with closed models. Then the inference bill arrives.” — Lucas Ferreira
“Running open models in production means owning the entire MLOps stack.” — Lucas Ferreira

Open models provide:

  • lower cost

  • more flexibility

  • sovereignty

  • and infrastructure control

But they also expose:

  • deployment complexity

  • observability problems

  • scaling challenges

  • and operational burden

4. Model Orchestration Beats Single-Model Thinking

“You don’t have to use just one model.” — Göran Sandahl, Opper
“Three models receiving identical prompts produced completely different outputs.” — Göran Sandahl

Several speakers highlighted a shift toward:

  • multi-model orchestration

  • routing layers

  • interoperability

  • and ensemble-style reasoning

The future increasingly looks like:
👉 systems of models rather than dependence on a single provider.
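
A routing layer of this kind can be sketched in a few lines. The example below picks an ordered list of models per task type and falls back to the next one when a call fails. The `ModelRouter` class, the task names, and the callable-per-model interface are assumptions for illustration only and do not reflect any speaker's product or API.

```python
from typing import Callable, Dict, List, Tuple

# A model is represented here as a plain callable: prompt in, text out.
ModelFn = Callable[[str], str]


class ModelRouter:
    """Route a prompt to an ordered list of models, falling back on failure."""

    def __init__(
        self,
        models: Dict[str, ModelFn],
        routes: Dict[str, List[str]],
        default: List[str],
    ) -> None:
        self.models = models
        self.routes = routes    # task type -> model names in priority order
        self.default = default  # used when the task type has no explicit route

    def complete(self, task: str, prompt: str) -> Tuple[str, str]:
        """Return (model_name, output) from the first model that succeeds."""
        errors = []
        for name in self.routes.get(task, self.default):
            try:
                return name, self.models[name](prompt)
            except Exception as exc:  # a real router would catch narrower errors
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the output keeps the routing decision observable, which matters once identical prompts can yield different answers depending on which model served them.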

5. Production AI Is Mostly Systems Engineering

“Every change to the inference stack affects correctness, throughput, and economics.” — Matthijs Kok
“Fine-tuning, deployment, scaling, observability, routing — it all becomes your problem.” — Lucas Ferreira

Across talks, one pattern became clear:

Production AI systems increasingly resemble distributed infrastructure systems.

The operational layer now includes:

  • release engineering

  • runtime orchestration

  • deployment safety

  • cache invalidation

  • and GPU utilization optimization

6. Economics Become Dominant at Scale

“Open-source models are still behind frontier closed models — but they’re 10 to 100 times cheaper.” — Lucas Ferreira
“The CEO asks how much revenue a GPU can generate. Engineers answer: it depends.” — Matthijs Kok

At scale:

  • inference cost

  • GPU utilization

  • cache efficiency

  • and infrastructure economics

become central engineering constraints.
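
The "it depends" answer can be made explicit with a back-of-the-envelope model. The sketch below relates GPU hourly cost, sustained throughput, and utilization to cost per million tokens and margin per GPU-hour. The function names and every number in the usage note are hypothetical; none of them come from the talks.

```python
def cost_per_million_tokens(
    gpu_cost_per_hour: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Serving cost per one million tokens for a single GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually served in one hour, after accounting for idle time.
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / effective_tokens_per_hour * 1_000_000


def margin_per_gpu_hour(
    price_per_million_tokens: float,
    gpu_cost_per_hour: float,
    tokens_per_second: float,
    utilization: float,
) -> float:
    """Revenue minus GPU cost for one GPU-hour at a given token price."""
    if not 0 <= utilization <= 1:
        raise ValueError("utilization must be in [0, 1]")
    effective_tokens_per_hour = tokens_per_second * 3600 * utilization
    revenue = effective_tokens_per_hour / 1_000_000 * price_per_million_tokens
    return revenue - gpu_cost_per_hour
```

With illustrative inputs of 2.00 per GPU-hour, 1,000 tokens per second, and 50% utilization, the GPU serves 1.8 million tokens per hour, so the cost is roughly 1.11 per million tokens. Doubling utilization halves that figure, which is why cache efficiency and scheduling show up directly in the economics.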

7. Reliability Requires Defensive Infrastructure

“The parser never detected the end of the reasoning block, so the tool calls never started.” — Matthijs Kok
“Fixing the bug was only half the problem. Releasing the fix safely was the harder challenge.” — Matthijs Kok

Reliability increasingly depends on:

  • defensive parsing

  • release automation

  • rollback safety

  • and observability under real workloads

Production failures often emerge only after long-running sessions and real traffic patterns.
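
One common way to release a fix safely is a canary gate: send a small share of real traffic to the new version and only promote it once it has seen enough requests without regressing. The sketch below shows such a decision rule. It is a generic pattern, not the release process described in the talk, and the `ReleaseStats` type, the `canary_decision` function, and the thresholds are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStats:
    requests: int
    failures: int  # e.g. malformed tool calls or leaked reasoning detected

    @property
    def error_rate(self) -> float:
        return self.failures / self.requests if self.requests else 0.0


def canary_decision(
    baseline: ReleaseStats,
    canary: ReleaseStats,
    min_requests: int = 500,       # hypothetical minimum sample size
    max_regression: float = 0.01,  # hypothetical allowed error-rate increase
) -> str:
    """Return 'hold', 'rollback', or 'promote' for a canary release."""
    if canary.requests < min_requests:
        # Not enough real traffic yet to judge the new version.
        return "hold"
    if canary.error_rate > baseline.error_rate + max_regression:
        # The canary is measurably worse than the current release.
        return "rollback"
    return "promote"
```

Because some failures only appear in long-running sessions, a gate like this is only as good as the traffic it observes, so the minimum sample size and observation window need to reflect real session lengths.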

8. Sovereignty Requires Infrastructure Ownership

“We want to become silicon agnostic and schedule workloads dynamically across Europe.” — Matthijs Kok
“Once companies move to open models, they become infrastructure companies.” — Lucas Ferreira

Sovereign AI increasingly depends on:

  • infrastructure ownership

  • interoperability

  • hardware independence

  • and operational control

This is becoming strategically important for European AI systems.

Patterns Across Talks

Across speakers, several themes consistently emerged:

  • Operational complexity grows faster than model capability

  • Orchestration layers increasingly define system quality

  • Optimization is becoming more important than raw model performance

  • Open models expose hidden infrastructure realities

  • AI infrastructure increasingly resembles cloud infrastructure engineering

Related Insights

- Sovereign AI in Production | Stockholm MLOps #31
- On-Prem MLOps with Lenovo & Red Hat | Stockholm MLOps #28
- AI Infrastructure & Operational Scale | Stockholm MLOps #27

What This Means for AI Infrastructure

Inference is no longer just a model problem.

It is now:

  • a distributed systems problem

  • an orchestration problem

  • an economics problem

  • and an infrastructure ownership problem

The teams operating AI systems at scale are increasingly becoming:
👉 infrastructure companies.

Join the Community

👉 Full event details: https://www.meetup.com/stockholm-mlops-community/events/306749592/
👉 Explore all events: https://www.meetup.com/stockholm-mlops-community/

Event Details

  • Location: AI Sweden, Stockholm

  • Date: March 12, 2026

  • Speakers: evroc, Inceptron, Opper, AMD Silo AI

  • Topics: inference optimization, AI infrastructure, MLOps, open models, orchestration

  • Event: Stockholm MLOps #29
