Inference Optimization in Production | Stockholm MLOps #29 | March 12, 2026

Insights from the Stockholm MLOps community

“Scaling serverless inference is less about GPUs and more about what breaks when you add them.” — Matthijs Kok, evroc

That line captured the core theme of the evening.

At Stockholm MLOps #29, engineers and founders shared what actually happens when AI systems move from experimentation into real production environments.

Not demos.
Not benchmarks.
But the operational realities of running inference systems at scale.

The event focused on:

- inference optimization
- model interoperability
- infrastructure ownership
- open vs closed models
- orchestration
- sovereignty in AI infrastructure

Summary — Key Insights from the Event

- Most inference problems emerge outside the GPU
- Open models shift operational complexity to the operator
- The optimization layer is becoming the real infrastructure moat
- Model orchestration is replacing single-model thinking
- Inference economics become dominant at production scale

What This Event Was Really About

As AI systems scale, the bottleneck increasingly shifts away from the models themselves.

Instead, teams are now dealing with:

- orchestration complexity
- parser failures
- cache hierarchies
- deployment safety
- GPU scheduling
- infrastructure economics

“Inference infrastructure is becoming a distributed systems problem.” — Matthijs Kok, evroc

The implication:

Production AI is becoming less about prompts and more about systems engineering.

Key Insights from the Event

1. Inference Scaling Breaks Everywhere Except the GPU

“We discovered a bug that corrupted live agentic sessions only after running longer production workloads.” — Matthijs Kok, evroc
“Instead of returning structured tool calls, the system leaked reasoning tokens directly into the client.” — Matthijs Kok

The hardest problems in inference increasingly emerge outside the models themselves:

- in orchestration layers
- in parsers
- in release systems
- in runtime coordination

2. The Optimization Layer Is Becoming the Real Moat

“A huge amount of inference optimization now revolves around KV cache hierarchy design.” — Lucas Ferreira, Inceptron
“The optimization layer — not the model — is becoming the moat.” — Lucas Ferreira

As open models become more accessible, competitive advantage increasingly comes from:

- runtime optimization
- compiler infrastructure
- hardware-aware scheduling
- cache management
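
To make the idea of a cache hierarchy concrete, here is a minimal sketch of a two-tier LRU cache in Python. It is not how any speaker's system works: the class name `TwoTierCache`, the tier names, and the use of plain dictionaries in place of GPU and host memory are all assumptions made for illustration. It only shows the basic mechanic of demoting cold entries to a slower tier instead of discarding them, and promoting them again on reuse.

```python
from collections import OrderedDict


class TwoTierCache:
    """Hypothetical LRU cache: a small fast tier that demotes into a larger slow tier."""

    def __init__(self, fast_capacity: int, slow_capacity: int):
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.fast = OrderedDict()  # stands in for GPU memory (assumption)
        self.slow = OrderedDict()  # stands in for host memory or disk (assumption)

    def put(self, prefix_key, kv_block):
        # A key lives in exactly one tier, so drop any stale copy first.
        self.slow.pop(prefix_key, None)
        self.fast[prefix_key] = kv_block
        self.fast.move_to_end(prefix_key)
        while len(self.fast) > self.fast_capacity:
            # Demote the least recently used block instead of dropping it.
            old_key, old_block = self.fast.popitem(last=False)
            self.slow[old_key] = old_block
            while len(self.slow) > self.slow_capacity:
                # The slow tier is full too: the oldest block is finally lost.
                self.slow.popitem(last=False)

    def get(self, prefix_key):
        if prefix_key in self.fast:
            self.fast.move_to_end(prefix_key)
            return self.fast[prefix_key]
        if prefix_key in self.slow:
            # Promote on hit so reused prefixes migrate back to the fast tier.
            block = self.slow.pop(prefix_key)
            self.put(prefix_key, block)
            return block
        return None  # miss: the caller would have to recompute the block
```

A real serving stack would also have to account for block sizes, transfer costs between tiers, and invalidation, none of which this sketch models.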

3. Open Models Turn Companies Into Infrastructure Operators

“Companies prototype with closed models. Then the inference bill arrives.” — Lucas Ferreira
“Running open models in production means owning the entire MLOps stack.” — Lucas Ferreira

Open models provide:

- lower cost
- more flexibility
- sovereignty
- infrastructure control

But they also expose:

- deployment complexity
- observability problems
- scaling challenges
- operational burden

4. Model Orchestration Beats Single-Model Thinking

“You don’t have to use just one model.” — Göran Sandahl, Opper
“Three models receiving identical prompts produced completely different outputs.” — Göran Sandahl

Several speakers highlighted a shift toward:

- multi-model orchestration
- routing layers
- interoperability
- ensemble-style reasoning

The future increasingly looks like:
👉 systems of models rather than dependence on a single provider.
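
As a rough illustration of what a system of models can look like, the sketch below sends one prompt to several models and picks the most common answer. It assumes each model is just a Python callable from prompt to text. The names `fan_out` and `pick_answer` are hypothetical, and the lambdas in the usage example stand in for real provider clients or self-hosted endpoints.

```python
from collections import Counter
from typing import Callable, Dict

# A "model" here is any callable from prompt to text (an assumption for
# illustration; in practice each would wrap a provider or a local endpoint).
Model = Callable[[str], str]


def fan_out(prompt: str, models: Dict[str, Model]) -> Dict[str, str]:
    """Send the same prompt to every model and collect answers by model name."""
    answers = {}
    for name, model in models.items():
        try:
            answers[name] = model(prompt)
        except Exception:
            # One failing model should not take down the whole request.
            continue
    return answers


def pick_answer(answers: Dict[str, str]) -> str:
    """Return the most common normalized answer; ties go to the first one seen."""
    if not answers:
        raise RuntimeError("no model produced an answer")
    counts = Counter(a.strip().lower() for a in answers.values())
    return counts.most_common(1)[0][0]


# Usage with stand-in models:
stub_models = {
    "model_a": lambda p: "Paris",
    "model_b": lambda p: " paris ",
    "model_c": lambda p: "Lyon",
}
chosen = pick_answer(fan_out("What is the capital of France?", stub_models))
```

Majority voting is only one possible policy. A routing layer could just as well pick a single model per request based on cost, latency, or task type.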

5. Production AI Is Mostly Systems Engineering

“Every change to the inference stack affects correctness, throughput, and economics.” — Matthijs Kok
“Fine-tuning, deployment, scaling, observability, routing — it all becomes your problem.” — Lucas Ferreira

Across talks, one pattern became clear:

Production AI systems increasingly resemble distributed infrastructure systems.

The operational layer now includes:

- release engineering
- runtime orchestration
- deployment safety
- cache invalidation
- GPU utilization optimization
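
One way to treat a change to the inference stack as a question of correctness, throughput, and economics at once is a canary gate that checks all three before promoting a release. The following is a minimal sketch, not a description of any speaker's release system. The `Metrics` fields, the `canary_decision` function, and the default thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float               # fraction of failed or malformed responses
    tokens_per_second: float        # throughput
    cost_per_million_tokens: float  # serving cost


def canary_decision(baseline: Metrics, canary: Metrics,
                    max_error_increase: float = 0.001,
                    max_throughput_drop: float = 0.05,
                    max_cost_increase: float = 0.05) -> str:
    """Return "promote" or "rollback" by comparing a canary against the baseline.

    All thresholds are illustrative defaults, not recommendations.
    """
    # Correctness: absolute increase in error rate.
    if canary.error_rate > baseline.error_rate + max_error_increase:
        return "rollback"
    # Throughput: relative drop in tokens per second.
    if canary.tokens_per_second < baseline.tokens_per_second * (1 - max_throughput_drop):
        return "rollback"
    # Economics: relative increase in cost per million tokens.
    if canary.cost_per_million_tokens > baseline.cost_per_million_tokens * (1 + max_cost_increase):
        return "rollback"
    return "promote"
```

Note that a faster, cheaper canary is still rolled back here if its error rate regresses. A gate like this is also only as good as the traffic the canary sees, so short canary windows can miss failures that need long-running sessions to appear.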

6. Economics Become Dominant at Scale

“Open-source models are still behind frontier closed models — but they’re 10 to 100 times cheaper.” — Lucas Ferreira
“The CEO asks how much revenue a GPU can generate. Engineers answer: it depends.” — Matthijs Kok

At scale, the following become central engineering constraints:

- inference cost
- GPU utilization
- cache efficiency
- infrastructure economics
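
The "it depends" answer can be made slightly more concrete with a back-of-the-envelope calculation. The sketch below assumes a simple pricing model where revenue is proportional to tokens served. The function name and every number in the usage example are hypothetical and not figures from the talks.

```python
def revenue_per_gpu_hour(tokens_per_second: float,
                         utilization: float,
                         price_per_million_tokens: float) -> float:
    """Estimate revenue for one GPU over one hour.

    tokens_per_second: throughput while the GPU is actually serving traffic
    utilization: fraction of the hour spent serving, between 0 and 1
    price_per_million_tokens: what the customer pays per million tokens
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return tokens_per_hour / 1_000_000 * price_per_million_tokens


# Same hardware and price, different utilization (all numbers are made up):
busy = revenue_per_gpu_hour(2000, 0.5, 1.0)
idle = revenue_per_gpu_hour(2000, 0.25, 1.0)
```

With these made-up inputs, halving utilization from 50% to 25% halves hourly revenue from 3.60 to 1.80 in whatever currency the price is quoted in. That is why throughput, batching, and cache efficiency show up as economic questions and not only as engineering ones.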

7. Reliability Requires Defensive Infrastructure

“The parser never detected the end of the reasoning block, so the tool calls never started.” — Matthijs Kok
“Fixing the bug was only half the problem. Releasing the fix safely was the harder challenge.” — Matthijs Kok

Reliability increasingly depends on:

- defensive parsing
- release automation
- rollback safety
- observability under real workloads

Production failures often emerge only after long-running sessions and real traffic patterns.
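
The parser failure described in the quotes above suggests one defensive pattern: fail closed when the reasoning block never terminates, and do not forward raw text to the client. The sketch below is a minimal illustration of that idea, not the fix that was actually shipped. The `<think>` delimiters, the `ParsedOutput` type, and the assumption that tool calls arrive as a JSON list are all hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical delimiters; real models and serving stacks use their own markers.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


@dataclass
class ParsedOutput:
    tool_calls: list = field(default_factory=list)
    reasoning: str = ""
    error: Optional[str] = None  # set when the output could not be parsed safely


def parse_model_output(raw: str) -> ParsedOutput:
    """Split raw model text into reasoning and tool calls, failing closed."""
    body = raw
    reasoning = ""
    if THINK_OPEN in raw:
        start = raw.index(THINK_OPEN) + len(THINK_OPEN)
        end = raw.find(THINK_CLOSE, start)
        if end == -1:
            # The reasoning block never closed. Report an error instead of
            # forwarding the raw text, so reasoning tokens are not leaked.
            return ParsedOutput(error="unterminated reasoning block")
        reasoning = raw[start:end].strip()
        body = raw[end + len(THINK_CLOSE):]
    try:
        calls = json.loads(body)
    except json.JSONDecodeError:
        return ParsedOutput(reasoning=reasoning, error="tool calls are not valid JSON")
    if not isinstance(calls, list):
        return ParsedOutput(reasoning=reasoning, error="expected a JSON list of tool calls")
    return ParsedOutput(tool_calls=calls, reasoning=reasoning)
```

The design choice here is that every failure path returns an explicit error and an empty list of tool calls, so the caller can retry or surface a clean failure. A streaming parser would need the same guarantee per chunk, which this non-streaming sketch does not cover.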

8. Sovereignty Requires Infrastructure Ownership

“We want to become silicon agnostic and schedule workloads dynamically across Europe.” — Matthijs Kok
“Once companies move to open models, they become infrastructure companies.” — Lucas Ferreira

Sovereign AI increasingly depends on:

- infrastructure ownership
- interoperability
- hardware independence
- operational control

This is becoming strategically important for European AI systems.

Patterns Across Talks

Across speakers, several themes consistently emerged:

- Operational complexity grows faster than model capability
- Orchestration layers increasingly define system quality
- Optimization is becoming more important than raw model performance
- Open models expose hidden infrastructure realities
- AI infrastructure increasingly resembles cloud infrastructure engineering

Related Insights

- Sovereign AI in Production | Stockholm MLOps #31
- On-Prem MLOps with Lenovo & Red Hat | Stockholm MLOps #28
- AI Infrastructure & Operational Scale | Stockholm MLOps #27

What This Means for AI Infrastructure

Inference is no longer just a model problem.

It is now:

- a distributed systems problem
- an orchestration problem
- an economics problem
- an infrastructure ownership problem

The teams operating AI systems at scale are increasingly becoming:
👉 infrastructure companies.

Join the Community

👉 Full event details: https://www.meetup.com/stockholm-mlops-community/events/306749592/
👉 Explore all events: https://www.meetup.com/stockholm-mlops-community/

Event Details

Location: AI Sweden, Stockholm
Date: March 12, 2026
Speakers: evroc, Inceptron, Opper, AMD Silo AI
Topics: inference optimization, AI infrastructure, MLOps, open models, orchestration
Event: Stockholm MLOps #29

Tags: Inference Optimization, MLOps, AI in Production, AI Infrastructure, Open Models, Stockholm MLOps

Jonas Laeben