March 8, 2026 · 11 min read

LLM FinOps: what an AI system really costs in production

Between semantic caching, reranking, embedding tokens, and LLM calls, the real cost of a production RAG system always comes as a surprise. A framework for anticipating it.

The initial budget of a RAG project is almost always wrong. Not because teams are incompetent, but because the cost models available during the POC phase don't reflect the reality of a production system: real users, real volumes, and usage patterns nobody anticipated.

The surprise usually takes the same form: LLM inference costs land roughly in line with estimates, while peripheral costs (embedding, reranking, vector infrastructure, orchestration, monitoring) account for 40% to 80% of the total, depending on the architecture. And nobody had budgeted for the peripheral costs.
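To make that breakdown concrete, here is a minimal sketch of a per-query cost model in Python. Every unit price, infrastructure bill, and volume in it is a hypothetical placeholder rather than a vendor quote; the point is the shape of the calculation, not the specific numbers.

```python
# A minimal sketch of a per-query RAG cost model. Every unit price and
# bill below is a hypothetical placeholder, not a vendor quote;
# substitute your own contract rates.

EMBED_PER_1K_TOK = 0.0001    # embedding API, $/1K tokens (assumed)
RERANK_PER_CALL = 0.002      # reranker API, $/call (assumed)
LLM_IN_PER_1K_TOK = 0.003    # LLM input, $/1K tokens (assumed)
LLM_OUT_PER_1K_TOK = 0.015   # LLM output, $/1K tokens (assumed)
VECTOR_DB_MONTHLY = 600.0    # managed vector store, $/month (assumed)
MONITORING_MONTHLY = 200.0   # tracing and eval tooling, $/month (assumed)


def query_cost(query_tok: int, context_tok: int, output_tok: int,
               monthly_queries: int, cache_hit_rate: float = 0.0) -> dict:
    """Split the estimated cost of one RAG query into LLM vs peripheral."""
    # A semantic-cache hit skips the LLM call entirely, so only the
    # miss fraction of queries pays for inference.
    miss = 1.0 - cache_hit_rate
    llm = miss * ((query_tok + context_tok) / 1000 * LLM_IN_PER_1K_TOK
                  + output_tok / 1000 * LLM_OUT_PER_1K_TOK)
    peripheral = (
        query_tok / 1000 * EMBED_PER_1K_TOK   # embedding the query
        + RERANK_PER_CALL                     # one reranking pass
        # fixed infrastructure, amortized over the monthly query volume
        + (VECTOR_DB_MONTHLY + MONITORING_MONTHLY) / monthly_queries
    )
    total = llm + peripheral
    return {"llm": llm, "peripheral": peripheral,
            "peripheral_share": peripheral / total}


# Example: 50-token query, 3,000 tokens of retrieved context, 400-token
# answer, 100K queries per month, 30% semantic-cache hit rate.
print(query_cost(50, 3000, 400, monthly_queries=100_000, cache_hit_rate=0.3))
# -> peripheral_share ≈ 0.49: roughly half the bill never touches the LLM.
```

With these placeholder numbers, about half of each query's cost never touches the LLM; shrink the monthly volume or raise the cache hit rate and the peripheral share climbs toward the top of the 40% to 80% range.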

The full article is available on request. I regularly share analyses, field notes, and case studies with people who ask.

Request access
FinOps · LLM · RAG