Google Introduces Flex and Priority Inference Tiers to the Gemini API — A Cost-Reliability Trade-Off That Enterprise AI Has Needed
Google has added two new inference modes to the Gemini API: Flex inference (lower cost, best-effort latency) and Priority inference (guaranteed performance, higher cost). The tiered approach mirrors what cloud compute has offered for a decade and finally gives enterprise AI teams a principled way to optimize cost versus performance across workloads.

D.O.T.S AI Newsroom
AI News Desk
Google has launched two new inference modes for the Gemini API — Flex inference and Priority inference — giving developers a structured way to trade off cost against guaranteed performance, according to the Google AI Blog.
What the Tiers Mean
Flex inference is a best-effort tier: Google serves requests at reduced cost, but with variable latency and no guaranteed throughput. The model itself is unchanged; what changes is the priority of resource allocation. For batch workloads, asynchronous processing, offline document analysis, or any use case where a few extra seconds of latency are acceptable, Flex delivers meaningfully lower costs on the same underlying models.
Priority inference is the inverse: guaranteed performance, consistent low latency, and reserved capacity — at a higher price point. This is the tier for real-time customer-facing applications, latency-sensitive pipelines, and production systems where unpredictable response times create user experience problems.
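For teams evaluating the split, the request-level mechanics are straightforward. The sketch below calls the Gemini generateContent REST endpoint from Python; the endpoint, headers, and response shape are standard, but the service_tier field and its "flex"/"priority" values are assumed names for illustration only. The actual tier parameter is documented in the Gemini API reference.

    # Minimal sketch of per-request tier selection against the Gemini REST API.
    # NOTE: "service_tier" is an assumed field name for illustration; check the
    # Gemini API reference for the real tier parameter.
    import os
    import requests

    API_KEY = os.environ["GEMINI_API_KEY"]
    URL = (
        "https://generativelanguage.googleapis.com/v1beta/"
        "models/gemini-2.5-flash:generateContent"
    )

    def generate(prompt: str, tier: str) -> str:
        """Send one generateContent request at the given inference tier."""
        body = {
            "contents": [{"parts": [{"text": prompt}]}],
            "service_tier": tier,  # hypothetical: "flex" or "priority"
        }
        resp = requests.post(
            URL,
            headers={"x-goog-api-key": API_KEY},
            json=body,
            timeout=120,  # Flex requests may queue, so budget generous timeouts
        )
        resp.raise_for_status()
        return resp.json()["candidates"][0]["content"]["parts"][0]["text"]

    # Latency-tolerant work goes to Flex; user-facing work goes to Priority.
    summary = generate("Summarize this support ticket: ...", tier="flex")
    answer = generate("Answer the customer's question: ...", tier="priority")

The generous timeout reflects the tier semantics: a Flex request may sit in a queue behind higher-priority traffic, so callers should plan for variable latency rather than assume Priority-grade response times.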
Why This Matters
The tiered inference model is borrowed directly from cloud computing — AWS Spot versus On-Demand instances, Google Cloud's Preemptible versus standard VMs — and it solves the same problem: most production AI workloads are a mix of latency-sensitive and latency-tolerant tasks, yet model APIs have historically offered a single price point for all of them.
For enterprise AI teams building on Gemini, the practical implication is significant. Consider a typical enterprise pipeline: real-time chat responses need Priority inference to stay under 2-second response targets, but the same system's nightly batch summarization of support tickets is perfectly suited to Flex inference at meaningfully lower cost. Previously, both workloads paid the same rate. Now they don't have to.
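In practice, that split can live in a single routing table rather than scattered across call sites. A minimal sketch, assuming a generate(prompt, tier) wrapper like the hypothetical one above:

    # Sketch of workload-based tier routing for a mixed pipeline: real-time
    # chat stays on Priority, nightly batch jobs drop to Flex.
    from enum import Enum
    from typing import Callable

    class Workload(Enum):
        REALTIME_CHAT = "realtime_chat"        # user-facing, sub-2-second target
        BATCH_SUMMARIZATION = "batch_summary"  # nightly, latency-tolerant
        OFFLINE_ANALYSIS = "offline_analysis"  # ad hoc document processing

    # One place to encode the cost/latency policy for the whole pipeline.
    TIER_BY_WORKLOAD = {
        Workload.REALTIME_CHAT: "priority",
        Workload.BATCH_SUMMARIZATION: "flex",
        Workload.OFFLINE_ANALYSIS: "flex",
    }

    def route(workload: Workload, prompt: str,
              generate: Callable[..., str]) -> str:
        # Dispatch the prompt at whatever tier its workload class calls for.
        return generate(prompt, tier=TIER_BY_WORKLOAD[workload])

Centralizing the mapping means a pricing change or a new tier becomes a one-line policy edit instead of a sweep through every call site.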
Competitive Context
This move aligns Google more closely with how sophisticated infrastructure customers already think about resource cost management. Anthropic does not currently offer tiered inference on the Claude API. OpenAI offers batch processing at a discount but no equivalent real-time priority tier. Among the major frontier API providers, Google's two-tier model is currently the most explicit treatment of this trade-off.
Both tiers are available now in the Gemini API; documentation is available through Google AI Studio and the Gemini API reference.