
Google Introduces Flex and Priority Inference Tiers to the Gemini API — A Cost-Reliability Trade-Off That Enterprise AI Has Needed

Google has added two new inference modes to the Gemini API: Flex inference (lower cost, best-effort latency) and Priority inference (guaranteed performance, higher cost). The tiered approach mirrors what cloud compute has offered for a decade and finally gives enterprise AI teams a principled way to optimize cost versus performance across workloads.

D.O.T.S AI Newsroom


AI News Desk


Google has launched two new inference modes for the Gemini API — Flex inference and Priority inference — giving developers a structured way to trade off cost against guaranteed performance, according to the Google AI Blog.

What the Tiers Mean

Flex inference is a best-effort tier: Google will serve requests at reduced cost, but with variable latency and without guaranteed throughput. The model is the same; what changes is the resource allocation priority. For batch workloads, asynchronous processing, offline document analysis, or any use case where a few extra seconds of latency is acceptable, Flex provides meaningfully lower costs on what is already one of the highest-performing model APIs available.

Priority inference is the inverse: guaranteed performance, consistent low latency, and reserved capacity — at a higher price point. This is the tier for real-time customer-facing applications, latency-sensitive pipelines, and production systems where unpredictable response times create user experience problems.
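The Gemini API reference should be consulted for the actual request shape; as an illustration of the cost logic only, here is a minimal sketch. The `InferenceRequest` class, the `tier` field, and the cost multipliers below are all hypothetical stand-ins, not the real google-genai API or Google's published pricing:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a tiered Gemini request; the real request
# shape and pricing live in the Gemini API reference. "flex" and
# "priority" model the two tiers described above.
@dataclass
class InferenceRequest:
    model: str
    prompt: str
    tier: str  # "flex" (best-effort, cheaper) or "priority" (reserved, pricier)

def estimated_relative_cost(req: InferenceRequest) -> float:
    """Illustrative relative cost multipliers (assumed, not published rates)."""
    multipliers = {"flex": 0.5, "priority": 1.5}
    if req.tier not in multipliers:
        raise ValueError(f"unknown tier: {req.tier}")
    return multipliers[req.tier]

# Same model in both cases; only the resource allocation priority differs.
batch_job = InferenceRequest("gemini", "Summarize these tickets...", tier="flex")
chat_turn = InferenceRequest("gemini", "Hello!", tier="priority")
```

The point of the sketch is that the trade-off is per-request, not per-model: the workload, not the model choice, determines which tier a call should use.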

Why This Matters

The tiered inference model is borrowed directly from cloud computing — AWS Spot instances versus On-Demand, Google Preemptible VMs versus Standard — and it solves the same problem: most production AI workloads are actually a mix of latency-sensitive and latency-tolerant tasks, but API products historically offered only one price point for all of them.

For enterprise AI teams building on Gemini, the practical implication is significant. Consider a typical enterprise pipeline: real-time chat responses need Priority inference to stay under 2-second response targets, but the same system's nightly batch summarization of support tickets is perfectly suited to Flex inference at meaningfully lower cost. Previously, both workloads paid the same rate. Now they don't have to.
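The routing decision in a pipeline like this reduces to a simple policy. A minimal sketch, where the 2-second threshold comes from the response target above but the policy itself is an assumption for illustration, not Google's guidance:

```python
def choose_tier(latency_budget_seconds: float, user_facing: bool) -> str:
    """Pick an inference tier for a workload.

    Illustrative policy: anything user-facing or with a tight latency
    budget goes to Priority; latency-tolerant batch work goes to Flex
    to capture the discount.
    """
    if user_facing or latency_budget_seconds <= 2.0:
        return "priority"
    return "flex"

# Real-time chat must stay under a 2-second response target -> Priority.
assert choose_tier(latency_budget_seconds=2.0, user_facing=True) == "priority"
# Nightly ticket summarization tolerates hours of latency -> Flex.
assert choose_tier(latency_budget_seconds=3600, user_facing=False) == "flex"
```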

Competitive Context

This move aligns Google with how sophisticated infrastructure customers already think about resource cost management. Anthropic does not currently offer tiered inference on the Claude API. OpenAI offers batch processing at a discount but no equivalent real-time priority tier. Google's two-tier offering is currently the most explicit approach to this trade-off among the major frontier API providers.

Both tiers are available now in the Gemini API, with documentation available via Google AI Studio and the Gemini API reference.


Related Stories

Astropad's Workbench Turns a Mac Mini Into an AI Agent Server You Control From Your Phone

Astropad, the company behind the Luna Display hardware that lets iPads function as Mac monitors, has built a new product for a new era: Workbench lets users remotely monitor and control AI agents running on Mac Minis from an iPhone or iPad. It is remote desktop software reimagined not for IT support but for the AI agent operator — the person who needs to check on autonomous workflows without being at their desk.

Microsoft's Bing Team Open-Sources Harrier, a Multilingual Embedding Model That Tops the MTEB v2 Benchmark

Microsoft's Bing search team has released Harrier as an open-source embedding model, and it tops the multilingual MTEB v2 benchmark while supporting over 100 languages. The release is significant not just for the benchmark numbers but for the source: a search team that has spent decades optimizing retrieval systems has built an embedding model for the exact use case — semantic search and retrieval — that underpins most production RAG applications.

Stability AI Pivots to Enterprise With Brand Studio — a Platform for Brand-Consistent AI Image Generation

Stability AI, the company that made open-source image generation mainstream with Stable Diffusion, is repositioning for enterprise with Brand Studio. The platform lets creative teams train brand-specific image models, automate visual production workflows, and route tasks to the best-suited AI model — a commercial play from a company that built its name on open access.
