Tools

ServiceNow's EVA Framework Exposes a Hidden Tradeoff: Voice AI Systems That Are Accurate Are Often Unpleasant to Talk To

ServiceNow AI has released EVA (Evaluating Voice Agents), the first end-to-end benchmark to jointly score voice agents on both task accuracy and conversational experience — and its initial results across 20 systems reveal a troubling pattern: the architectures that complete tasks reliably tend to deliver worse conversations, and vice versa.

D.O.T.S AI Newsroom

AI News Desk

3 min read
ServiceNow AI has released EVA (Evaluating Voice Agents), an open evaluation framework for conversational voice agents that measures both task completion accuracy and conversational experience quality in complete, multi-turn spoken interactions. The framework fills a gap that has been widening as voice AI deployments proliferate: existing benchmarks evaluate speech-to-text accuracy, or response quality, or task completion — but never all three together in a realistic end-to-end scenario.

The Framework Architecture

EVA is built around a bot-to-bot testing architecture: an AI user simulator generates spoken interactions with the voice agent under test, using text-to-speech to produce realistic audio. The agent responds through its full pipeline (speech recognition, language model, text-to-speech). A tool executor handles backend database queries that the agent may call during the conversation. Validators check that conversations are complete and valid before two scoring modules apply the metrics.
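The loop described above can be sketched roughly as follows. All class and method names here (`next_utterance`, `respond`, `execute`) are illustrative assumptions for the purpose of the sketch, not EVA's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str   # "user" (simulator) or "agent" (system under test)
    text: str      # transcript of the synthesized or recognized audio

@dataclass
class Conversation:
    turns: list = field(default_factory=list)
    complete: bool = False

def run_episode(simulator, agent, tool_executor, max_turns=20):
    """One bot-to-bot episode: the user simulator speaks, the agent
    responds through its full pipeline, and any tool calls it makes
    mid-conversation are executed against the backend."""
    convo = Conversation()
    for _ in range(max_turns):
        user_text = simulator.next_utterance(convo)
        if user_text is None:          # simulator decides the task is done
            convo.complete = True
            break
        convo.turns.append(Turn("user", user_text))
        reply, tool_calls = agent.respond(user_text)
        for call in tool_calls:        # backend database queries
            tool_executor.execute(call)
        convo.turns.append(Turn("agent", reply))
    return convo
```

In EVA's design the validators would then check `complete` (and other well-formedness conditions) before the two scoring modules run.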

The metrics split into two dimensions. EVA-A (Accuracy) measures task completion, response faithfulness to instructions, and speech fidelity — whether the voice system accurately reproduced critical entities like flight numbers, codes, and monetary amounts in audio. EVA-X (Experience) scores conversational quality: conciseness appropriate for spoken delivery, conversation progression without stalling, and turn-taking timing. The airline domain was chosen for the initial dataset — 50 scenarios covering rebooking, cancellations, standby, and compensation — because it involves complex multi-step workflows with strict policy constraints and named entity handling.
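Assuming each sub-metric is normalized to [0, 1], the two dimensions could be aggregated along these lines. The sub-metric names follow the article; the equal weighting is an assumption for illustration, not EVA's published formula.

```python
def eva_scores(metrics: dict) -> tuple:
    """Aggregate per-conversation sub-metrics (each in [0, 1]) into the
    two EVA dimensions. Equal weighting is assumed for illustration."""
    accuracy_keys = ("task_completion", "faithfulness", "speech_fidelity")
    experience_keys = ("conciseness", "progression", "turn_taking")
    eva_a = sum(metrics[k] for k in accuracy_keys) / len(accuracy_keys)
    eva_x = sum(metrics[k] for k in experience_keys) / len(experience_keys)
    return eva_a, eva_x
```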

The Key Finding: No System Dominates Both Dimensions

EVA evaluated 20 voice agent systems spanning both proprietary and open-source implementations, and both cascade architectures (speech-to-text → language model → text-to-speech) and audio-native models. The scatter plot of EVA-A versus EVA-X scores tells a clear story: systems that score highly on task accuracy tend to score lower on conversational experience, and systems with smooth, natural conversational flow tend to struggle with accuracy.
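One way to make "no system dominates both dimensions" concrete is to compute which systems are Pareto-optimal on (EVA-A, EVA-X): if a single system dominated both axes, the frontier would collapse to one point. The system names and scores below are invented for illustration.

```python
def pareto_frontier(systems: dict) -> set:
    """Return the names of systems not dominated on both axes.
    A system is dominated if some other system scores at least as well
    on both EVA-A and EVA-X, and strictly better on at least one."""
    front = set()
    for name, (a, x) in systems.items():
        dominated = any(
            (a2 >= a and x2 >= x) and (a2 > a or x2 > x)
            for other, (a2, x2) in systems.items() if other != name
        )
        if not dominated:
            front.add(name)
    return front
```

A frontier with several members, each strong on one axis and weak on the other, is exactly the scatter-plot pattern the article describes.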

The ServiceNow team identified several root causes. Named entity transcription errors — a single misheard digit in a confirmation code, for instance — cascade into authentication failures that task-completion metrics penalize heavily, even when the conversational handling was otherwise excellent. Multi-step workflows involving preserved ancillary services broke agents consistently regardless of architecture type. And systems optimized for accuracy tended to be more verbose and formulaic — exactly the qualities that degrade the spoken experience.
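The all-or-nothing character of that first failure mode, where one misheard digit counts as a full miss, can be illustrated with a simple exact-match entity check. The regex patterns and matching rule here are assumptions for the sketch, not EVA's implementation.

```python
import re

# Illustrative patterns: flight numbers like "UA1234" and 6-digit codes.
DEFAULT_PATTERNS = (r"\b[A-Z]{2}\d{3,4}\b", r"\b\d{6}\b")

def entity_fidelity(reference: str, transcript: str,
                    patterns=DEFAULT_PATTERNS) -> float:
    """Fraction of critical entities in the reference that appear
    verbatim in the transcript. Exact string match: a single wrong
    digit scores the entity as a complete miss."""
    entities = [m for p in patterns for m in re.findall(p, reference)]
    if not entities:
        return 1.0
    found = sum(1 for e in entities if e in transcript)
    return found / len(entities)
```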

Why This Matters for Enterprise Voice AI

The accuracy-experience tradeoff has direct business implications. A voice agent that completes 85% of banking inquiries correctly but sounds robotic and formulaic will generate customer complaints and escalations that offset its automation value. A voice agent with excellent conversational flow that mishandles 20% of transactions creates regulatory and reputational exposure. EVA makes this tradeoff measurable for the first time, enabling organizations deploying voice AI to understand the specific failure modes of their systems and optimize for the right dimension given their use case. The framework, dataset, and code are all available on GitHub and Hugging Face.

Related Stories

Tools

Astropad's Workbench Turns a Mac Mini Into an AI Agent Server You Control From Your Phone

Astropad, the company behind the Luna Display hardware that lets iPads function as Mac monitors, has built a new product for a new era: Workbench lets users remotely monitor and control AI agents running on Mac Minis from an iPhone or iPad. It is remote desktop software reimagined not for IT support but for the AI agent operator — the person who needs to check on autonomous workflows without being at their desk.

D.O.T.S AI Newsroom
Tools

Microsoft's Bing Team Open-Sources Harrier, a Multilingual Embedding Model That Tops the MTEB v2 Benchmark

Microsoft's Bing search team has released Harrier as an open-source embedding model, and it tops the multilingual MTEB v2 benchmark while supporting over 100 languages. The release is significant not just for the benchmark numbers but for the source: a search team that has spent decades optimizing retrieval systems has built an embedding model for the exact use case — semantic search and retrieval — that underpins most production RAG applications.

D.O.T.S AI Newsroom
Tools

Stability AI Pivots to Enterprise With Brand Studio — a Platform for Brand-Consistent AI Image Generation

Stability AI, the company that made open-source image generation mainstream with Stable Diffusion, is repositioning for enterprise with Brand Studio. The platform lets creative teams train brand-specific image models, automate visual production workflows, and route tasks to the best-suited AI model — a commercial play from a company that built its name on open access.

D.O.T.S AI Newsroom