ServiceNow's EVA Framework Exposes a Hidden Tradeoff: Voice AI Systems That Are Accurate Are Often Unpleasant to Talk To
ServiceNow AI has released EVA (Evaluating Voice Agents), the first end-to-end benchmark to jointly score voice agents on both task accuracy and conversational experience — and its initial results across 20 systems reveal a troubling pattern: the architectures that complete tasks reliably tend to deliver worse conversations, and vice versa.

D.O.T.S AI Newsroom
AI News Desk
ServiceNow AI has released EVA (Evaluating Voice Agents), an open evaluation framework for conversational voice agents that measures both task completion accuracy and conversational experience quality in complete, multi-turn spoken interactions. The framework fills a gap that has been widening as voice AI deployments proliferate: existing benchmarks evaluate speech-to-text accuracy, or response quality, or task completion — but never all three together in a realistic end-to-end scenario.
The Framework Architecture
EVA is built around a bot-to-bot testing architecture: an AI user simulator generates spoken interactions with the voice agent under test, using text-to-speech to produce realistic audio. The agent responds through its full pipeline (speech recognition, language model, text-to-speech). A tool executor handles backend database queries that the agent may call during the conversation. Validators check that conversations are complete and valid before two scoring modules apply the metrics.
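The framework's exact interfaces are not reproduced here, but a minimal sketch of how such a bot-to-bot loop can be wired looks something like the following. Every class and method name is hypothetical, not ServiceNow's actual API:

```python
# Illustrative sketch of EVA-style bot-to-bot testing; all names
# here are hypothetical stand-ins, not the framework's actual API.
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str     # "user" or "agent"
    audio: bytes     # synthesized speech for this turn
    transcript: str  # reference text, kept for later scoring

@dataclass
class Conversation:
    scenario_id: str
    turns: list[Turn] = field(default_factory=list)

def run_episode(simulator, agent, tool_executor, max_turns=20):
    """Drive one simulated call: the user bot speaks via TTS, the agent
    answers through its full pipeline, and tool calls hit a mock backend."""
    convo = Conversation(scenario_id=simulator.scenario_id)
    while len(convo.turns) < max_turns:
        user_audio, user_text = simulator.next_utterance(convo.turns)
        if user_audio is None:  # simulator decides the call is over
            break
        convo.turns.append(Turn("user", user_audio, user_text))
        # Full agent pipeline: speech recognition -> LLM (+ tools) -> TTS.
        agent_audio, agent_text = agent.respond(user_audio, tools=tool_executor)
        convo.turns.append(Turn("agent", agent_audio, agent_text))
    return convo  # validators then check completeness before scoring
```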
The metrics split into two dimensions. EVA-A (Accuracy) measures task completion, response faithfulness to instructions, and speech fidelity: whether the voice system accurately reproduced critical entities like flight numbers, codes, and monetary amounts in audio. EVA-X (Experience) scores conversational quality: conciseness appropriate for spoken delivery, conversation progression without stalling, and turn-taking timing.
The airline domain was chosen for the initial dataset (50 scenarios covering rebooking, cancellations, standby, and compensation) because it involves complex multi-step workflows with strict policy constraints and named entity handling.
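ServiceNow has not published the exact scoring schema in this summary, but a hypothetical per-conversation score record covering both dimensions might be structured like this. The field names and the simple averaging are assumptions for illustration, not EVA's actual schema:

```python
# Hypothetical per-conversation score record; field names and the
# simple averages are assumptions, not EVA's actual schema or weighting.
from dataclasses import dataclass

@dataclass
class EvaScore:
    # EVA-A (Accuracy)
    task_completion: float   # did the agent achieve the scenario goal?
    faithfulness: float      # did responses follow instructions and policy?
    speech_fidelity: float   # were entities (codes, amounts) spoken correctly?
    # EVA-X (Experience)
    conciseness: float       # appropriately brief for spoken delivery
    progression: float       # conversation advanced without stalling
    turn_taking: float       # timing of pauses and handoffs

    @property
    def eva_a(self) -> float:
        return (self.task_completion + self.faithfulness + self.speech_fidelity) / 3

    @property
    def eva_x(self) -> float:
        return (self.conciseness + self.progression + self.turn_taking) / 3
```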
The Key Finding: No System Dominates Both Dimensions
EVA evaluated 20 voice agent systems spanning proprietary and open-source implementations, covering both cascade architectures (speech-to-text → language model → text-to-speech) and audio-native models that operate on speech directly. The scatter plot of EVA-A versus EVA-X scores tells a clear story: systems that score highly on task accuracy tend to score lower on conversational experience, and systems with smooth, natural conversational flow tend to struggle with accuracy.
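The architectural distinction matters for where errors enter. A schematic contrast of the two families, with placeholder component names rather than any real API, makes the difference visible:

```python
# Schematic contrast of the two architecture families under test;
# asr, llm, tts, and audio_model are placeholders, not real APIs.

def cascade_respond(user_audio, asr, llm, tts, tools):
    """Cascade: each hop converts between audio and text, and each hop
    is a chance to corrupt an entity before the next stage sees it."""
    transcript = asr.transcribe(user_audio)         # "X7Q42" heard as "X7Q47"?
    reply_text = llm.generate(transcript, tools=tools)
    return tts.synthesize(reply_text)

def audio_native_respond(user_audio, audio_model, tools):
    """Audio-native: a single model maps speech to speech directly,
    with no intermediate transcript for entities to fall through."""
    return audio_model.respond(user_audio, tools=tools)
```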
The ServiceNow team identified several root causes. Named entity transcription errors, such as a single misheard digit in a confirmation code, cascade into authentication failures that task-completion metrics penalize heavily, even when the conversational handling was otherwise excellent. Multi-step workflows that must preserve ancillary services broke agents consistently regardless of architecture type. And systems optimized for accuracy tended to be more verbose and formulaic, exactly the qualities that degrade spoken experience.
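The first failure mode is easy to see in miniature. In the toy check below, which is illustrative rather than framework code, exact-match authentication has no tolerance for a single misheard character:

```python
# Toy illustration, not EVA code: a single misheard character in a
# confirmation code turns an otherwise perfect conversation into a failure.
def authenticate(heard_code: str, booking_code: str) -> bool:
    return heard_code.strip().upper() == booking_code.strip().upper()

print(authenticate("X7Q42", "X7Q42"))  # True:  code transcribed correctly
print(authenticate("X7Q47", "X7Q42"))  # False: one digit off, whole task fails
```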
Why This Matters for Enterprise Voice AI
The accuracy-experience tradeoff has direct business implications. A voice agent that completes 85% of banking inquiries correctly but sounds robotic and formulaic will generate customer complaints and escalations that offset its automation value. A voice agent with excellent conversational flow that mishandles 20% of transactions creates regulatory and reputational exposure. EVA makes this tradeoff measurable for the first time, enabling organizations deploying voice AI to understand the specific failure modes of their systems and optimize for the right dimension given their use case. The framework, dataset, and code are all available on GitHub and Hugging Face.