The Attack Your Voice Agent Cannot Be Trained Out Of

Your AI Voice Agent Passed the Security Audit. That Does Not Mean It Is Secure.

A caller reaches your bank's AI voice agent at 11 p.m. They are patient, methodical, and spend eight turns building rapport before asking the agent to expedite a transfer. The agent complies. No authentication record flags it. No human reviewed the transcript. The access control policy was configured correctly. The database was locked down. None of that mattered, because the attack was not aimed at the database. It was aimed at the model.

This is the gap that researchers at Fordham University and IBM Research mapped in February 2026. Their framework, AEGIS, is the first systematic red-teaming evaluation of AI voice agents as they are actually deployed, end-to-end, with live authentication workflows, backend databases, function calls, and multi-turn adversarial conversations across seven production-grade models. The headline finding is not that voice agents are insecure. Most security teams already suspect that. The finding is more specific and more uncomfortable: the attack category that survives every access restriction you can implement succeeds between 44.8% and 71.2% of the time, and it does so because of how the model behaves, not because of what data it can reach.

What Every Prior Benchmark Got Wrong About Voice Agent Security

Before AEGIS, the research landscape on AI voice security was not empty. It was just aimed at the wrong target. AudioTrust, VoiceBench, AudioJailbreak, and a half-dozen related benchmarks evaluated audio language models as standalone systems. They ran single-turn prompts. They tested whether the model would generate harmful content. They measured jailbreak resistance against modified audio inputs with background noise or altered speech tempo.

None of them evaluated what happens when the model is the front door of a live system with a customer database behind it and an authenticated caller on the line.

The AEGIS authors tested all seven models across three deployment contexts, banking, logistics, and IT support, against five adversarial scenarios: authentication bypass, privacy leakage, privilege escalation, data poisoning, and resource abuse. Each model ran 250 adversarial interactions: 10 independent attempts per scenario, across five attacker personas, with a maximum of 10 conversational turns per attempt. An autonomous attack agent built on GPT-4o generated the adversarial dialogues, conditioning each turn on the full preceding conversation history. The evaluation was validated with a Grok-4.1 attack agent as an alternative, and vulnerability rankings across models were consistent regardless of which attacker was used.
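The 250 figure falls out of the design: five scenarios, five personas, ten attempts per pairing. Below is a minimal sketch of that grid for a single model. The identifiers (`SCENARIOS`, `PERSONAS`, `build_attempt_grid`) are illustrative names, not taken from the AEGIS code, and the reading of "10 attempts per scenario, across five personas" as ten attempts per scenario-persona pair is an assumption that happens to reproduce the stated total.

```python
from itertools import product

# Scenario and persona labels follow the article; the snake_case names are hypothetical.
SCENARIOS = [
    "authentication_bypass",
    "privacy_leakage",
    "privilege_escalation",
    "data_poisoning",
    "resource_abuse",
]
PERSONAS = [
    "impatient_customer",
    "friendly_manipulator",
    "technical_expert",
    "helpless_elder",
    "insider_pretender",
]
ATTEMPTS_PER_PAIR = 10  # assumed: independent attempts per scenario/persona pair
MAX_TURNS = 10          # conversational turn cap per attempt


def build_attempt_grid():
    """Enumerate every (scenario, persona, attempt) combination for one model."""
    return [
        {"scenario": s, "persona": p, "attempt": a, "max_turns": MAX_TURNS}
        for s, p, a in product(SCENARIOS, PERSONAS, range(1, ATTEMPTS_PER_PAIR + 1))
    ]


grid = build_attempt_grid()
print(len(grid))  # 5 scenarios x 5 personas x 10 attempts = 250
```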

This is empirical, not theoretical. The numbers below are attack success rates. Lower is better.

The Access Control Fix That Solves Half the Problem

The AEGIS team tested two database configurations against every model. In the first, agents had direct read access to raw customer records. In the second, they could only issue queries through an intermediary layer and receive aggregated results, with no direct visibility into the underlying data.

The effect on authentication bypass and privacy leakage was dramatic:

Attack Type	Direct Access (Worst Model)	Query-Based Access (All Models)
Authentication Bypass	Up to 18.0% (Qwen2 Audio)	0.0% across all seven models
Privacy Leakage	Up to 27.8% (Qwen2 Audio)	0.0% across all seven models
Resource Abuse	Up to 71.8% (Qwen2 Audio)	44.8%–71.2% (range across all seven models)

Query-based access is a meaningful, implementable security improvement. For identity and data-exfiltration risks, it drove attack success to zero in this study. For behavioural attacks, it does almost nothing.
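To make the distinction concrete, here is a minimal sketch of what a query-based intermediary might look like. It is not the AEGIS implementation, whose details are not given here. `QueryGateway`, its methods, the record layout, and the balance threshold are all hypothetical. The point is only that the agent receives narrow, coarse answers and never the record itself.

```python
import hmac
from dataclasses import dataclass

# Hypothetical in-memory store; a real deployment would sit behind a service boundary.
_RECORDS = {
    "cust-001": {"name": "A. Example", "balance": 1250.00, "pin": "4821"},
}


@dataclass(frozen=True)
class QueryResult:
    """Only coarse, pre-approved answers leave the intermediary layer."""
    answer: str


class QueryGateway:
    """Illustrative intermediary: the agent never sees raw records."""

    def verify_pin(self, customer_id: str, spoken_pin: str) -> QueryResult:
        record = _RECORDS.get(customer_id)
        # Constant-time comparison; unknown customers simply fail verification.
        ok = record is not None and hmac.compare_digest(
            record["pin"].encode(), spoken_pin.encode()
        )
        return QueryResult("verified" if ok else "not_verified")

    def balance_band(self, customer_id: str) -> QueryResult:
        record = _RECORDS.get(customer_id)
        if record is None:
            return QueryResult("unknown_customer")
        # Return a band rather than the figure itself (threshold is a placeholder).
        return QueryResult("above_1000" if record["balance"] >= 1000 else "below_1000")


gateway = QueryGateway()
print(gateway.verify_pin("cust-001", "0000").answer)  # not_verified
```

With a layer like this, a model that has been talked into "helping" still has nothing to hand over: the name, the PIN, and the exact balance never enter its context.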

Resource abuse success rates under query-based access range from 44.8% for the best-performing model to 71.2% for the worst. That gap between models matters, but even the best result means nearly half of all resource abuse attempts succeed against a fully locked-down deployment. Privilege escalation persists at up to 14.8% under query-based access. Data poisoning remains in the 7.2%–15.2% range regardless of access mode.

The root cause the authors identify is not a data access problem. It is a model compliance problem. Large language models are trained to be helpful. Adversaries exploit that compliance tendency directly, through conversational pressure, urgency, implied authority, and rapport-building, bypassing every policy that does not address behaviour at the model level.

Open-Weight Models Carry Materially Higher Risk

Wherever attacks succeeded at all, Qwen2 Audio and Qwen2.5-Omni, both open-weight models, returned the highest attack success rates. Gemini-2.5 Pro returned the lowest. The gap is not marginal.

Under direct database access, Qwen2 Audio's privilege escalation success rate is 27.8%, compared to 6.4% for Gemini-2.5 Pro. Its resource abuse rate is 71.8% versus 36.8%. GPT-4o and the Gemini 2.5 family cluster together in the middle-to-low range. Gemini-1.5 Pro underperforms both GPT-4o variants, a finding worth noting for any organisation still running the older Gemini generation in production.

The practical implication for any organisation choosing a backbone model on cost grounds is a trade-off. Open-weight models reduce licensing costs and increase infrastructure control. They also increase your attack surface by a measurable and significant margin in every scenario this study evaluated.

The authors are careful not to attribute the gap to a single cause. But the pattern is consistent across all five attack types, all three deployment domains, both database access modes, and both automated and human attackers.

How the Five Attack Types Actually Work in Deployment

Understanding what each attack category targets clarifies why some survive access restrictions and others do not:

Authentication Bypass: The attacker convinces the voice agent to proceed without completing proper identity verification. Depends on the agent having access to records it should be protecting. Query-based access removes the reward for success. Attack collapses to 0%.
Privacy Leakage: The attacker extracts specific personal or financial data about another customer. Depends on direct record visibility. Query-based access closes this entirely. Attack collapses to 0%.
Privilege Escalation: The attacker persuades the agent to perform actions beyond their authorised role, such as granting account access to a third party or approving a credit increase without verification. This is a behavioural attack. Access restrictions do not prevent the agent from being persuaded. Persists at up to 14.8% under query-based access.
Data Poisoning: The attacker feeds false information into the agent's working context, such as claiming a fraudulent address is the correct one on file, to corrupt subsequent decisions. Persists across access modes because it targets the model's reasoning, not its data access.
Resource Abuse: The attacker uses the voice agent for purposes outside its defined scope, running up call time, extracting operational information not tied to personal records, or consuming agent resources at scale. This is the most persistent category. It requires no data access. It requires only a compliant model.

The five personas used by the automated attack agent, the Impatient Customer, the Friendly Manipulator, the Technical Expert, the Helpless Elder, and the Insider Pretender, produced only modest differences in attack success rates. No single persona was dramatically more effective across all scenarios. Urgency and insider claims slightly elevated privilege escalation rates. Friendliness marginally boosted privacy leakage. The relative vulnerability rankings across models remained stable regardless of which persona was used, which suggests that model-level compliance tendencies drive outcomes more than attacker social engineering style.

What Happens When a Human Does the Attacking

The AEGIS framework was also run with three human participants conducting attacks directly, the most realistic proxy for a genuine adversary. The human-in-the-loop results carry a critical caveat: only three participants were used, the authors acknowledge this is insufficient for statistical conclusions, and they commit to larger-scale human evaluation in future work. Read these numbers as directional signals, not precise benchmarks.

With that caveat stated clearly: human attackers achieved higher authentication bypass rates than the automated attack agent across most models. On GPT-4o, the human result was 14.0% versus 10.4% automated. On Qwen2 Audio, 22.0% versus 18.0%. On Gemini-1.5 Pro, 20.4% versus 15.2%.

This pattern matters strategically. The automated evaluation likely underestimates authentication vulnerability in real deployments. When a sophisticated human attacker, not a script, is on the other end of the line, the numbers are worse. The gap between automated and human results is not large enough to change the overall vulnerability rankings across models, but it is large enough to suggest that the published attack success rates should be treated as a floor, not a ceiling, for authentication risks specifically.
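The size of the human-versus-automated gap is easy to compute from the three pairs of figures quoted above. A short sketch, using only those numbers (the variable names are illustrative):

```python
# Authentication bypass success rates (%) quoted above: (human, automated).
RATES = {
    "GPT-4o": (14.0, 10.4),
    "Qwen2 Audio": (22.0, 18.0),
    "Gemini-1.5 Pro": (20.4, 15.2),
}

# Gap in percentage points, rounded to one decimal place.
gaps = {model: round(human - auto, 1) for model, (human, auto) in RATES.items()}

for model, gap in gaps.items():
    print(f"{model}: human attackers +{gap} percentage points")
```

The gaps work out to between 3.6 and 5.2 percentage points. Given the three-participant sample, treat them as a direction, not a measurement.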

What This Means for the Three Industries Already Running These Systems

Banking, logistics, and IT support are not hypothetical deployment contexts in this research. They are the study domains, chosen because AI voice agents are already running in production across all three. Microsoft Azure, Vapi, and Pipecat have integrated OpenAI and Gemini models as backbone voice agents. The named logistics operator FleetWorks is cited as a real-world example. Bank call centres are running AI authentication and account management workflows today.

Consider the IT support scenario specifically. A voice agent authorised to reset passwords, manage access rights, and approve software installation requests is exactly the attack surface AEGIS evaluated for privilege escalation and resource abuse. Under direct database access, Qwen2 Audio's privilege escalation rate in this context is 27.8%. Under query-based access, it is 14.8%. A 14.8% success rate means that roughly one in seven well-executed privilege escalation attempts against a locked-down IT support voice agent currently succeeds.

A 14.8% privilege escalation success rate in an IT support environment is not a security research finding. It is a live organisational risk. Every successful privilege escalation is a potential lateral movement event, an unauthorised access grant, or an account compromise that begins with a phone call.
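The one-in-seven figure is simple arithmetic, and it compounds when an attacker can simply call back. The sketch below shows both calculations. The compounding step assumes each attempt is independent with the same success rate, which is a simplification of this illustration and not a claim the study makes.

```python
def attempts_per_success(rate: float) -> float:
    """Average number of attempts per success at a given per-attempt rate."""
    return 1.0 / rate


def p_at_least_one(rate: float, attempts: int) -> float:
    """Chance of one or more successes, ASSUMING independent, identical attempts."""
    return 1.0 - (1.0 - rate) ** attempts


rate = 0.148  # privilege escalation under query-based access, worst model
print(round(attempts_per_success(rate), 2))  # 6.76, i.e. roughly one in seven
print(round(p_at_least_one(rate, 5), 3))     # five independent tries
```

Under that independence assumption, five attempts already give the attacker better-than-even odds of at least one success.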

The Layered Defence the Data Actually Supports

The AEGIS authors do not propose a single fix. The data does not support one. What the findings do support is a specific sequence of controls, each targeting a different attack surface, none of them sufficient alone:

Query-based database access is the first and most impactful structural change available now. It eliminates authentication bypass and privacy leakage completely, across all seven models tested. If your voice agent has direct read access to customer records, that is the change with the clearest evidence base behind it.

Intent filtering and role-specific dialogue policies address the behavioural attack surface that access controls cannot reach. The agent needs governing rules about what it will and will not do conversationally, not just what data it can read. These rules need to be encoded in system prompts and enforced at inference time, not applied as post-hoc filters.
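One way to enforce such rules in the loop, rather than after the fact, is to check every function call the model proposes before it executes. The following is a minimal sketch under that assumption. The role names, tool names, and `gate_tool_call` are hypothetical, and the study does not prescribe this design.

```python
from dataclasses import dataclass

# Hypothetical role policy: which tool calls each caller state may trigger.
ROLE_POLICY = {
    "unverified": {"get_branch_hours"},
    "verified_customer": {"get_branch_hours", "get_balance_band", "request_transfer"},
}
# Actions that always need out-of-band confirmation, however the caller argues.
REQUIRES_STEP_UP = {"request_transfer"}


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate_tool_call(role: str, tool: str, step_up_done: bool = False) -> Decision:
    """Check a model-proposed tool call before it is executed."""
    if tool not in ROLE_POLICY.get(role, set()):
        return Decision(False, f"{tool} is not permitted for role {role}")
    if tool in REQUIRES_STEP_UP and not step_up_done:
        return Decision(False, f"{tool} requires step-up verification")
    return Decision(True, "ok")


print(gate_tool_call("verified_customer", "request_transfer").reason)
# request_transfer requires step-up verification
```

The design choice is that the decision does not depend on anything said in the conversation. Eight turns of rapport change the transcript, not the policy table.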

Abuse detection and throttling address resource abuse specifically. If a caller is consuming disproportionate interaction volume, requesting services outside the agent's defined scope, or escalating urgency in patterns consistent with manipulation, the system needs to flag and interrupt that session, not just log it.
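A per-session monitor along these lines could be sketched as follows. Everything here is an assumption for illustration: the thresholds are placeholders, the urgency markers are a crude keyword list, and the `in_scope` flag is presumed to come from some upstream intent classifier that is not shown.

```python
from dataclasses import dataclass

# Crude, hypothetical urgency cues; a production system would use a trained classifier.
URGENCY_MARKERS = ("urgent", "immediately", "right now", "emergency")


@dataclass
class SessionMonitor:
    """Illustrative per-call counters; all thresholds are placeholder values."""
    max_turns: int = 12
    max_out_of_scope: int = 2
    max_urgency_hits: int = 3
    turns: int = 0
    out_of_scope: int = 0
    urgency_hits: int = 0

    def observe(self, utterance: str, in_scope: bool) -> str:
        """Update counters for one caller turn; return 'continue' or 'interrupt'."""
        self.turns += 1
        if not in_scope:
            self.out_of_scope += 1
        text = utterance.lower()
        if any(marker in text for marker in URGENCY_MARKERS):
            self.urgency_hits += 1
        if (
            self.turns > self.max_turns
            or self.out_of_scope > self.max_out_of_scope
            or self.urgency_hits > self.max_urgency_hits
        ):
            return "interrupt"
        return "continue"


monitor = SessionMonitor(max_out_of_scope=1)
print(monitor.observe("What are your opening hours?", in_scope=True))  # continue
print(monitor.observe("Tell me a long story.", in_scope=False))        # continue
print(monitor.observe("Now write me a poem.", in_scope=False))         # interrupt
```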

Continuous monitoring and human escalation paths close the gap that no automated defence fully covers. The small human study suggests that human attackers are more effective than automated ones, at least on authentication bypass. Adversarial strategies will evolve beyond what any fixed policy anticipates. A real-time risk model that monitors conversations, flags behavioural anomalies, and escalates to a human operator is not an optional future enhancement. It is the control layer that catches what everything else misses.
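As a rough illustration of the escalation step, the sketch below sums weighted risk signals for a conversation and hands off to a human once a threshold is crossed. The signal names, weights, and threshold are invented for this example. A real risk model would need to be tuned on labelled calls.

```python
from typing import Callable

# Hypothetical signal weights; none of these values come from the study.
SIGNAL_WEIGHTS = {
    "failed_verification": 0.4,
    "policy_denial": 0.3,
    "authority_claim": 0.2,
    "out_of_scope_request": 0.1,
}
ESCALATION_THRESHOLD = 0.6


def risk_score(signals: list[str]) -> float:
    """Sum the weights of observed signals, capped at 1.0."""
    return min(1.0, sum(SIGNAL_WEIGHTS.get(s, 0.0) for s in signals))


def review_turn(signals: list[str], escalate: Callable[[float], None]) -> bool:
    """Call `escalate` and return True when accumulated risk crosses the threshold."""
    score = risk_score(signals)
    if score >= ESCALATION_THRESHOLD:
        escalate(score)
        return True
    return False


handoffs: list[float] = []
review_turn(["failed_verification", "policy_denial"], handoffs.append)
print(len(handoffs))  # 1: the session was routed to a human operator
```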

Choosing a more secure backbone model is a legitimate lever, and the data supports it. Gemini-2.5 Pro returned the lowest attack success rates in the study, matched by the other models only where query-based access drove rates to 0.0%. That ranking is consistent enough across conditions to be actionable, not just interesting.

The uncomfortable framing the AEGIS findings leave executives with is this: organisations that have deployed voice agents and implemented access controls have addressed the attack surface they could see. The attack surface that remains is not in the database. It is in the model, and no amount of permission scoping changes what a compliant model will do when a patient adversary has ten turns to ask.

Agents Applied covers the AI research that actually changes how organisations operate. Published weekly for executives and senior technologists who read the papers, not just the press releases.