The Enterprise Search Stack That Keeps Lying to Your LLM
A compliance analyst submits a complex query about a regulatory exposure buried across three sections of a 200-page filing. The RAG system retrieves four documents, generates a confident, well-structured answer, and moves on. The answer is accurate about what it retrieved. It is silent about everything it did not retrieve. Nobody notices. The decision gets made.
The failure here is not hallucination in the dramatic sense. The model did not invent a fact. It synthesized retrieved content correctly. The breakdown happened before synthesis began, at the moment the search stack returned its fixed candidate set and handed the LLM a constrained window of evidence it was not permitted to question, expand, or navigate further. Every reasoning capability the model possesses becomes irrelevant the moment the retrieval is done and the evidence set is closed.
Researchers with apparent ties to the Copilot Studio team at Microsoft published a paper addressing this architectural flaw directly. Their system, AgenticRAG, does not improve the search stack. It changes who is in charge of the retrieval process. On FinanceBench, a benchmark of 150 questions against real financial filings, traditional RAG achieves 24.24% answer correctness. AgenticRAG achieves 92%, within 2 percentage points of a system given oracle access to the correct evidence. That is not a retrieval improvement. That is what happens when you let the model decide when retrieval is finished.
Why Your Search Stack Is Not the Problem You Think It Is
The framing most enterprises operate under goes roughly like this: better retrieval produces better answers. So the investment goes into embeddings, hybrid search, reranking, query expansion, and hypothetical document embeddings (HyDE). All of these are improvements at the margin. None of them changes the underlying architecture.
The paper states the problem directly: "Standard RAG pipelines place significant burden of grounding on the search stack, constraining the language model to a fixed candidate set chosen deep in the retrieval process." Every enhancement technique (HyDE, multi-query reformulation, adaptive retrieval) preserves this assumption. The LLM still operates over a candidate set it had no hand in selecting, with no ability to go back.
AgenticRAG is built on a different insight: the search stack only needs to achieve recall. It does not need to achieve precision. The model handles precision through iterative, autonomous navigation. This decouples the quality of your answers from the quality of your search infrastructure in a way that years of retrieval enhancement work could not.
The numbers show what this decoupling produces. On BRIGHT, a public benchmark of complex, reasoning-heavy retrieval tasks across eight domains, the best embedding model achieves 27.8% recall@1. BM25 achieves 11.4%. AgenticRAG with Claude Sonnet 4.5 achieves 49.6%, a 21.8 percentage point absolute gain over the best embedding baseline. That margin holds across domains: Economics (+33.0 pp over best baseline), Robotics (+33.7 pp), Psychology (+25.6 pp).
Four Tools and an Agentic Loop: What Actually Changes
AgenticRAG layers a four-tool harness on top of existing enterprise search infrastructure. No model fine-tuning. No custom embeddings. No graph construction. No corpus-specific preprocessing beyond indexing into whatever search backend is already running. The tools give the LLM four verbs it did not have before:
- SEARCH delegates to the enterprise search stack, accepting up to five query reformulations in a single call and returning up to ten deduplicated results per query. Each result receives a unique reference ID that makes it addressable in subsequent operations. The model decides what to search for, not the retrieval pipeline.
- OPEN retrieves the full content of a specific document in a fixed 1,800-line window, with a header indicating position and total document length. The model can navigate through long filings section by section, reading only what it needs rather than consuming an entire document dump.
- FIND performs targeted in-document search on a specific reference ID using keyword patterns, returning up to two matching passages per pattern. When the model knows what it is looking for inside a long document, it can go directly to that evidence rather than reading sequentially.
- SUMMARIZE manages the context window. When token usage hits 90% of the 128K budget, the harness issues a warning. At the threshold, it forces summarization, consolidating reasoning and preserving only the references the model designates as relevant.
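The first three verbs can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the Harness class, its method names, the "refN" ID format, and the toy backend are all hypothetical, deduplication is assumed to apply across the queries of a single call, and FIND is reduced to case-insensitive literal matching. Only the limits (five queries, ten results, 1,800 lines, two passages) come from the description above.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

WINDOW_LINES = 1800   # OPEN window size
MAX_QUERIES = 5       # reformulations accepted per SEARCH call
MAX_RESULTS = 10      # results kept per query
MAX_MATCHES = 2       # passages returned per FIND pattern


@dataclass
class Harness:
    # backend stands in for the existing search stack (hypothetical shape):
    # it takes a query string and returns a list of (doc_key, full_text) pairs.
    backend: Callable[[str], list]
    texts: dict = field(default_factory=dict)       # ref_id -> document text
    ids_by_key: dict = field(default_factory=dict)  # doc_key -> ref_id

    def search(self, queries):
        """Run the queries and return deduplicated reference IDs."""
        found = []
        for query in queries[:MAX_QUERIES]:
            for doc_key, text in self.backend(query)[:MAX_RESULTS]:
                if doc_key not in self.ids_by_key:
                    ref_id = f"ref{len(self.ids_by_key) + 1}"
                    self.ids_by_key[doc_key] = ref_id
                    self.texts[ref_id] = text
                ref_id = self.ids_by_key[doc_key]
                if ref_id not in found:
                    found.append(ref_id)
        return found

    def open_doc(self, ref_id, start_line=0):
        """Return one window of the document, with a position header."""
        lines = self.texts[ref_id].splitlines()
        window = lines[start_line:start_line + WINDOW_LINES]
        last = start_line + len(window)
        header = f"[{ref_id}] lines {start_line + 1}-{last} of {len(lines)}"
        return "\n".join([header] + window)

    def find(self, ref_id, patterns):
        """Return up to MAX_MATCHES matching lines for each pattern."""
        lines = self.texts[ref_id].splitlines()
        return {
            p: [ln for ln in lines
                if re.search(re.escape(p), ln, re.IGNORECASE)][:MAX_MATCHES]
            for p in patterns
        }


# Toy corpus and backend, invented purely for illustration.
corpus = {
    "10-K": "Revenue rose.\nLease liabilities were $40M.\nSee Note 7.",
    "8-K": "Leadership change announced.",
}


def toy_backend(query):
    words = query.lower().split()
    return [(k, v) for k, v in corpus.items()
            if any(w in v.lower() for w in words)]


harness = Harness(toy_backend)
ref_ids = harness.search(["lease liabilities", "revenue lease"])
passages = harness.find(ref_ids[0], ["lease"])
page = harness.open_doc(ref_ids[0])
```

Both reformulations hit the same filing, so the model sees one reference ID rather than two copies of the same document, and it can then address that ID in later OPEN and FIND calls.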
The agentic loop runs for a maximum of 15 iterations. In practice, Claude Sonnet 4.5 uses an average of 4.48 tool calls per query on BRIGHT. The model terminates early when it judges the evidence sufficient. The budget is rarely stressed.
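The control flow described here can be sketched as a bounded loop. This is an illustrative sketch under stated assumptions, not the authors' harness: run_agent, decide, summarize, and count_tokens are hypothetical names, the model is replaced by a callable that returns either a tool request or a final answer, and the sketch assumes the warning fires at 90% of the budget while forced summarization fires at the full 128K.

```python
MAX_ITERATIONS = 15     # loop cap
TOKEN_BUDGET = 128_000  # context budget
WARN_FRACTION = 0.90    # warning level (assumed distinct from the forced level)


def run_agent(question, decide, tools, summarize, count_tokens):
    """Drive the loop and return (answer, tool_calls_made).

    decide(context) stands in for the LLM: it returns ("answer", text)
    to stop, or (tool_name, payload) to request a tool call.
    """
    context = [f"QUESTION: {question}"]
    for calls_made in range(MAX_ITERATIONS):
        used = count_tokens(context)
        if used >= TOKEN_BUDGET:
            # Forced consolidation: keep the question plus a summary.
            context = [context[0], summarize(context)]
        elif used >= WARN_FRACTION * TOKEN_BUDGET:
            context.append("WARNING: context nearly full")
        action, payload = decide(context)
        if action == "answer":
            # The model judged the evidence sufficient and stopped early.
            return payload, calls_made
        context.append(f"{action} -> {tools[action](payload)}")
    return None, MAX_ITERATIONS  # budget exhausted without an answer


# Scripted stand-in for the model: two tool calls, then an answer.
script = iter([
    ("search", "lease liabilities"),
    ("find", "Note 7"),
    ("answer", "$40M"),
])
toy_tools = {
    "search": lambda query: "ref1",
    "find": lambda pattern: "Lease liabilities were $40M.",
}
answer, calls = run_agent(
    "What were lease liabilities?",
    decide=lambda context: next(script),
    tools=toy_tools,
    summarize=lambda context: "SUMMARY",
    count_tokens=lambda context: sum(len(item) for item in context),
)
```

The point of the structure is that the stopping decision belongs to decide, not to the retrieval pipeline. The iteration cap is only a backstop.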
The most important finding from the ablation study is also the most counterintuitive: removing the semantic find capability slightly improves average recall@1, from 43.49% to 46.34%. The authors attribute this to lexical matching being sufficient for most in-document searches, with the semantic option adding latency without proportionate benefit. The architecture is more tool-agnostic than it appears. The capability that matters most is not which individual tool is used, but the shift from single-shot retrieval to iterative, LLM-driven search.
That shift produces a 5.9x improvement in recall@1 for Claude Sonnet 4.5, and 5.2x for GPT-5-mini, compared to a single-shot search baseline. The single-shot baseline scores 8.41% recall@1. AgenticRAG scores 49.59%.
The Trade-off That Is Worth Having and the One That Is Not
AgenticRAG costs more tokens. On BRIGHT, the average query consumes 52.3K tokens versus 20.4K for single-shot, a 2.6x overhead. On FinanceBench, the overhead reaches 7.8x (114.8K tokens versus 14.7K). These numbers include system prompt, tool calls, tool results, and thinking tokens.
The trade-off calculation is straightforward on most tasks. On BRIGHT, you spend 2.6x more tokens to get 5.9x better recall@1. The ratio is favorable. On FinanceBench, you spend 7.8x more tokens to achieve 92% correctness versus 24.24% with traditional RAG. For questions about financial filings where a wrong answer carries real consequences, the ratio is also favorable, just more expensive.
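The arithmetic behind those multiples is easy to check. The short sketch below uses only figures quoted in this article. The variable names are illustrative, and the 8.41% and 49.59% recall@1 values are the single-shot and AgenticRAG scores reported for Claude Sonnet 4.5 on BRIGHT.

```python
# Token overhead: average tokens per query, AgenticRAG vs. single-shot.
bright_overhead = 52.3 / 20.4         # about 2.6x
financebench_overhead = 114.8 / 14.7  # about 7.8x

# Quality gain: AgenticRAG score divided by the baseline score.
bright_gain = 49.59 / 8.41            # recall@1, about 5.9x
financebench_gain = 92.0 / 24.24      # answer correctness, about 3.8x

# Absolute jump in FinanceBench correctness, in percentage points.
financebench_points = 92.0 - 24.24    # about 67.8 pp

for name, overhead, gain in [
    ("BRIGHT", bright_overhead, bright_gain),
    ("FinanceBench", financebench_overhead, financebench_gain),
]:
    print(f"{name}: {overhead:.1f}x tokens for {gain:.1f}x quality")
```

On BRIGHT the relative gain exceeds the token multiple. On FinanceBench the relative gain (about 3.8x) is smaller than the token multiple (about 7.8x), so the case there rests on the absolute jump of roughly 68 percentage points and on what a wrong answer costs.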
Multi-query search partly offsets the cost. When the model issues multiple reformulations in a single SEARCH call, it reduces total tool calls by 29% (from 6.79 average to 4.79) and cuts OPEN calls by 44%. The efficiency mechanism is not a minor optimization. It is what keeps the system practical at production scale.
The harder limit is structural. AgenticRAG is optimized for coarse-to-fine navigation toward a small number of high-value evidence sources. On the Pony split of BRIGHT, which requires recovering many related documents distributed across a broad corpus, both models stay in the single digits: 4.8% for GPT-5-mini and 7.1% for Claude Sonnet 4.5, against 0.40% for single-shot. That beats the baseline but sits far below the 49.6% that Claude Sonnet 4.5 reaches on BRIGHT overall. The architecture assumes that the right answer lives in a small number of retrievable locations. When the answer requires a broad sweep of loosely connected evidence, that assumption does not hold. Queries like "summarize all customer complaints related to product X across the last three years" are not what this system was built to answer.
Two Models, Two Strategies, One Gap Worth Understanding
Claude Sonnet 4.5 and GPT-5-mini both run the same four-tool harness. They reach similar answers through different strategies, and the strategic difference has operational implications.
Claude favors exploitation: fewer search calls (2.51 versus 3.39), deeper document reading via OPEN (1.54 calls versus 1.22), and three times as much semantic FIND usage (0.42 versus 0.14). It commits to a document earlier and extracts from it more thoroughly.
GPT-5-mini favors exploration: more search calls, wider initial net, less deep reading per document. On most domains Claude outperforms this strategy. On Stack Overflow, where answers tend to be distributed across many short documents rather than concentrated in a few long ones, GPT-5-mini scores 40.62% versus Claude's 34.05%.
This is not a model quality comparison. It is a signal about query architecture. The right model depends on the document topology of your corpus. If your enterprise knowledge base is dense with long-form documents (regulatory filings, technical manuals, financial reports), the exploitation strategy produces better results. If it is a large corpus of short, distributed artifacts (support tickets, forum answers, brief policy documents), the exploration strategy has an edge.
What 92% Correctness Looks Like in a Financial Operations Context
FinanceBench is the benchmark where the stakes are clearest. The 150 questions require answers extracted from real financial filings: revenue figures, margin calculations, year-over-year comparisons, footnote disclosures. These are not general knowledge questions. They require precise location and extraction of evidence that is often buried in dense tables or multi-paragraph footnotes.
Traditional RAG achieves 24.24% answer correctness on this benchmark. A previous agentic system using keyword search tools (pdfgrep, rga, Linux command tools) achieves 32.71%. AgenticRAG with GPT-5-mini achieves 92%, within 2 percentage points of a system given direct access to the correct evidence before the question is asked.
The gap between 32.71% and 92% is the gap between agentic search that uses keyword tools and agentic search that uses structured document navigation. The keyword-based system can find a term inside a document. AgenticRAG can read around a found term, understand its context, cross-reference it against a figure in a different section, and confirm the answer is complete before generating it. The tool design is what creates that difference.
For a financial analyst team running quarterly document review against hundreds of filings, the difference between 32% and 92% correctness is not an abstract quality metric. It is the number of answers that require manual verification before they can be acted on.
What Deploying This Actually Requires
The paper's most practically significant claim is also its most unusual one for an AI systems paper: the system requires no model fine-tuning, no custom embedding model, no graph construction, and no corpus-specific preprocessing beyond indexing documents into an existing enterprise search backend.
That claim changes the deployment calculus. Most architectural improvements to RAG require rebuilding parts of the pipeline. AgenticRAG requires adding a harness on top of what is already running.
The practical deployment path has three decision points:
First, confirm that your existing search infrastructure can serve as the SEARCH tool backend. The system delegates to whatever enterprise search stack is already running. The requirement is that it achieves reasonable recall on your corpus, not that it achieves precision. Precision is the model's job.
Second, evaluate your query distribution against the system's architectural assumption. AgenticRAG is optimized for queries where the answer is concentrated in a small number of documents. If your highest-volume query type requires broad evidence aggregation across many sources, the gains will be smaller and the Pony failure pattern will appear.
Third, price the token overhead against the correctness gain before deploying at scale. The 2.6x overhead on BRIGHT-style queries is manageable. The 7.8x overhead on FinanceBench-style queries is worth it for high-stakes financial analysis but needs to be modeled explicitly for high-volume, lower-stakes retrieval workloads.
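One way to model this explicitly is cost per correct answer rather than cost per query. The sketch below is a rough illustration, not a pricing tool. The Workload class and cost_per_correct_answer function are hypothetical, the price per million tokens is a placeholder to replace with your own rate, and it assumes, as the comparison above does, that the 14.7K-token single-shot figure pairs with the 24.24% traditional RAG baseline.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    tokens_per_query: float  # average tokens per query (FinanceBench figures)
    correctness: float       # fraction of answers that are correct


def cost_per_correct_answer(workload, usd_per_million_tokens):
    """Expected token spend needed to obtain one correct answer."""
    cost_per_query = workload.tokens_per_query / 1_000_000 * usd_per_million_tokens
    return cost_per_query / workload.correctness


# HYPOTHETICAL blended price; substitute your provider's actual rate.
USD_PER_MILLION_TOKENS = 1.00

traditional = Workload("traditional RAG", 14_700, 0.2424)
agentic = Workload("AgenticRAG", 114_800, 0.92)

traditional_cost = cost_per_correct_answer(traditional, USD_PER_MILLION_TOKENS)
agentic_cost = cost_per_correct_answer(agentic, USD_PER_MILLION_TOKENS)

# The price cancels out of this ratio, so it holds at any token rate.
relative_cost = agentic_cost / traditional_cost
print(f"AgenticRAG costs {relative_cost:.2f}x as much per correct answer")
```

Measured this way, the 7.8x per-query overhead shrinks to roughly 2x per correct answer on the FinanceBench figures, before counting the manual verification a wrong answer triggers. For lower-stakes workloads, substitute your own correctness and token measurements.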
The architecture works on existing infrastructure. The question is which parts of your retrieval workload match the problem it was built to solve.
The enterprise RAG failure mode was never that the model was not smart enough. It was that the model was never given a second look.
Agents Applied covers applied AI research for executives and senior technologists. Each issue translates one paper into operational implications.