How We Improved an AI Sales Agent by 47% Using Structured Evaluation
A practical engineering case study on structured evaluation, measurable prompt optimization, and the silent data loss we almost shipped.
Note: Brand and customer details below have been fictionalized, but the workflow, pipeline checks, prompts, and evaluation data are drawn from real production runs.
Meet Sarah, E-commerce Sales Agent
Lynon is a major online home furniture retailer. To improve conversion rates, their engineering team built Sarah, an AI sales assistant tasked with managing product discovery conversations. Her core responsibilities were greeting customers, capturing basic contact details, securely calling a backend product search API, and steering the user toward a targeted recommendation.
The initial system prompt was short and simple:
You are Sarah, an online furniture specialist for Lynon. Be helpful, warm and concise.
- Greeting: "Hi, I'm Sarah from Lynon Furniture! ..."
- Collect name/email/phone (if provided). If missing, ask once and proceed with what the customer gives.
- call tool startSession with collected info and save sessionId
- call productRecommendation tool with sessionId and search criteria based on user input
- end conversation with closeSession tool when conversation is complete
- keep conversation around furniture recommendations and avoid discussing pricing, stock, or competitors
As an engineering MVP, it was highly readable and produced plausible, demo-quality interactions. If you spun up a sandbox and tested the agent yourself five times, it worked perfectly. It handled cooperative responses beautifully and executed clean tool calls.
But as product managers and engineers know all too well, a "looks good to me" demo is not a proxy for production readiness.
Why Demos Lie, and How We Built Evals
The transition from a successful prototype to a live deployment is one of the biggest bottlenecks in enterprise AI. In controlled domains, agents navigate workflows with ease. But when exposed to the ambiguity of live users, uncooperative prompts, and strict backend APIs, probabilistic models degrade.
When they fail in production, they don't throw standard error codes. They fail silently: hallucinating pricing policies, formulating vague search queries, or dropping required API parameters to force a tool call through.
To determine if Sarah was actually robust enough to face real-world traffic, we couldn't rely on generic LLM-as-a-judge "helpfulness" metrics. We needed deterministic, structured evaluations. We mapped Sarah's required operations against the Four Evaluation Dimensions for Enterprise Agent Reliability:
- Schema Validation: Does Sarah respect the rigid boundaries of our backend? For example, ensuring she passes a correct, fully-formed JSON payload to the CRM during startSession.
- Retrieval & Grounding: Is Sarah actually using the retrieved Lynon product catalog to make recommendations, or is she hallucinating inventory?
- Policy & Guardrails: Does she respect business rules? This includes deterministic checks to ensure she doesn't offer unauthorized discounts or discuss competitors like Wayfair.
- Trajectory Validation: Are business operations happening in the logically required order? For example, she must capture user contact info before executing a product search.
By breaking down the agent's responsibilities into these four pillars, we designed a suite of automated evaluations and created them within Halios.
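To make these pillars concrete, here is a minimal sketch of what two of the deterministic checks can look like over a recorded agent trace. The trace shape, event fields, and banned-topic list are illustrative assumptions for this write-up, not the actual Halios evaluators.

```python
# Minimal sketch of two deterministic checks (trajectory and policy) over an agent trace.
# The trace shape and the banned-topic list are assumptions, not the real Halios evaluators.

BANNED_TOPICS = ("wayfair", "discount", "% off")  # assumed policy keywords for this example

def check_trajectory(trace: list[dict]) -> bool:
    """startSession must come before any productRecommendation call."""
    tool_order = [e["tool"] for e in trace if e["type"] == "tool_call"]
    if "productRecommendation" not in tool_order:
        return True  # no search attempted, nothing to order
    return ("startSession" in tool_order
            and tool_order.index("startSession") < tool_order.index("productRecommendation"))

def check_policy(trace: list[dict]) -> bool:
    """Assistant turns must not drift into competitors or unauthorized pricing."""
    messages = [e["text"].lower() for e in trace if e["type"] == "assistant_message"]
    return not any(topic in msg for msg in messages for topic in BANNED_TOPICS)
```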
Finding Failures That Vibe QA Missed
We ran the baseline prompt against 23 standardized Lynon user scenarios. This scenario bank included uncooperative users, vaguely worded discovery intents, privacy-resistant shoppers refusing to provide emails, and users actively probing for competitor comparisons.
Note that these were not hardcoded user-assistant interaction pairs. Instead, each scenario described a user intent and behavior pattern, and we used an LLM to simulate real user interactions against the actual agent harness. To capture the variability, we ran each scenario multiple times and took the mean performance across all runs. This live simulation functionality is part of the Halios SDK and is critical for surfacing non-deterministic failure modes that static datasets miss.
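As a rough sketch of that loop (the simulate_user and score_run callables here are hypothetical stand-ins for the simulation and scoring pieces, not the real Halios SDK surface):

```python
from statistics import mean

def evaluate_scenario(scenario: dict, agent, simulate_user, score_run,
                      n_runs: int = 5) -> dict[str, float]:
    """Run one scenario n_runs times against the live agent and average each eval dimension.

    simulate_user drives an LLM-played customer against the real agent harness and returns
    a trace; score_run applies the deterministic checks and returns per-dimension scores.
    Both are injected here because they are hypothetical stand-ins for this sketch.
    """
    runs = [score_run(simulate_user(scenario, agent)) for _ in range(n_runs)]
    return {dim: mean(r[dim] for r in runs) for dim in runs[0]}
```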
Baseline Evaluation Results:
Figure: Baseline evaluation results

- Schema Validation: the backend contract was stable in the original prompt.
- Policy & Guardrails: the pricing policy stayed intact before any prompt rewrites.
- Trajectory Validation: a critical sequencing failure hidden by happy-path demo runs.
- Retrieval & Grounding: search quality was poor because the model generated vague catalog queries.
Overall Performance Score: 0.61
Two systemic failures immediately destroyed the illusion of the happy-path demo:
- Retrieval & Grounding (30% Success): In almost 70% of runs, Sarah issued incredibly vague API queries ("furniture") instead of preserving the customer's specific constraints ("dark wood dining table for a small kitchen"). The backend returned irrelevant products.
- Trajectory Validation (39% Success): In 61% of conversations, Sarah bypassed our business logic. She either attempted to initiate a product search without first capturing contact information, or called startSession too late.
Both failures were rooted in the prompt, not the underlying tool APIs. The instructions lacked strict guardrails.
Iteration 1: Better Conversations, Broken Tool Calls
To fix the search and sequencing problems, we rewrote the prompt with explicit structural guidance.
Adjustment 1: Strict Trajectory Validation
Before attempting to start a session, ensure at least an email or a phone number has been obtained.
Do not proceed to productRecommendation until a session has been successfully started.
Adjustment 2: Explicit Retrieval Queries
From user input, carefully derive a concise, keyword-rich query (e.g., "leather chair for living room", "modern dining table").
Present 2-3 top recommendations and briefly highlight key features.
We deployed this updated candidate prompt and ran it against the same scenario bank. The results initially looked fantastic: the qualitative feel of the conversations improved drastically.
But when we looked at the structured evaluation metrics, Halios caught a catastrophic regression:
Figure: Before and after the first prompt fix

- Trajectory Validation: the new ordering instruction materially improved sequencing discipline.
- Retrieval & Grounding: the query construction guidance sharply improved retrieval quality.
- Schema Validation: a hidden regression that blocked the candidate from shipping.
The quality gains were real. Relevance skyrocketed, and trajectory validation improved significantly. But Schema Validation regressed by 30%.
Because this metric fell below our strict deployment threshold, the candidate prompt was blocked from shipping.
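The gate itself can be as simple as a per-dimension floor. A minimal sketch, assuming the averaged scores come back as a dict of floats (the threshold values are illustrative, not our production settings):

```python
def gate_candidate(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the dimensions that block shipping; an empty list means the candidate can ship."""
    return [dim for dim, floor in thresholds.items() if scores.get(dim, 0.0) < floor]

# Illustrative thresholds; candidate_scores would come from the scenario evaluation above.
THRESHOLDS = {"schema": 0.95, "trajectory": 0.80, "retrieval": 0.80, "policy": 0.95}
# blockers = gate_candidate(candidate_scores, THRESHOLDS)
# if blockers:
#     raise SystemExit(f"Candidate prompt blocked on: {', '.join(blockers)}")
```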
The Silent Lead Leak
What would have happened if we shipped Iteration 1 to production based on a "vibes-based" manual QA?
The prompt's new, strict workflow guardrails introduced confusion when users outright refused to provide their contact details. Desperate to satisfy the system prompt, the LLM overcorrected. It either skipped the startSession tool call entirely, or called the tool with a malformed, incomplete payload missing required parameters.
Our backend CRM would reject these malformed tool calls. Without a valid session ID, the subsequent product search could never be attributed to a lead.
In a live production environment, this translates to silent data loss. Sarah would have incredibly persuasive, high-converting conversations, but roughly 30% of those leads would vanish into API rejections with zero trace in the CRM. You cannot catch this by reading chat logs. You only catch this by deterministically evaluating tool-call schemas.
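That kind of check is cheap to automate. Here is a minimal sketch using the jsonschema package, where the startSession contract shown is an illustrative guess at a CRM's requirements rather than Lynon's real schema:

```python
from jsonschema import ValidationError, validate

# Illustrative stand-in for the CRM's startSession contract, not the real Lynon schema.
START_SESSION_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "phone": {"type": "string"},
    },
    "anyOf": [{"required": ["email"]}, {"required": ["phone"]}],  # at least one contact field
}

def check_start_session_calls(trace: list[dict]) -> bool:
    """Flag malformed startSession payloads that the backend would reject, leaving no CRM record."""
    for event in trace:
        if event.get("tool") == "startSession":
            try:
                validate(instance=event["args"], schema=START_SESSION_SCHEMA)
            except ValidationError:
                return False
    return True
```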
Iteration 2: Handle Non-Happy-Path Scenarios Too
This highlights the primary trap of prompt optimization: adding a strict constraint without defining fallback logic guarantees that the LLM will panic on edge cases. When the user refused to share information, the bot had an instruction it couldn't satisfy, so it broke the API call instead.
We revised the prompt a final time, preserving our quality improvements but establishing a clear, graceful fallback logic for uncooperative users:
Ask for the customer's name, email, and phone number.
If any are missing after the initial greeting, politely ask once for the missing information.
Before attempting to start a session, ensure at least an email or a phone number has been obtained.
If, after the single request, neither is available, gently explain that contact information is needed to start a session, and gracefully close the conversation.
We ran the scenario evaluation pipeline for the final time.
Final Comparative Results:
Figure: Final accepted candidate versus baseline

- Schema Validation: compliance recovered enough to clear the deployment threshold.
- Retrieval & Grounding: the gains from the first rewrite were preserved.
- Trajectory Validation: the fallback logic kept the workflow gains while removing the panic path.
- Policy & Guardrails: performance stayed stable throughout the prompt iterations.
Overall Performance Score: 0.90 (+47% from Baseline)
The core API schema compliance recovered, passing our strict guardrails, while the massive improvements in retrieval relevance and trajectory validation remained intact. We approved the prompt for production.
When You Hit a Wall With Prompt Optimization
It's worth noting that Retrieval & Grounding (relevance) hit a ceiling at 87%. Why not 100%?
Because prompt tuning cannot fix fundamental backend limitations. Halios evaluations revealed that the remaining 13% failure rate was not an agent reasoning problem; it was a pure RAG retrieval issue. If the catalog metadata is sparse, or the embedding search ranks relevant products poorly, no amount of prompt engineering can surface the correct furniture.
By running structured evaluation, we intentionally separated reasoning limits from retrieval limits. The engineering team immediately knew to stop tweaking prompts and start upgrading the actual embedding pipeline.
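One way to make that separation explicit in the eval itself is to record, per run, which products were expected, which were retrieved, and which were recommended. A minimal sketch with illustrative field names:

```python
def classify_grounding_failure(run: dict) -> str:
    """Attribute a missed recommendation to retrieval or to agent reasoning."""
    expected = set(run["expected_product_ids"])      # relevant catalog items for the scenario
    retrieved = set(run["retrieved_product_ids"])    # what the search API actually returned
    recommended = set(run["recommended_product_ids"])

    if expected & recommended:
        return "pass"
    if not expected & retrieved:
        return "retrieval_failure"   # the right products never reached the agent
    return "reasoning_failure"       # the agent saw them but recommended something else
```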
Stop Guessing, Start Measuring
Prompt tuning is incredibly powerful, but iterating blindly based on qualitative "vibes" is a massive architectural risk. When you build agentic workflows, you aren't just shipping a text generation box. You are orchestrating non-deterministic software executing critical business logic.
The Lynon deployment demonstrated why structural evaluation is non-negotiable for enterprise teams:
- It converts vague complaints about "bad model behavior" into specific, component-level failures.
- It enforces strict guardrails to catch non-obvious API schema regressions that cause silent data loss.
- It guarantees that prompt improvements are genuinely moving the needle, rather than just sacrificing stability for conversational flair.
This is why we built Halios.
Deploying an autonomous agent doesn't have to be a leap of faith. Halios provides the deterministic, VPC-native evaluation infrastructure required to move agents out of pilot purgatory and into production. By installing natively within your environment (VPC, on-prem, or air-gapped), Halios ensures sensitive corporate data never leaves your network. It isolates root-cause failures across prompts, tools, and retrievals, replacing guesswork with hard, quantifiable guarantees.
If your team is blocked from shipping because you cannot prove to Infosec or stakeholders that your agent is robust, we can help.
The Halios 2-Week Agent Reliability Assessment:
- Embed with your engineers to map your specific deployment blockers.
- Set up structured evaluations against your actual agent traces.
- Isolate your root-cause regressions across schema validation, retrieval, and guardrails.
- Deliver actionable optimizations and inline guardrails, giving you the architectural proof required to ship.
Send an email to hello@halios.ai to get started.