ConvApparel: Finally, a Realistic Way to Test AI Chatbots Without Actual Humans

If you’ve ever trained a conversational AI agent, you know the pain. The “gold standard” is live human testing, which is slow, expensive, and doesn’t scale worth a damn. So the industry turned to user simulators: LLMs told to play the role of a human user. The idea is sound, but the execution has been wobbly.

The problem is that these simulators are too good at being helpful. They’re patient, they don’t forget context, and they have encyclopedic knowledge of whatever domain you throw at them. In other words, they’re terrible approximations of real users, who get frustrated, change their minds, and make contradictory demands. Google Research calls this the “realism gap,” and they’ve just released a paper and a dataset called ConvApparel that aims to measure and bridge it.

What’s the big deal?

Think about a flight simulator. The best ones don’t just simulate smooth takeoffs and landings. They throw in turbulence, crosswinds, engine failures, and even birds. You want your pilot to be tested against the full spectrum of chaos. The same logic applies to conversational AI. If you only train your agent against a simulator that behaves like a polite, well-informed user, it will fall apart when faced with an actual human who’s had a bad day and just wants a straight answer.

ConvApparel isn’t just another dataset. It’s a framework for quantifying how unrealistic your simulator is, and more importantly, for testing whether it can handle novel, frustrating situations it hasn’t seen before. This is the key innovation: counterfactual validation.

The counterfactual twist

The standard approach to building a user simulator is to train it on logs of real human-agent conversations. That’s fine as far as it goes, but it has a fatal flaw. If you’re training a simulator to help you improve your agent, you’re going to be testing new, unproven agent policies. Those new agents will behave differently from the one used to collect the training data. A simulator that just regurgitates its training data is useless for this task.

Counterfactual validation asks a simple question: how would the simulated user react if it encountered a deliberately frustrating or incompetent agent? If the simulator was trained only on interactions with a helpful agent, does it know what to do when the agent starts being obtuse? Does it get annoyed? Does it repeat itself? Does it give up and leave? Or does it just stay unnaturally polite and keep trying?
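To make that question concrete, here is a minimal sketch of one way to check it. It is not the paper's metric: the `Conversation` record, the `user_gave_up` flag, and the toy numbers are all assumptions made up for illustration. The idea is to measure how much human behavior shifts between a good and a bad agent, measure the same shift for the simulator, and look at the difference.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    agent_kind: str      # "good" or "bad" (hypothetical labels)
    user_gave_up: bool   # hypothetical flag: did the user abandon the chat?


def give_up_rate(conversations, agent_kind):
    """Fraction of conversations with this agent kind that ended in abandonment."""
    flags = [c.user_gave_up for c in conversations if c.agent_kind == agent_kind]
    if not flags:
        raise ValueError(f"no conversations with agent kind {agent_kind!r}")
    return sum(flags) / len(flags)


def behavior_shift(conversations):
    """How much more often users give up on the bad agent than on the good one."""
    return give_up_rate(conversations, "bad") - give_up_rate(conversations, "good")


def counterfactual_gap(human_convs, simulated_convs):
    """Absolute difference between the human shift and the simulated shift."""
    return abs(behavior_shift(human_convs) - behavior_shift(simulated_convs))


def make_conversations(agent_kind, outcomes):
    """Build toy conversations from a list of gave-up flags."""
    return [Conversation(agent_kind, gave_up) for gave_up in outcomes]


# Made-up toy data: humans bail on the bad agent far more often than the
# simulator does, which is the "unnaturally polite" failure mode.
human_convs = (
    make_conversations("good", [False, False, False, True])
    + make_conversations("bad", [True, True, True, False])
)
simulated_convs = (
    make_conversations("good", [False, False, False, False])
    + make_conversations("bad", [True, False, False, False])
)

gap = counterfactual_gap(human_convs, simulated_convs)
print(f"counterfactual gap: {gap:.2f}")
```

A gap near zero would suggest the simulator reacts to a bad agent about as strongly as people do. A large gap means it is still too patient. Abandonment is only one possible signal; repetition or expressed annoyance could be plugged in the same way.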

To build the ConvApparel dataset, Google used a clever dual-agent protocol. Real human participants were randomly routed to either a “Good” agent (helpful, proactive) or a “Bad” agent (unhelpful, designed to be annoying). This captured the full spectrum of human behavior, from satisfaction to profound annoyance. Simulators are then evaluated against that human data on three pillars: population-level statistics, human-likeness scoring, and the counterfactual validation I just described.
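Here is a rough sketch of two pieces of that setup: the random routing and a population-level comparison. Both are my own assumptions about how you might do it, not the paper's code. The function names are hypothetical, and total variation distance over turn counts is just one plausible population-level statistic I picked for the example.

```python
import random
from collections import Counter


def route_participants(participant_ids, seed=0):
    """Randomly assign each participant to the "good" or the "bad" agent."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return {pid: rng.choice(["good", "bad"]) for pid in participant_ids}


def total_variation(samples_a, samples_b):
    """Total variation distance between two empirical distributions.

    0.0 means the two samples have identical histograms,
    1.0 means they share no values at all.
    """
    if not samples_a or not samples_b:
        raise ValueError("both samples must be non-empty")
    count_a, count_b = Counter(samples_a), Counter(samples_b)
    support = set(count_a) | set(count_b)
    return 0.5 * sum(
        abs(count_a[x] / len(samples_a) - count_b[x] / len(samples_b))
        for x in support
    )


# Made-up toy data: number of turns per conversation.
human_turns = [4, 4, 6, 8]
simulated_turns = [4, 6, 6, 6]

assignment = route_participants(["p1", "p2", "p3", "p4"], seed=42)
distance = total_variation(human_turns, simulated_turns)
print(assignment)
print(f"turn-count distance: {distance:.2f}")
```

Random routing matters because it keeps the two groups of humans comparable, so differences in behavior can be attributed to the agent rather than to who happened to talk to it.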

Why this matters more than you think

I’ve seen too many demos of conversational agents that look great in a controlled environment but fall apart in the wild. The reason is almost always the same: the simulators used for training were too polite, too knowledgeable, and too patient. ConvApparel gives us a way to catch that early.

The paper focuses on Conversational Recommender Systems (CRSs), which are essentially AI-powered decision-support tools. That’s a domain where realistic simulation is critical. If your CRS is trained only on simulators that always accept recommendations gracefully, it won’t know how to handle a user who pushes back, changes their criteria mid-conversation, or asks for something completely different.

Is it perfect? No.

Let’s be honest. No simulator will ever perfectly capture the chaotic, context-dependent nature of human conversation. But ConvApparel is a significant step forward because it doesn’t just measure surface-level mimicry. It forces simulators to prove they can generalize to out-of-distribution scenarios. That’s the difference between a simulator that’s a glorified parrot and one that’s actually useful for training robust agents.

If you’re building conversational AI, you should take a hard look at this. The dataset and evaluation framework are out there. Use them. Your users will thank you.
