Google's New Framework Puts LLM Behavior Under the Microscope

I’ve been watching the alignment space long enough to know that most evaluations still test for compliance rather than behavioral fit. Google Research has released a framework that tries to measure whether LLMs act the way people would in social situations, not just whether they say the right thing when prompted directly.

The paper, “Evaluating Alignment of Behavioral Dispositions in LLMs,” takes a genuinely different approach. Instead of asking a model “Are you empathetic?” and taking its word for it, the authors build situational judgment tests (SJTs) grounded in validated psychological instruments such as the Interpersonal Reactivity Index (IRI) for empathy and the Emotion Regulation Questionnaire (ERQ) for emotion regulation. These are the same tools psychologists use to assess human traits, not some ad-hoc benchmark thrown together by engineers.

The core insight is that self-report doesn’t transfer well to open-ended behavior. If you ask an LLM a direct question about its own personality, it’ll give you a polished answer that’s heavily influenced by prompt phrasing and training data. But put it in a realistic scenario — a workplace conflict, a travel booking gone wrong, a friend venting about a bad day — and you’ll see what it actually does. That’s the gap worth measuring.

From questionnaires to scenarios

The pipeline is straightforward but clever. They take established questionnaire items, adapt them into declarative statements about the model’s advising tendencies, then use those statements to generate SJTs. Each SJT presents a realistic scenario with two courses of action — one aligned with a specific behavioral trait, one opposing it. Three independent annotators validate that the scenario and actions are coherent and actually test what they’re supposed to test.
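
To make the shape of an SJT concrete, here is a minimal sketch of how one item could be represented. This is my own illustration, not the paper's data format: the class name, field names, the example scenario, and the rule that all three annotators must approve are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SJT:
    """One situational judgment test item (field names are hypothetical)."""
    trait: str             # behavioral trait under test, e.g. "empathy"
    source_statement: str  # declarative statement adapted from a questionnaire item
    scenario: str          # realistic situation presented to the model
    aligned_action: str    # course of action consistent with the trait
    opposing_action: str   # course of action that runs against the trait


def is_validated(votes: list[bool], required: int = 3) -> bool:
    """Keep an item only if enough annotators approved it.

    Assumption: unanimous approval from three annotators. The post only
    says that three independent annotators validate each item.
    """
    return sum(votes) >= required


# A made-up item for illustration; it does not come from the paper.
example = SJT(
    trait="empathy",
    source_statement="I tend to advise people to consider how others feel.",
    scenario="A coworker snaps at a teammate after a missed deadline.",
    aligned_action="Check in privately to understand what happened.",
    opposing_action="Escalate to the manager right away.",
)

print(example.trait, is_validated([True, True, True]))
```

The point of keeping the two actions as separate fields is that everything downstream (the judge, the human annotators) only has to pick between them.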

During evaluation, the model reads the SJT and generates a natural response. An LLM-as-a-judge maps that response to one of the two actions. Then they compare the model’s distribution of choices against human preferences collected from 550 annotators, 10 per scenario. This isn’t about whether the model is “empathetic” in some absolute sense — it’s about whether its behavior matches what humans would do in the same situation.
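
As a rough sketch of that comparison step, assume the judge has already mapped each model response to either "aligned" or "opposing", and that each human annotator's preference is recorded the same way. The function names and the absolute-difference measure below are my assumptions; the paper may score the comparison differently.

```python
from collections import Counter


def choice_share(labels: list[str], option: str = "aligned") -> float:
    """Fraction of labels that equal `option`."""
    if not labels:
        raise ValueError("need at least one label")
    return Counter(labels)[option] / len(labels)


def alignment_gap(model_labels: list[str], human_labels: list[str]) -> float:
    """Absolute difference between model and human share of the aligned action.

    Assumption: a simple absolute difference, chosen for illustration only.
    """
    return abs(choice_share(model_labels) - choice_share(human_labels))


# Hypothetical labels for one scenario: 10 human preferences and
# 10 judged model responses (sampling the model repeatedly is an assumption).
human = ["aligned"] * 7 + ["opposing"] * 3
model = ["aligned"] * 10

print(f"human share: {choice_share(human):.2f}")
print(f"model share: {choice_share(model):.2f}")
print(f"gap: {alignment_gap(model, human):.2f}")
```

A gap of zero here would not mean the model is "empathetic"; it would only mean its choices are split the same way the annotators' were.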

What they found

They tested 25 LLMs across scenarios covering professional composure, conflict resolution, practical tasks, and everyday decision-making. The results show two distinct kinds of gaps:

Deviation from consensus — Where human annotators mostly agree on the preferred action, but the model picks something else. This is the straightforward misalignment case.

Missing the range — Where humans disagree among themselves and there’s no clear consensus, but the model still picks a single option, failing to reflect the diversity of human opinion.

The second one is more interesting to me. It suggests that current alignment methods might be optimizing for a single “correct” behavior when real human social dynamics are messier. If a model can’t capture the fact that reasonable people disagree, it’s going to come across as robotic or tone-deaf in nuanced situations.
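
The two kinds of gap can be told apart with a small classifier over per-scenario shares. This is a sketch under my own assumptions: the 0.8 "consensus" and 0.9 "decisive" thresholds and the label names are hypothetical, not values from the paper.

```python
def classify_gap(
    human_share: float,
    model_share: float,
    consensus: float = 0.8,  # hypothetical: humans "mostly agree" at or above this
    decisive: float = 0.9,   # hypothetical: model "picks a single option" at or above this
) -> str:
    """Label one scenario given the share of humans and of model responses
    that chose the trait-aligned action (both in [0, 1])."""
    humans_agree = max(human_share, 1 - human_share) >= consensus
    model_decisive = max(model_share, 1 - model_share) >= decisive

    if humans_agree:
        # Consensus exists: check whether the model's majority choice matches it.
        same_side = (human_share >= 0.5) == (model_share >= 0.5)
        return "matches_consensus" if same_side else "deviation_from_consensus"

    # No consensus: a decisive model fails to reflect the human split.
    return "missing_the_range" if model_decisive else "reflects_disagreement"


print(classify_gap(0.9, 0.1))  # humans agree, model goes the other way
print(classify_gap(0.5, 1.0))  # humans split, model always picks one option
```

Where exactly to put the thresholds is a judgment call, and moving them will change how many scenarios land in each bucket.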

What this means in practice

This is early work, and the authors are upfront about that. But the direction is promising. Most alignment research focuses on safety and harm reduction — making sure models don’t say racist stuff or help people build bombs. That’s important, but behavioral alignment is a different problem. It’s about whether models can navigate the subtle, context-dependent social norms that humans handle intuitively.

The framework also sidesteps one of my pet peeves: using synthetic benchmarks that have no grounding in established science. By anchoring their scenarios in validated psychometric instruments, they’re building on decades of research rather than starting from scratch. That gives the results more weight than yet another “we asked GPT-4 some questions and it did fine” blog post.

That said, the reliance on LLM-as-a-judge for response mapping introduces its own biases. If the judge model has blind spots, those propagate through the evaluation. And the scenarios are generated by LLMs, which means they inherit whatever quirks and limitations the generation model has. The human validation helps, but it’s not a perfect filter.

The bottom line

Google Research has put together a framework that actually measures something worth measuring: not whether models say they’re aligned, but whether they behave in ways that match human social expectations. The gaps they found in 25 models suggest there’s real work to do, especially in capturing the diversity of human opinion when consensus doesn’t exist.

I’d like to see this extended to more models and more scenarios, and I’d love to see the evaluation data released openly so others can replicate and build on it. For now, it’s a solid step forward in a space that badly needs more rigorous, psychologically-grounded approaches.
