I Asked Six LLMs Expert-Level Physics Questions. The Results Were Surprising

Can an LLM actually help a physicist think through a hard problem? Not just summarize Wikipedia or write an email, but engage with open, unresolved questions in a specialized field? Google Research and Cornell University decided to find out, using high-temperature superconductivity as their test case.

The results were published in PNAS. The team tested six LLM systems, a mix of general-purpose models and specialized tools, on questions that would challenge a graduate student in condensed matter physics. A panel of experts graded the responses on accuracy, comprehensiveness, and how well they handled competing theories.

The top performers weren’t the biggest models. They were NotebookLM and a custom-built system, both of which draw from a closed set of curated, quality-controlled sources. That matters more than I expected.

The problem with open questions

High-temperature superconductivity in cuprates is a mess in the best way. Since the discovery in 1986, which won the Nobel Prize in Physics the following year, physicists have published thousands of papers using different experimental techniques. Multiple competing theories exist. The sheer volume of literature makes it hard for anyone, even experienced researchers, to stay on top of everything.

A good LLM could act as a neutral, knowledgeable tutor or thought partner here. But only if it can navigate conflicting evidence without parroting the most common answer or ignoring minority viewpoints.

What they actually tested

The researchers asked questions that required more than fact retrieval. Things like: “What experimental evidence supports the spin-fluctuation mechanism for pairing in cuprates, and what are its main challenges?” You can’t just pull that from a training corpus — you need to synthesize, weigh evidence, and acknowledge uncertainty.

Six systems were evaluated. The two that performed best — NotebookLM and a custom retrieval-augmented generation (RAG) system — both relied on a curated set of papers and sources. The general-purpose LLMs without that guardrail did noticeably worse, often giving confident but incomplete or slightly wrong answers.

This doesn’t surprise me. I’ve seen this pattern before in other domains: a model with access to a trusted knowledge base beats a larger model that has to rely on whatever it memorized during training. For science, where accuracy matters and errors propagate, this is a big deal.

Where they still fall short

The paper also identified clear weaknesses. Even the best systems struggled with questions that required understanding the relative importance of different experiments. They could list evidence, but they couldn’t always judge which pieces of evidence were more significant. They also sometimes failed to flag when a theory was controversial or when a particular result had been challenged later.

That’s a harder problem to solve. It’s not just about having the right papers — it’s about understanding the sociology of a field, knowing which results are considered foundational and which are fringe. That kind of tacit knowledge is tough to encode.

What this means for AI in science

I’ve been watching AI tools for research for a while, and this study confirms something I’ve suspected: the “firehose of everything” approach doesn’t work well for specialized knowledge work. A model that answers from whatever it absorbed from Reddit, GitHub, and random blog posts is going to have a harder time staying precise than one that grounds its answers in peer-reviewed literature.

But the bigger takeaway is that we’re still early. The best system in this test was good enough to be useful as a tutor or a starting point for literature review, but not good enough to be a trusted research partner without human oversight. That’s fine — it’s a tool, not a replacement.

If you’re a physicist working on cuprates, you probably won’t replace your reading group with NotebookLM anytime soon. But if you’re a grad student trying to get up to speed on a messy field, having a system that can give you a balanced, referenced overview of competing theories is genuinely valuable.

I’d like to see this kind of evaluation extended to other fields — biology, materials science, maybe even my own corner of machine learning. The methodology here is sound, and the results are actionable. That’s rare in AI evaluation papers.

The full paper is worth a read if you’re interested in how we might actually build trustworthy AI for science. The short version: curated sources matter, open questions are hard, and we’re not there yet. But we’re closer than I thought we were a year ago.
