Just Ask the Model: One-Shot LLM Research Evaluation and Structured Expert Review

Authors and Affiliations

Valentin Klotzbücher (University of Basel & University Hospital Basel)

David Reinstein (The Unjournal)

Lorenzo Pacchiardi (University of Cambridge, Leverhulme Centre for the Future of Intelligence)

Tianmai Michael Zhang (University of Washington)

Published

April 27, 2026

Abstract

Peer review is strained, and AI tools generating referee-like feedback are already being adopted by researchers and commercial services—yet field evidence on how reliably frontier LLMs can evaluate research remains scarce. We compare structured one-shot evaluations by GPT-5 Pro against paid expert review packages from The Unjournal, an open evaluation platform covering economics and social-science working papers, where both humans and the model rate papers on seven percentile criteria with uncertainty intervals and provide narrative critiques. Treating human evaluations as a high-quality but noisy reference signal, we find that GPT-5 Pro approaches the agreement levels observed among human evaluators themselves on several criteria, while exhibiting consistent failure modes: compressed rating scales, uneven criterion coverage, and variable identification of expert-flagged concerns. Our results suggest that even a minimal one-shot setup—a single prompt with a fixed rubric, no iteration or retrieval augmentation—yields LLM ratings that track expert judgment roughly as well as an additional expert rater would, though central compression and uneven qualitative coverage indicate clear limitations. Appendix results for five additional models confirm the pattern across capability tiers.

Introduction


This work is a collaboration with The Unjournal, which has published 55+ open evaluation packages for global-priorities research. Funding: Survival and Flourishing Fund, Long Term Future Fund, EA Funds.

Peer review is under strain. Reviewers are hard to find, turnaround times are lengthening, and the system costs an estimated $1.5 billion per year in the United States alone (Aczel, Szaszi, and Holcombe 2021). At the same time, generative AI lowers the cost of producing polished manuscripts; in at least some fields, editors report submission growth that exceeds reviewer capacity, and explicitly link this trend to LLM-assisted writing (Spitzer 2026). This combination creates demand for automated support in editorial and pre-submission workflows.

Commercial AI reviewer products—such as Refine (Refine n.d.), IsItCredible (IsItCredible.com n.d.), and QED Science (QED Science 2026)—already market automated referee-like feedback directly to authors, though they explicitly disclaim substituting for human peer review and their internal architectures remain undocumented. OpenAIReview (Hsu and Tan 2026) takes a different approach: an open-source, transparent pipeline that uses progressive prompting to generate detailed critiques at roughly $4 per paper. Meanwhile, publishers are formalizing policies that restrict reviewer use of general-purpose AI tools while permitting controlled in-house applications for screening tasks (Elsevier 2025; Leung 2026). Our study complements these efforts by asking how far the simplest possible setup—a single prompt to a frontier model, with no bespoke pipeline—can go.

These developments make the evidentiary gap salient: funders, editors, and policymakers need to know when AI evaluation outputs are trustworthy enough to use, and when they are unstable, biased, or manipulable. Recent work documents three interlocking concerns. Reproducibility can be “jagged” across models and time (Thomas, Romasanta, and Pujol Priego 2026), and subtle task reframings can induce systematic output shifts reminiscent of specification search (Asher et al. 2026). Adversarial manipulation is not hypothetical: invisible prompt-injection text can inflate LLM review scores in simulated peer review (Choi et al. 2026). And even without manipulation, AI reviews tend to be less thematically diverse and less focused on interpretation and originality than human reviews (Rajakumar et al. 2026), while LLM scoring exhibits range restriction and halo effects that distort agreement metrics (Wang et al. 2025).

The central question we address is therefore: how reliably can frontier LLMs evaluate research, relative to expert peer review and under realistic levels of rater disagreement? We study this question in a setting designed to make “expert judgment” observable and multi-dimensional rather than implicit.

We use The Unjournal’s structured human evaluations as a reference signal. We prompt GPT-5 Pro—a frontier reasoning model—with the same rubric and guidelines used by human evaluators, then compare the resulting quantitative ratings and qualitative critiques against expert evaluations for 60 economics and social-science working papers. We ask whether a frontier LLM evaluation can approximate expert judgment, where systematic differences arise, and whether the model reveals characteristic AI preferences over research. Our headline finding is that GPT-5 Pro matches or exceeds pairwise human inter-rater rank agreement on overall quality—that is, the LLM-human correlation is comparable to the human-human correlation, which is the appropriate benchmark given substantial disagreement among human evaluators themselves. Appendix results for five additional models spanning different capability and cost tiers confirm the pattern. This suggests that top reasoning models can currently serve as supplementary raters in structured evaluation pipelines, even under our minimal one-shot setup.
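To make the headline comparison concrete, the sketch below computes the two rank-agreement statistics described above under simplifying assumptions: a hypothetical data frame overall_ratings with one row per paper and illustrative columns human_1 and human_2 (two expert overall percentile ratings) and llm (the model's overall percentile rating). This is not the paper's actual analysis code, only the shape of the benchmark.

# Hedged sketch of the agreement benchmark: compare the human-human rank
# correlation on overall quality with the LLM-human correlation.
# `overall_ratings`, `human_1`, `human_2`, and `llm` are assumed,
# illustrative names, not objects from the study's pipeline.
human_human <- cor(overall_ratings$human_1, overall_ratings$human_2,
                   method = "spearman", use = "pairwise.complete.obs")

llm_human <- mean(c(
  cor(overall_ratings$llm, overall_ratings$human_1,
      method = "spearman", use = "pairwise.complete.obs"),
  cor(overall_ratings$llm, overall_ratings$human_2,
      method = "spearman", use = "pairwise.complete.obs")
))

# The LLM behaves like an additional expert rater when llm_human
# falls in the same range as human_human.
c(human_human = human_human, llm_human = llm_human)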

Our approach is minimal by design: each model receives the same PDF and a fixed rubric in a single prompt, with no iteration, retrieval augmentation, chain-of-thought scaffolding, or multi-step agentic loop. This makes our results a conservative lower bound on what LLM-based evaluation can currently achieve. If frontier models already yield meaningful agreement with expert reviewers under the simplest possible setup, more sophisticated pipelines—structured measurement schemas (Asirvatham, Mokski, and Shleifer 2026), iterative quality-checking workflows (Zhang and Abernethy 2025), or the kind of prompt-robustness engineering motivated by specification-search concerns (Asher et al. 2026)—should improve further. Quantifying how much headroom remains above this one-shot baseline, and which pipeline elements unlock it, is a key direction for future work.
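As an illustration of what "one-shot" means in practice, the following sketch sends the fixed rubric and the paper content in a single API request and reads back the evaluation text. The endpoint, model identifier string, file names, and the prior text-extraction step are assumptions for illustration only; the study's actual request details (for example, direct PDF input) may differ.

# Illustrative one-shot call: one request, fixed rubric, no iteration,
# retrieval, or agentic scaffolding. Model string and file names are
# placeholders, not the exact configuration used in the study.
library(httr2)

rubric     <- paste(readLines("rubric_prompt.txt"), collapse = "\n")
paper_text <- paste(readLines("paper_extracted.txt"), collapse = "\n")

resp <- request("https://api.openai.com/v1/chat/completions") |>
  req_auth_bearer_token(Sys.getenv("OPENAI_API_KEY")) |>
  req_body_json(list(
    model = "gpt-5-pro",  # placeholder identifier
    messages = list(
      list(role = "system", content = rubric),
      list(role = "user",   content = paper_text)
    )
  )) |>
  req_perform()

evaluation_text <- resp_body_json(resp)$choices[[1]]$message$content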

The Unjournal setting is particularly well suited for this comparison. It commissions paid expert evaluations using a structured rubric covering seven percentile criteria with 90% credible intervals plus journal-tier predictions, and publishes the resulting packages openly rather than making binary accept/reject decisions—which may increase reviewer effort through accountability and transparency. The resulting ratings and critiques still exhibit substantial inter-rater variation; accordingly, we treat human evaluations as a high-quality but noisy reference signal, not ground truth. The rich, multi-dimensional data allow us to compare the priorities and calibration of humans and AI models across criteria and domains, while eventual publication outcomes provide an external validation opportunity for the journal-tier predictions,¹ enabling a human-vs-LLM horse race. Finally, The Unjournal’s pipeline of future evaluations allows for clean out-of-training-data predictions, serving as a live testing lab for prospective validation.
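For readers unfamiliar with the rubric, the sketch below shows the shape of one evaluator's structured ratings as described above: seven percentile criteria, each with a midpoint and a 90% credible interval, plus a journal-tier prediction. The criterion labels, numeric values, and tier scale shown here are illustrative placeholders, not actual Unjournal data or the rubric's exact wording.

# Shape of one structured rating package (illustrative values only).
one_rating <- data.frame(
  criterion = c("overall", "claims_evidence", "methods", "advancing_knowledge",
                "logic_communication", "open_science", "global_relevance"),
  midpoint  = c(70, 65, 60, 75, 68, 80, 72),  # percentile relative to a reference group
  ci_lower  = c(55, 50, 45, 60, 55, 65, 58),  # lower bound of 90% credible interval
  ci_upper  = c(85, 80, 75, 88, 82, 92, 86)   # upper bound of 90% credible interval
)
journal_tier_prediction <- 3.5  # assumed 0-5 tier scale, placeholder value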

Do authors engage with and respond to Unjournal evaluations? Manual tracking of 57 evaluations finds that 16 papers received formal written responses and at least 5 show clear evidence of substantive revision in response to the feedback—including one paper with over 3,000 net line changes. A further 7 authors stated an intention to update their paper.

See the full tabulation, “Did authors adjust their papers?”, for the interactive table and public author statements. For broader context, see “Evidence: do authors engage with evaluations?” in The Unjournal’s knowledge base.


  1. These represent verifiable publication outcomes, not statements about the “true quality” of the paper.↩︎