Discussion
GPT-5 Pro achieves moderate-to-strong agreement with human expert evaluators, approaching the agreement levels observed among the human evaluators themselves. The most consistent pattern is central compression: the model pulls extreme ratings toward the middle of the scale, likely reflecting alignment training that discourages confident extreme outputs. Qualitative coverage varies widely across papers: on some, the model captures nearly all consensus human concerns; on others, it misses key critiques or raises issues absent from the expert consensus. Appendix results for five additional models confirm these patterns across capability tiers, with reasoning-capable models outperforming lightweight ones.
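To make these patterns concrete, the sketch below shows one way to quantify rank agreement and central compression from paired per-paper ratings. The rating arrays are hypothetical, and the metric choices (Spearman correlation, SD ratio, regression slope) are illustrative assumptions rather than the paper’s exact pipeline:

```python
# Illustrative check for central compression and rank agreement.
# Assumes two aligned arrays of per-paper overall ratings on a 0-100 scale;
# the values and metric choices are ours, not the paper's exact analysis.
import numpy as np
from scipy.stats import spearmanr

human = np.array([35, 48, 55, 62, 70, 78, 85, 90, 25, 66], dtype=float)  # hypothetical
model = np.array([50, 55, 58, 60, 64, 70, 74, 76, 48, 63], dtype=float)  # hypothetical

rho, p = spearmanr(human, model)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # rank agreement

# Central compression: model ratings span a narrower range than human ratings.
sd_ratio = model.std(ddof=1) / human.std(ddof=1)
print(f"SD ratio (model/human) = {sd_ratio:.2f}")  # < 1 indicates compression

# Equivalent view: slope of model on human ratings; a slope well below 1
# means extreme human scores are pulled toward the model's own mean.
slope = np.polyfit(human, model, 1)[0]
print(f"Regression slope = {slope:.2f}")
```

An SD ratio and regression slope well below 1, alongside a moderate-to-strong Spearman correlation, is exactly the signature of a model that ranks papers sensibly but hedges toward the middle of the scale.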
Limitations. Several caveats temper these conclusions. Our sample comprises roughly 50 social-science papers specifically selected by The Unjournal for evaluation, not a random draw from the research literature; performance may differ in other fields or on less polished manuscripts. Human evaluations are themselves a noisy reference signal rather than ground truth, with substantial inter-rater variation that caps achievable agreement. We cannot fully rule out knowledge contamination: although the system prompt instructs the model to ignore prior knowledge about authors, institutions, and publication history, the models’ training data may include fragments of these papers or related discussions; robustness checks with models whose training cutoffs predate the papers would help address this concern. Alignment training likely contributes to score inflation and to narrower credible intervals than human evaluators provide. All LLM evaluations are single-run, so between-run variability is unmeasured. Finally, the qualitative coverage and precision metrics are themselves LLM-assessed (GPT-5.2 Pro as judge), introducing a further layer of model dependence.
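The reliability ceiling noted above can be stated precisely. Under the standard psychometric assumption of independent rater noise (our framing, not a result from the paper), Spearman’s attenuation formula bounds any model’s observable agreement with a single human rating:

```latex
% Spearman's attenuation bound (our framing, assuming independent rater noise).
% If a single human rating H measures latent paper quality T with reliability
% r_hh (estimated, e.g., by the correlation between two human raters), then for
% any model score M:
\[
  \rho(M, H) \;=\; \rho(M, T)\,\sqrt{r_{hh}} \;\le\; \sqrt{r_{hh}}.
\]
% For example, an inter-rater correlation of r_hh = 0.5 caps observable
% model-human agreement at sqrt(0.5) ~ 0.71, even for a model that tracks
% the latent quality T perfectly.
```

This is why "approaching the agreement levels observed among human evaluators themselves" is close to the best outcome the reference signal permits.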
Implications. Even a reasoning-capable model costing several dollars per paper is orders of magnitude cheaper than human expert review. However, the qualitative gaps we observe (missed critiques, generic issues, and central compression of ratings) argue against full automation of peer review. AI evaluation appears most promising as a supplement: providing fast, structured feedback; flagging potential concerns for human reviewers; and enabling systematic comparisons across large paper sets that would be infeasible with human effort alone.
Governance and attack surface. As AI review tools move from research prototypes to deployed products, the attack surface expands. Prompt-injection techniques, such as hidden instructions embedded in a manuscript’s metadata, footnotes, or white-on-white text, could steer model outputs toward inflated ratings or suppressed critiques. Because our pipeline (and similar commercial services) routes unpublished manuscripts through third-party APIs, confidentiality ultimately depends on the provider’s data handling; strict retention and non-training agreements, or on-premise deployment, are the practical safeguards. Over-reliance on AI scores introduces a further governance risk: if editorial decisions place weight on model ratings, authors may optimise papers for the model rather than for scientific rigour, creating a Goodhart dynamic. Finally, current evaluations reflect a single model checkpoint; model updates, alignment changes, or fine-tuning can shift ratings in ways that are invisible to users. We recommend that any operational deployment include adversarial red-teaming of prompts, formal confidentiality agreements with API providers, transparent disclosure of AI involvement in review, and periodic re-calibration against fresh human evaluations.
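One low-cost mitigation is to screen extracted manuscript text for injection artifacts before it reaches the evaluator. The sketch below is a minimal illustration only; the phrase list, heuristics, and function name are our assumptions, and such filters would not stop a determined attacker. Hits should be treated as flags for human review, not as verdicts:

```python
# Minimal pre-screening of extracted manuscript text for prompt-injection
# artifacts. Illustrative sketch: the patterns and thresholds are our choices.
import re

INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"you are (now |no longer )?(an? )?(assistant|reviewer)",
    r"rate this (paper|manuscript) (as|a) ",
    r"system prompt",
]

def screen_text(text: str) -> list[str]:
    """Return human-readable flags raised by the heuristics."""
    flags = []
    lowered = text.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            flags.append(f"suspicious phrase: /{pattern}/")
    # White-on-white text is invisible in the rendered PDF but survives text
    # extraction. As a crude proxy, flag long runs of text with no sentence
    # punctuation, which rarely occur in genuine manuscript prose.
    for chunk in re.split(r"\n{2,}", text):
        if len(chunk) > 500 and not re.search(r"[.!?]", chunk):
            flags.append("long unpunctuated block (possible hidden text)")
    return flags

# Example: escalate to a human if screen_text(extracted_pdf_text) is non-empty.
```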
Future directions. The limitations above map directly onto next steps. Multi-run robustness and prompt-sensitivity analyses would quantify the between-run variability that our single-run design leaves unmeasured (see the sketch below). Out-of-time validation on papers entering The Unjournal’s pipeline after model training cutoffs would eliminate residual contamination concerns. Human validation of the LLM-assessed coverage and precision metrics would close the model-dependence loop. Finally, comparing human and LLM tier predictions against verified publication venues would offer a rare opportunity for externally verifiable accuracy measurement.
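As a sketch of the multi-run protocol, assuming a hypothetical evaluate_paper wrapper that returns one overall rating per call (the function name, run count, and choice of median are placeholders, not the paper’s method):

```python
# Sketch of multi-run aggregation for LLM ratings. `evaluate_paper` is a
# hypothetical wrapper around the evaluation pipeline; n_runs and the use
# of a median are our illustrative choices.
import statistics

def aggregate_ratings(evaluate_paper, paper_id: str, n_runs: int = 5) -> dict:
    """Run the evaluator n_runs times and summarize central tendency and spread."""
    ratings = [evaluate_paper(paper_id) for _ in range(n_runs)]
    return {
        "median": statistics.median(ratings),
        "mean": statistics.fmean(ratings),
        "stdev": statistics.stdev(ratings) if n_runs > 1 else 0.0,
        "runs": ratings,
    }
```

A between-run standard deviation comparable to the human inter-rater spread would indicate that single-run scores understate the model’s own variability, directly informing how much weight any one rating deserves.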