---
title: "Results: Critiques & Key Issues"
engine: knitr
execute:
  freeze: auto
---
::: {.content-visible when-format="pdf"}
Detailed paper-by-paper critique comparisons (matched issue pairs, severity labels, structural differences) are available in the online supplement at <https://llm-uj-research-eval.netlify.app/results_critiques.html>.
:::
::: {.content-visible when-format="html"}
```{r}
#| label: setup-critiques
#| code-summary: "Setup and libraries"
#| code-fold: true
#| message: false
#| warning: false
library("tidyverse")
library("jsonlite")
library("knitr")
library("kableExtra")
library("DT")
library("reticulate")
# Theme colors — crisp palette (matching results.qmd)
UJ_ORANGE <- "#E8722A" # vivid saffron orange
UJ_GREEN <- "#2D9D5E" # rich emerald green
UJ_BLUE <- "#2B7CE9" # clear azure blue
# Severity colors for badges
SEV_NECESSARY <- "#D62839" # Bright crimson
SEV_OPTIONAL <- "#E8722A" # Saffron orange
SEV_UNSURE <- "#8B95A2" # Cool grey
theme_uj <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      panel.grid.minor = element_blank(),
      panel.grid.major = element_line(linewidth = 0.3, color = "grey88"),
      plot.title = element_text(face = "bold", size = rel(1.1)),
      plot.title.position = "plot",
      plot.subtitle = element_text(color = "grey40", size = rel(0.9)),
      plot.caption = element_text(color = "grey50", size = rel(0.8), hjust = 0),
      axis.title = element_text(size = rel(0.95)),
      axis.text = element_text(size = rel(0.88)),
      legend.position = "bottom",
      legend.text = element_text(size = rel(0.88)),
      legend.title = element_text(size = rel(0.9), face = "bold"),
      strip.text = element_text(face = "bold", size = rel(0.95))
    )
}

# Function to create severity badge HTML
severity_badge <- function(severity) {
  sev <- tolower(trimws(severity))
  if (grepl("necessary", sev)) {
    return('<span style="background-color:#D62839;color:white;padding:2px 6px;border-radius:3px;font-size:0.8em;">Necessary</span>')
  } else if (grepl("optional", sev)) {
    return('<span style="background-color:#E8722A;color:white;padding:2px 6px;border-radius:3px;font-size:0.8em;">Optional</span>')
  } else if (sev != "") {
    return('<span style="background-color:#8B95A2;color:white;padding:2px 6px;border-radius:3px;font-size:0.8em;">Unsure</span>')
  }
  return('<span style="background-color:#bdc3c7;color:white;padding:2px 6px;border-radius:3px;font-size:0.8em;">\u2014</span>')
}

# Function to truncate text for table display
truncate_text <- function(text, max_chars = 80) {
  if (nchar(text) > max_chars) {
    paste0(substr(text, 1, max_chars), "...")
  } else {
    text
  }
}
```
This chapter reports preliminary results from comparing LLM-identified key issues against human expert critiques. The analysis uses LLM-based assessment (GPT-5.2 Pro as judge) to measure coverage and precision; human validation of these alignment scores is underway.
::: callout-warning
This section is under active development. Coverage and precision metrics below are LLM-assessed; manual annotation is in progress.
:::
**Data sources.** We compare GPT-5.2 Pro key issues (structured output from the focal evaluation run, January 2026) against human expert critiques curated from The Unjournal's internal tracking database, where evaluation managers synthesize evaluator feedback using severity labels ("Necessary", "Optional but important", "Unsure").
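
For illustration, the chunk below renders one badge per severity label using the `severity_badge()` helper defined in the setup chunk; the three label strings are the Coda severity categories quoted above, not data from any particular paper.

```{r}
#| label: severity-badge-example
#| code-fold: true
#| code-summary: "Example: severity labels rendered as badges"
#| results: asis
# Render one example badge per Coda severity label using the helper defined above.
for (lab in c("Necessary", "Optional but important", "Unsure")) {
  cat(severity_badge(lab), " ")
}
```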
```{r}
#| label: load-comparison-data
#| code-fold: true
#| code-summary: "Load comparison data"
# Load the matched comparison data
comparison_file <- "results/key_issues_comparison.json"
comparison_results_file <- c(
  "results/key_issue_comp_results.json",
  "results/key_issues_comparison_results.json"
)
comparison_results_file <- comparison_results_file[file.exists(comparison_results_file)][1]

if (file.exists(comparison_file)) {
  comparison_data <- fromJSON(comparison_file)
  n_papers <- nrow(comparison_data)
} else {
  comparison_data <- NULL
  n_papers <- 0
}

# Load LLM-based comparison results (coverage/precision) if available
if (!is.null(comparison_data) && !is.na(comparison_results_file)) {
  llm_results_raw <- fromJSON(comparison_results_file)
  llm_results <- llm_results_raw |>
    as_tibble() |>
    unnest_wider(comparison) |>
    select(
      gpt_paper,
      coverage_pct,
      precision_pct,
      # New format uses matched_pairs, unmatched_human, unmatched_llm
      any_of(c("matched_pairs", "unmatched_human", "unmatched_llm")),
      # Old format used missed_issues, extra_issues (keep for backward compat)
      any_of(c("missed_issues", "extra_issues")),
      overall_rating,
      overall_justification,
      detailed_notes
    )
  comparison_data <- comparison_data |>
    left_join(llm_results, by = "gpt_paper")
}

# Update paper count after loading
if (!is.null(comparison_data)) {
  n_papers <- nrow(comparison_data)
}
```
We matched **`r n_papers` papers** with both GPT-5.2 Pro key issues and human expert critiques.
**LLM-based assessment.** Alignment was assessed by GPT-5.2 Pro acting as judge, which scored coverage (the proportion of human concerns GPT also identified) and precision (whether GPT's issues correspond to substantive human concerns). Three metrics are reported (a minimal sketch of how they are computed follows this list):
- **Coverage**: Proportion of *consensus* human issues that have any LLM match (match quality >= 30%)
- **Weighted Coverage**: Mean match quality across all human issues (treating unmatched issues as 0%)
- **Precision**: Proportion of LLM issues that match a human concern
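
The sketch below shows how these metrics reduce to simple arithmetic over per-issue match-quality scores. The `human_match_quality` and `llm_match_quality` vectors and the 30% threshold are illustrative placeholders, not output from the judge model, so the chunk is not evaluated.

```{r}
#| label: metric-computation-sketch
#| eval: false
#| code-fold: true
#| code-summary: "Illustrative computation of coverage, weighted coverage, and precision"
# Hypothetical match-quality scores (0-100), for illustration only:
# the best LLM match for each consensus human issue, and the best human
# match for each LLM-raised issue. Unmatched issues score 0.
human_match_quality <- c(85, 60, 40, 10, 0)
llm_match_quality <- c(90, 70, 55, 0)
threshold <- 30 # minimum match quality to count as "matched"

coverage_pct <- 100 * mean(human_match_quality >= threshold) # share of human issues with any match
weighted_coverage_pct <- mean(human_match_quality)           # mean match quality, unmatched count as 0
precision_pct <- 100 * mean(llm_match_quality >= threshold)  # share of LLM issues matching a human concern
```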
::: callout-note
## Caveat on Precision
"Precision" only reflects whether LLM issues match the *consensus* human issues curated in Coda---those prioritized by evaluation managers as the most important concerns. Many LLM-identified issues may have been noted by one or more individual human evaluators but were not included in this curated set. A low precision score does not necessarily mean the LLM raised irrelevant concerns; it may simply reflect issues that weren't prioritized in the consensus summary.
:::
```{r}
#| label: check-llm-results
#| code-fold: true
# Check if LLM comparison results are available
has_llm_results <- !is.null(comparison_data) &&
  n_papers > 0 &&
  "coverage_pct" %in% names(comparison_data) &&
  any(!is.na(comparison_data$coverage_pct))
```
```{r}
#| label: llm-assessment-summary
#| tbl-cap: "LLM assessment of GPT vs human critique alignment"
if (has_llm_results) {
  # Filter to papers with valid LLM results
  llm_results <- comparison_data |>
    filter(!is.na(coverage_pct) & !is.na(precision_pct))
  summary_stats <- llm_results |>
    summarise(
      `Papers Assessed` = n(),
      `Mean Coverage (%)` = round(mean(coverage_pct, na.rm = TRUE), 1),
      `Mean Precision (%)` = round(mean(precision_pct, na.rm = TRUE), 1),
      `Coverage Range` = paste0(min(coverage_pct, na.rm = TRUE), "-", max(coverage_pct, na.rm = TRUE)),
      `Precision Range` = paste0(min(precision_pct, na.rm = TRUE), "-", max(precision_pct, na.rm = TRUE))
    ) |>
    mutate(across(everything(), as.character)) |>
    pivot_longer(everything(), names_to = "Metric", values_to = "Value")
  kable(summary_stats, align = c("l", "r"))
}
```
```{r}
#| label: fig-coverage-precision
#| fig-cap: "Coverage (% of human-identified issues captured by the LLM) vs Precision (% of LLM-raised issues matching a substantive human concern), as assessed by GPT-5.2 Pro acting as judge. Each point is one paper. Dashed blue lines mark cross-paper means. Papers in the upper-right show strong human\u2013LLM critique alignment."
#| fig-width: 10
#| fig-height: 7
if (has_llm_results) {
  llm_results <- comparison_data |>
    filter(!is.na(coverage_pct) & !is.na(precision_pct)) |>
    mutate(paper_short = str_trunc(gpt_paper, 25))
  ggplot(llm_results, aes(x = coverage_pct, y = precision_pct)) +
    geom_point(size = 4.5, color = UJ_ORANGE, alpha = 0.85) +
    geom_text(aes(label = paper_short), hjust = -0.1, vjust = 0.5, size = 2.6, check_overlap = TRUE) +
    geom_vline(xintercept = mean(llm_results$coverage_pct), linetype = "dashed", color = UJ_BLUE, linewidth = 0.6) +
    geom_hline(yintercept = mean(llm_results$precision_pct), linetype = "dashed", color = UJ_BLUE, linewidth = 0.6) +
    scale_x_continuous(limits = c(0, 100), breaks = seq(0, 100, 20)) +
    scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 20)) +
    labs(
      x = "Coverage (%) \u2014 Human issues captured by LLM",
      y = "Precision (%) \u2014 LLM issues that are substantive"
    ) +
    theme_uj()
}
```
::: {.callout-note collapse="true"}
## Interpretation Guide
- **Coverage**: Percentage of human-identified issues that GPT also captured (in some form). Higher = GPT missed fewer human concerns.
- **Precision**: Percentage of GPT issues that are substantive rather than generic or spurious. Higher = GPT's critiques are more targeted.
- **Overall Rating**: Qualitative assessment (Excellent/Good/Moderate/Poor) based on both coverage and precision metrics; the distribution of these labels is tabulated just below this guide.
:::
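
As a small supplement to the guide above, the chunk below tabulates the judge-assigned `overall_rating` labels when the results file is available. It only counts the field loaded earlier; it does not recompute or reinterpret the ratings, and it renders nothing if the field is absent.

```{r}
#| label: overall-rating-distribution
#| code-fold: true
#| code-summary: "Count of judge-assigned overall ratings"
#| tbl-cap: "Distribution of overall ratings assigned by the LLM judge"
# Count the qualitative overall_rating labels returned by the judge model.
if (has_llm_results && "overall_rating" %in% names(comparison_data)) {
  comparison_data |>
    filter(!is.na(overall_rating)) |>
    count(overall_rating, name = "Papers", sort = TRUE) |>
    kable(col.names = c("Overall rating", "Papers"), align = c("l", "r"))
}
```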
The coverage-precision figure above shows substantial variation across papers: some papers achieve both high coverage and high precision, while for others the LLM misses key human concerns or raises issues outside the expert consensus. Paper-by-paper comparisons with matched issue pairs, structural difference tables, and the manual annotation tool are available in the [full online version](https://llm-uj-research-eval.netlify.app/results_critiques.html).
:::