ResAIKit
Research Integrity Toolkit
C9 | Text analysis | Stylistic | Layer 1 (Deterministic)

Conclusions

Detects overgeneralised conclusions that use universal quantifiers, lack conditional language, make vague recommendations, and float untethered from the body's specific findings. Conclusions generated by large language models (LLMs) systematically strip qualifiers and make broader claims than the evidence supports.

Technical description

C9 operationalises the "conclusion constraints" dimension from the Anti-AI Vibe Review spec. It measures four aspects of conclusion quality: (1) the density of universal quantifiers and absolute language in the conclusion section, (2) the absence of conditional and constraining language, (3) the presence of recommendations without implementation criteria, and (4) whether the conclusion's claims are anchored to specific findings (p-values, N values, percentages, figure/table references) from the body of the document. To isolate the actual Conclusions section, the indicator uses section detection based on the IMRaD structure (Introduction, Methods, Results, and Discussion) rather than a fixed percentage heuristic. It runs at Layer 1 using only pattern matching and word-list matching.

How it works

Conclusion extraction. The text is partitioned into IMRaD sections using the shared section classifier. If a Conclusions section is detected, it is used as the analysis target. If no explicit Conclusions heading is found, the indicator falls back to the last 25% of the document (widened from the previous 20% to improve recall). The body text is everything preceding the Conclusions section (or the first 75% of the document in fallback mode).
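The extraction step can be sketched as follows. This is a minimal illustration, not the toolkit's code: the shared section classifier is not shown in the source, so the heading regex `CONCLUSION_HEADING`, the function name `split_conclusion`, and the choice to measure the 25% fallback in characters are all assumptions. The sketch also treats everything after the heading as the conclusion, whereas a real classifier would presumably stop at the next section.

```python
import re

# Hypothetical stand-in for the shared IMRaD section classifier:
# a line consisting only of "Conclusion(s)", optionally numbered.
CONCLUSION_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]*)?conclusions?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

FALLBACK_FRACTION = 0.25  # last 25% of the document, as described above


def split_conclusion(text: str) -> tuple[str, str, bool]:
    """Return (body, conclusion, used_fallback)."""
    matches = list(CONCLUSION_HEADING.finditer(text))
    if matches:
        heading = matches[-1]  # use the last matching heading
        return text[:heading.start()], text[heading.end():].strip(), False
    # Fallback mode: body is the first 75%, conclusion the last 25%
    # (measured in characters here, which is an assumption).
    cut = int(len(text) * (1 - FALLBACK_FRACTION))
    return text[:cut], text[cut:], True
```

Returning the `used_fallback` flag lets a caller report when the positional heuristic was used, since that mode is less precise.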

Sub-check 1, universal quantifier detection. The conclusion text is scanned against a per-language dictionary of universal quantifiers (15 English entries: all, always, never, clearly demonstrates, unequivocally, without doubt, definitively, undeniably, conclusively proves, every, none, no one, everybody, everything, nothing; 15 Romanian entries). Each match is counted and reported as a Finding with character position. A short-quantifier gating mechanism prevents false positives on factual usage: "all patients", "every sample", "none of the isolates" are not flagged because they refer to concrete study subjects rather than making absolute general claims. Each flagged quantifier contributes +0.5 to the score, capped at +2.0.
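A rough sketch of sub-check 1, using the English word list above and the study-subject nouns named under Limitations. The set of "short" quantifiers that are gated (`GATED`), the tolerance for an intervening "of the", and the acceptance of singular nouns are assumptions inferred from the three examples given; the names `find_universal_quantifiers` and `quantifier_score` are hypothetical.

```python
import re

UNIVERSAL_QUANTIFIERS = [
    "all", "always", "never", "clearly demonstrates", "unequivocally",
    "without doubt", "definitively", "undeniably", "conclusively proves",
    "every", "none", "no one", "everybody", "everything", "nothing",
]
STUDY_NOUNS = ["patients", "samples", "isolates", "specimens",
               "subjects", "participants", "measurements"]
GATED = {"all", "every", "none"}  # assumption: which quantifiers are gated

# Longest phrases first so multi-word entries are tried before short ones.
QUANT_RE = re.compile(
    r"\b(" + "|".join(re.escape(q) for q in
                      sorted(UNIVERSAL_QUANTIFIERS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)
# Matches e.g. " patients", " sample", " of the isolates" right after a quantifier.
GATE_RE = re.compile(
    r"\s+(?:of\s+)?(?:the\s+)?(?:" + "|".join(n[:-1] for n in STUDY_NOUNS) + r")s?\b",
    re.IGNORECASE,
)


def find_universal_quantifiers(conclusion: str) -> list[tuple[str, int]]:
    """Return (quantifier, character position) for each ungated match."""
    findings = []
    for m in QUANT_RE.finditer(conclusion):
        word = m.group(1).lower()
        if word in GATED and GATE_RE.match(conclusion, m.end()):
            continue  # factual usage such as "all patients": not flagged
        findings.append((word, m.start()))
    return findings


def quantifier_score(findings: list[tuple[str, int]]) -> float:
    """+0.5 per flagged quantifier, capped at +2.0."""
    return min(0.5 * len(findings), 2.0)
```

For example, in "All patients improved. This always works and nothing can fail." the sketch skips "All" and flags "always" and "nothing", giving a contribution of 1.0.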

Sub-check 2, conditional absence. The conclusion text is scanned against a per-language dictionary of conditional phrases (13 English entries: provided that, except for, conditional on, only when, in the context of, for populations, assuming that, under conditions, if and only if, unless, depending on, subject to, given that; 12 Romanian entries). If the conclusion contains more than three sentences and not a single conditional phrase is found, the sub-check fires. Contributes +1.0.
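Sub-check 2 could look roughly like this. The sentence splitter is a naive assumption (split on whitespace after `.`, `!`, or `?`); the source does not say how sentences are counted, and the function names are hypothetical.

```python
import re

CONDITIONALS = [
    "provided that", "except for", "conditional on", "only when",
    "in the context of", "for populations", "assuming that",
    "under conditions", "if and only if", "unless", "depending on",
    "subject to", "given that",
]


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def conditional_absence_score(conclusion: str) -> float:
    """+1.0 if there are more than three sentences and no conditional phrase."""
    lowered = conclusion.lower()
    has_conditional = any(
        re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
        for phrase in CONDITIONALS
    )
    long_enough = len(split_sentences(conclusion)) > 3
    return 1.0 if long_enough and not has_conditional else 0.0
```

The length condition keeps very short conclusions from being penalised for lacking a conditional they have little room to include.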

Sub-check 3, vague recommendations. The full text is scanned for recommendation patterns (is recommended, we recommend, should be, it is recommended in English; se recomanda, recomandam, ar trebui in Romanian). Each matched recommendation sentence is checked for specificity indicators: the presence of numbers, percentages, or conditional subordinators (when, if, for, during). Recommendations without any specificity indicator are flagged as vague. Each flagged recommendation contributes +0.5, capped at +1.0 (two recommendations).
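A sketch of sub-check 3 for English text, under the same naive sentence-splitting assumption. Treating any digit as the "number or percentage" indicator is a simplification, and the names `vague_recommendations` and `recommendation_score` are hypothetical.

```python
import re

# Recommendation patterns listed above (English only).
RECOMMENDATION_RE = re.compile(
    r"\b(?:it is recommended|is recommended|we recommend|should be)\b",
    re.IGNORECASE,
)
# Specificity indicators: any digit (numbers, percentages) or one of the
# conditional subordinators named in the description.
SPECIFICITY_RE = re.compile(r"\d|\b(?:when|if|for|during)\b", re.IGNORECASE)


def vague_recommendations(text: str) -> list[str]:
    """Return recommendation sentences that carry no specificity indicator."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        s for s in sentences
        if RECOMMENDATION_RE.search(s) and not SPECIFICITY_RE.search(s)
    ]


def recommendation_score(vague: list[str]) -> float:
    """+0.5 per vague recommendation, capped at +1.0."""
    return min(0.5 * len(vague), 1.0)
```

With the input "We recommend further screening. Screening is recommended for adults over 50. Care should be taken." the first and third sentences are flagged, while the second passes because it contains both "for" and a number.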

Sub-check 4, Conclusion-Results lexical anchoring. The body text is scanned for numeric and referential anchors: p-values (p < 0.05), sample sizes (N = 120), percentages (35%), figure/table references (Figure 1, Table 2), confidence intervals, and effect sizes (d = 0.52). These anchors form a set of "findings" that a properly tethered conclusion should reference. The conclusion text is then checked for how many of these specific anchors reappear. If fewer than two body anchors carry into the conclusion while the conclusion simultaneously uses more than two present-tense absolute verbs (demonstrates, proves, confirms, establishes, shows, reveals, indicates, validates, verifies, substantiates), the sub-check fires. Contributes +1.0.
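The anchoring check might be sketched as below. The regular expressions are illustrative guesses at the anchor formats shown in the examples (confidence intervals are left out for brevity), the normalisation of lowercasing and removing whitespace is an assumption, and verbs are counted by occurrence rather than by distinct form.

```python
import re

# Illustrative anchor patterns; the toolkit's actual patterns are not shown.
ANCHOR_PATTERNS = [
    r"\bp\s*[<=>]\s*0?\.\d+",             # p-values, e.g. p < 0.05
    r"\bn\s*=\s*\d+",                      # sample sizes, e.g. N = 120
    r"\d+(?:\.\d+)?\s*%",                  # percentages, e.g. 35%
    r"\b(?:figure|fig\.|table)\s*\d+",     # Figure 1, Table 2
    r"\bd\s*=\s*-?\d*\.\d+",               # effect sizes, e.g. d = 0.52
]
ANCHOR_RE = re.compile("|".join(f"(?:{p})" for p in ANCHOR_PATTERNS), re.IGNORECASE)

ABSOLUTE_VERBS = [
    "demonstrates", "proves", "confirms", "establishes", "shows",
    "reveals", "indicates", "validates", "verifies", "substantiates",
]
VERB_RE = re.compile(r"\b(?:" + "|".join(ABSOLUTE_VERBS) + r")\b", re.IGNORECASE)


def extract_anchors(text: str) -> set[str]:
    """Collect anchors in a normalised form (lowercase, no whitespace)."""
    return {re.sub(r"\s+", "", m.group(0).lower()) for m in ANCHOR_RE.finditer(text)}


def anchoring_score(body: str, conclusion: str) -> float:
    """+1.0 if fewer than two body anchors reappear and more than two absolute verbs occur."""
    carried = extract_anchors(body) & extract_anchors(conclusion)
    verb_count = len(VERB_RE.findall(conclusion))
    return 1.0 if len(carried) < 2 and verb_count > 2 else 0.0
```

Requiring both conditions means a conclusion with few anchors but measured wording, or with strong verbs backed by carried-over numbers, does not trigger the sub-check.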

The four contributions sum to a theoretical maximum of 5.0 (2.0 + 1.0 + 1.0 + 1.0).
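The arithmetic can be written out as a small function. The function name `c9_score` and its parameters are hypothetical; only the weights and caps come from the description above.

```python
def c9_score(n_quantifiers: int, conditionals_absent: bool,
             n_vague_recommendations: int, unanchored: bool) -> float:
    """Combine the four sub-check contributions (maximum 5.0)."""
    s1 = min(0.5 * n_quantifiers, 2.0)            # sub-check 1, capped at 2.0
    s2 = 1.0 if conditionals_absent else 0.0      # sub-check 2
    s3 = min(0.5 * n_vague_recommendations, 1.0)  # sub-check 3, capped at 1.0
    s4 = 1.0 if unanchored else 0.0               # sub-check 4
    return s1 + s2 + s3 + s4
```

For instance, three flagged quantifiers, no conditionals, one vague recommendation, and an anchored conclusion give 1.5 + 1.0 + 0.5 + 0.0 = 3.0.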

Why this matters

Peters and Chin-Yee's landmark 2025 study tested 10 prominent LLMs across 4,900 summaries of scientific papers from top journals (Nature, Science, The Lancet, New England Journal of Medicine) and compared them to the original texts. They found that LLM-generated summaries were 4.85 times more likely to contain broad, unqualified generalizations than human-written summaries (odds ratio = 4.85, 95% confidence interval [3.06, 7.70], p < 0.001) [1]. The problem is getting worse, not better: newer models including ChatGPT-4o (9x) and LLaMA 3 (39x) overgeneralised significantly more than older models like GPT-3.5 and Claude 2.

The study identified three types of overgeneralization. Generic generalizations replace specific, quantified findings with unqualified present-tense claims (e.g., "the drug reduced symptoms by 35% (p < 0.001)" becomes "the drug is an effective treatment"). Present-tense shifts convert past-tense reports of study-specific findings into present-tense statements of general fact (e.g., "participants showed improvement" becomes "the intervention improves outcomes"). Action-guiding generalizations transform descriptive findings into prescriptive recommendations without the necessary scope qualifiers (e.g., "clinicians should adopt this approach" without specifying which patient populations, under what conditions, or with what caveats).

Sub-check 4 directly targets the first two types. A conclusion that omits the specific p-values, effect sizes, and figure/table references from the body while deploying present-tense absolute verbs is performing exactly the transformation Peters and Chin-Yee documented. Sub-check 1 targets the absolute language that replaces qualified findings. Sub-check 2 targets the missing conditional scaffolding that would constrain a claim's scope.

A striking secondary finding from Peters and Chin-Yee is the "ironic rebound effect": prompting LLMs to "be accurate" paradoxically increased overgeneralization rates by approximately 2x [1]. This means that well-intentioned users who ask for accurate summaries may actually receive more overgeneralised output, a finding with direct implications for anyone using LLMs to draft conclusion sections.

Score thresholds

Score | Meaning
0 to 1 | Conclusions are well-constrained: quantifiers are qualified, conditionals scope the claims, recommendations specify implementation criteria, and specific findings from the body anchor the concluding statements. Typical of carefully written human academic prose.
2 to 3 | Moderate overgeneralization: some absolute language in the conclusion, conditionals are sparse, or the conclusion floats partially untethered from the body's specific findings. Common in AI-assisted drafts and hurried human writing.
4 to 5 | Severe overgeneralization: the conclusion reads as a set of absolute, unqualified claims with no conditional scaffolding, vague recommendations, and no tether to the body's numeric evidence. Highly consistent with LLM-generated conclusions (Peters & Chin-Yee, 2025).

Limitations

The IMRaD-based conclusion extraction depends on the section classifier recognising a Conclusions heading. Documents without standard IMRaD headings fall back to a positional heuristic (last 25% of text), which may include non-conclusion material or miss a conclusion placed earlier in the document. The positional fallback is deliberately wider (25% vs. the original 20%) to improve recall at the cost of some precision.

The conclusion-body anchor check uses exact string matching on normalised anchor forms. A conclusion that paraphrases a numeric finding ("a 25% reduction") while the body reports it differently ("reduced by 25.0%") will not register a match. The check is deliberately conservative to avoid false-positive anchor matches on coincidental numeric similarity.
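The mismatch described here is easy to reproduce with a toy version of the check. The percentage regex and the whitespace-stripping normalisation below are assumptions made for illustration, not the toolkit's actual normalisation rules.

```python
import re

PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s*%")


def percent_anchors(text: str) -> set[str]:
    """Extract percentages as exact strings with whitespace removed."""
    return {re.sub(r"\s+", "", m.group(0)) for m in PERCENT_RE.finditer(text)}


body = "Readmissions were reduced by 25.0% in the intervention arm."
conclusion = "The intervention achieved a 25% reduction in readmissions."

# "25.0%" and "25%" are different strings, so nothing is carried over.
shared = percent_anchors(body) & percent_anchors(conclusion)
```

Here `shared` is empty even though both sentences report the same finding, which is the conservative behaviour the paragraph describes.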

The universal quantifier gating mechanism exempts quantifiers followed by concrete study-subject nouns (patients, samples, isolates, specimens, subjects, participants, measurements). This prevents flagging legitimate factual statements but may miss genuinely overgeneralised claims that happen to use a gated construction.

References

  1. Peters U, Chin-Yee B. Generalization bias in large language model summarization of scientific research. Royal Society Open Science. 2025;12(4):241776. DOI: 10.1098/rsos.241776 https://royalsocietypublishing.org/doi/10.1098/rsos.241776
  2. HindSight: evaluating LLM-generated research ideas via future impact. arXiv preprint arXiv:2603.15164. 2026. https://arxiv.org/abs/2603.15164
  3. AI for auto-research: roadmap and user guide. arXiv preprint arXiv:2605.18661. 2026. https://arxiv.org/abs/2605.18661