ResAIKit
Research Integrity Toolkit
G4 | Text analysis | Hallucination | Layer 3

Citation Support Verification

Checks whether each cited paper actually supports the claim it is attached to, by retrieving the cited paper's title and abstract and measuring how much of the claim they cover. It looks up real sources and uses no language model.

Technical description

G4 covers the support question that the internal and existence checks leave open: granting that a reference is real and correctly identified, does the cited paper say what the citing sentence claims? It is the support layer of the citation stack, distinct from G1 (internal screening) and L3 (existence and identity). It runs on text of at least 200 words, extracts up to eight citations, retrieves each cited paper's title and abstract from an external scholarly index, and grades each citation as strong, partial or low support by measuring the directional weighted coverage of the claim by that evidence. Per-citation contributions are summed into a score from 0 to 5, capped at 5.

How it works

The implementation runs at Layer 3, querying Semantic Scholar with bounded concurrency, and uses no language model.

Citations and claim context. Author-year and identifier citations are located with their positions; numbered citations are left to the existence checks. For each citation the surrounding sentence is taken as the claim. Up to eight citations are checked per document.
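The locating step can be sketched as below. This is a minimal illustration, not the toolkit's code: the regular expressions, the Citation tuple and the function names are assumptions, since the text gives only the behaviour (author-year and identifier citations are located with positions, numbered citations are skipped, and the surrounding sentence becomes the claim).

```python
import re
from typing import NamedTuple

MAX_CITATIONS = 8  # from the text: up to eight citations per document

# Hypothetical patterns; the actual expressions are not given in the text.
AUTHOR_YEAR = re.compile(
    r"\((?P<authors>[A-Z][\w'-]+(?: et al\.| (?:and|&) [A-Z][\w'-]+)?),? "
    r"(?P<year>(?:19|20)\d{2})[a-z]?\)"
)
DOI = re.compile(r'\b10\.\d{4,9}/[^\s"<>)]+')
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class Citation(NamedTuple):
    kind: str    # "author_year" or "identifier"
    marker: str  # the citation text as it appears in the document
    start: int   # character offset of the marker
    claim: str   # the sentence that contains the marker


def sentence_around(text: str, pos: int) -> str:
    """Return the sentence containing character offset `pos`."""
    start = 0
    for boundary in SENTENCE_END.finditer(text):
        if boundary.end() <= pos:
            start = boundary.end()          # boundary lies before the citation
        else:
            return text[start:boundary.start()]
    return text[start:]


def find_citations(text: str) -> list[Citation]:
    """Locate author-year and DOI citations; numbered ones like [3] are ignored."""
    found = []
    for kind, pattern in (("author_year", AUTHOR_YEAR), ("identifier", DOI)):
        for m in pattern.finditer(text):
            marker = m.group().rstrip(".,;")  # drop trailing sentence punctuation
            found.append(Citation(kind, marker, m.start(), sentence_around(text, m.start())))
    found.sort(key=lambda c: c.start)
    return found[:MAX_CITATIONS]
```

The naive sentence splitter breaks on any full stop followed by whitespace, so an abbreviation in running text can cut a claim short or merge it with its neighbour.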

Evidence retrieval. For each citation the cited paper's title and abstract are fetched, first by identifier and then by an author-year search; an abstract shorter than thirty characters is treated as missing and the citation is recorded separately without affecting the score.
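A rough sketch of the retrieval order and the thirty-character rule follows. The two lookup callables stand in for calls to the scholarly index and are hypothetical, as are the citation dictionary keys and the concurrency limit of four (the text says only that concurrency is bounded). The sketch also assumes that the author-year search is tried only when the identifier lookup finds nothing, which the text does not spell out.

```python
import asyncio
from typing import Awaitable, Callable, Optional

MIN_ABSTRACT_CHARS = 30  # from the text: shorter abstracts count as missing
MAX_IN_FLIGHT = 4        # assumption: the actual concurrency bound is not stated

# A lookup takes a query string and returns {"title": ..., "abstract": ...} or None.
Lookup = Callable[[str], Awaitable[Optional[dict]]]


async def fetch_evidence(citation: dict, by_identifier: Lookup,
                         by_author_year: Lookup,
                         gate: asyncio.Semaphore) -> Optional[tuple[str, str]]:
    """Return (title, abstract), or None when no usable abstract is found."""
    async with gate:  # at most MAX_IN_FLIGHT lookups run at once
        paper = None
        if citation.get("identifier"):
            paper = await by_identifier(citation["identifier"])
        if paper is None and citation.get("author_year"):
            paper = await by_author_year(citation["author_year"])
    if paper is None:
        return None
    abstract = (paper.get("abstract") or "").strip()
    if len(abstract) < MIN_ABSTRACT_CHARS:
        return None  # recorded as "no abstract"; it must not affect the score
    return paper.get("title") or "", abstract


async def fetch_all(citations: list[dict], by_identifier: Lookup,
                    by_author_year: Lookup) -> list[Optional[tuple[str, str]]]:
    """Fetch evidence for every citation, preserving input order."""
    gate = asyncio.Semaphore(MAX_IN_FLIGHT)
    tasks = [fetch_evidence(c, by_identifier, by_author_year, gate) for c in citations]
    return list(await asyncio.gather(*tasks))
```

Passing the lookups in as parameters keeps the fallback logic testable without network access; a caller would supply functions that query the index.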

Weighted claim terms. The citation markers are stripped from the claim sentence, and the remaining content words are assigned weights: proper nouns (named entities) carry weight 2.0, numeric and measurement tokens 1.5 (excluding four-digit years), long content words of nine or more characters 1.5, and ordinary content words 1.0. Numeric tokens shorter than two characters are dropped.
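The weighting scheme might look like the sketch below. Only the four weights, the nine-character threshold and the two-character minimum for numeric tokens come from the text. Everything else is an assumption made for illustration: the stopword list, the token and marker patterns, the use of mid-sentence capitalisation as a stand-in for named-entity detection, and the choice to drop four-digit years entirely (the text says only that they are excluded from the 1.5 class).

```python
import re

# Hypothetical stopword list; the toolkit's own list is not given in the text.
STOPWORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was were with".split()
)
# Hypothetical marker pattern: author-year, numbered and DOI citations.
MARKER = re.compile(
    r"\([^()]*\b(?:19|20)\d{2}[a-z]?\)|\[\d+(?:[,\s-]+\d+)*\]|\b10\.\d{4,9}/\S+"
)
# Words (letters first, digits and inner hyphens allowed) or numbers with a unit suffix.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:[-'][A-Za-z0-9]+)*|\d+(?:\.\d+)?[A-Za-z%]*")
YEAR = re.compile(r"(?:19|20)\d{2}")


def weighted_terms(claim: str) -> dict[str, float]:
    """Map each content term of a claim sentence to its weight."""
    text = MARKER.sub(" ", claim)  # strip citation markers first
    weights: dict[str, float] = {}
    for index, match in enumerate(TOKEN.finditer(text)):
        raw = match.group()
        term = raw.lower()
        if raw[0].isdigit():
            if len(raw) < 2 or YEAR.fullmatch(raw):
                continue           # short numerics and years are dropped (assumption for years)
            weight = 1.5           # numeric or measurement token
        elif term in STOPWORDS:
            continue
        elif raw[0].isupper() and index > 0:
            weight = 2.0           # crude proper-noun heuristic (assumption)
        elif len(raw) >= 9:
            weight = 1.5           # long content word
        else:
            weight = 1.0           # ordinary content word
        weights[term] = max(weight, weights.get(term, 0.0))
    return weights
```

For the claim "Metformin lowered mortality by 12% in 3 Danish cohorts (Smith et al., 2020)." this sketch keeps six terms: "danish" at 2.0, "metformin", "mortality" and "12%" at 1.5, and "lowered" and "cohorts" at 1.0. The single-digit "3" and the citation marker are discarded.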

Directional coverage. The evidence term set is the content words and numeric tokens of the title and abstract combined. Coverage is the weighted share of the claim's terms that appear in the evidence: the sum of the weights of covered claim terms divided by the sum of all claim-term weights. Because the denominator is the claim, a long abstract does not dilute the signal the way a symmetric overlap would.
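The coverage ratio described above can be written in a few lines. The function names are illustrative, and returning 0.0 for a claim with no weighted terms is an assumption, since the text does not say how an empty claim is handled. A symmetric (Jaccard) overlap is included only to show the dilution effect the paragraph mentions.

```python
def directional_coverage(claim_weights: dict[str, float], evidence_terms: set[str]) -> float:
    """Weighted share of the claim's terms that also appear in the evidence."""
    total = sum(claim_weights.values())
    if total == 0:
        return 0.0  # assumption: an empty claim counts as uncovered
    covered = sum(w for term, w in claim_weights.items() if term in evidence_terms)
    return covered / total


def symmetric_overlap(claim_terms: set[str], evidence_terms: set[str]) -> float:
    """Unweighted Jaccard overlap, shown for contrast only."""
    union = claim_terms | evidence_terms
    return len(claim_terms & evidence_terms) / len(union) if union else 0.0


# Illustrative data: weights as the weighting step might produce them.
claim = {"metformin": 1.5, "mortality": 1.5, "12%": 1.5,
         "danish": 2.0, "cohorts": 1.0, "lowered": 1.0}
short_abstract = {"metformin", "mortality", "cohorts", "diabetes"}
long_abstract = short_abstract | {f"filler{i}" for i in range(40)}

same = directional_coverage(claim, short_abstract) == directional_coverage(claim, long_abstract)
diluted = symmetric_overlap(set(claim), long_abstract) < symmetric_overlap(set(claim), short_abstract)
```

In this example the covered weight is 1.5 + 1.5 + 1.0 = 4.0 out of 8.5, so coverage is about 0.47 for both abstracts, while the symmetric overlap falls from 3/7 to 3/47 once forty unrelated terms are added to the evidence.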

Grading and scoring. Coverage of 0.45 or more is strong support and contributes nothing. Coverage from 0.22 up to but not including 0.45 is partial support: the source is on topic but does not appear to cover the specific assertion, and it adds 0.35 at informational severity. Coverage below 0.22 is low support: little of the claim is reflected in the source, and it adds 0.70 at warning severity as a likely unsupported or misattributed citation. The per-citation contributions are summed and reported as min(5.0, total). The metadata returns the checked, strongly supported, low-support, unsupported and no-abstract counts and the total citation count.
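The thresholds and the capped sum translate directly into code. In the sketch below the constants come from the text, while the function names, the severity labels and the keys of the counts dictionary are hypothetical and do not reproduce the toolkit's metadata fields. A coverage of None stands for a citation whose abstract was missing.

```python
from typing import Optional

STRONG_MIN = 0.45    # coverage at or above this is strong support
PARTIAL_MIN = 0.22   # coverage at or above this (and below STRONG_MIN) is partial
PARTIAL_ADD = 0.35   # contribution of a partial-support citation
LOW_ADD = 0.70       # contribution of a low-support citation
SCORE_CAP = 5.0


def grade(coverage: float) -> tuple[str, float, Optional[str]]:
    """Return (grade, score contribution, severity) for one citation."""
    if coverage >= STRONG_MIN:
        return "strong", 0.0, None
    if coverage >= PARTIAL_MIN:
        return "partial", PARTIAL_ADD, "info"
    return "low", LOW_ADD, "warning"


def score_document(coverages: list[Optional[float]]) -> tuple[float, dict[str, int]]:
    """Sum per-citation contributions and cap the result at SCORE_CAP."""
    counts = {"checked": 0, "strong": 0, "partial": 0, "low": 0, "no_abstract": 0}
    total = 0.0
    for coverage in coverages:
        if coverage is None:
            counts["no_abstract"] += 1  # set aside; no effect on the score
            continue
        label, contribution, _severity = grade(coverage)
        counts["checked"] += 1
        counts[label] += 1
        total += contribution
    return min(SCORE_CAP, total), counts
```

With eight checked citations the uncapped maximum is 8 x 0.70 = 5.6, which is why the cap matters: a document whose eight citations are all low support reports exactly 5.0.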

Score thresholds

Score     Meaning
0 to 1    Checked citations are well covered by their sources. The paper appears to cite material that genuinely supports its claims.
2 to 3    Several citations cover their claims only partially or weakly. The document may contain citation padding, misrepresented findings, or real references attached to claims they do not support.
4 to 5    Most checked citations are barely reflected in their sources, a strong signal of systematic misattribution.

Why this matters

A reference that exists but does not support its claim is now one of the most common citation failures, and it is far harder to spot than an outright fake. A benchmark built specifically to ask whether an author had read what they cited found that a large majority of machine-generated citations, by some measures between half and nearly all, do not fully support the claims they are attached to, even when the cited papers are real. Work on source attribution in automated research agents reaches the same conclusion: systems routinely attach claims to sources that are real and accessible yet not consistent with the assertion. The problem predates language models: an older finding from the medical literature is that a meaningful fraction of citations misrepresent or fail to support the statements they accompany. The framing G4 borrows from scientific claim verification is the natural one: a piece of evidence either supports a claim, contradicts it, or does not contain enough information to judge, and a citation in either of the last two categories deserves a second look. By comparing the claim with the cited paper's own words, G4 surfaces the citations whose claims are weakly covered, so that they can be checked.

Limitations

G4 reasons from words, not meaning, so it shares the blind spots of any term-based comparison. It does not model negation or direction, so a claim that a treatment did not work and an abstract reporting that it did will look similar and can pass; the indicator measures weak coverage, not contradiction. It depends on an external index, so papers that are not indexed, or that carry no abstract, are set aside rather than penalised. It checks only the first eight citations per document, so in a document with many citations the later ones go unchecked. The sentence around a citation is treated as the claim, which can pull in neighbouring material when abbreviations blur where a sentence ends. Because the support measure is directional, it rewards a paper that covers the claim even within a long abstract, but it cannot confirm that the covering sentence in the abstract is about the same finding. G4 is therefore a triage aid: it points reviewers at the citations most worth verifying by hand.

Theoretical background

G4 operationalises the citation-precision idea from attribution evaluation, in which a citation is sound only when the cited source supports the part of the text it backs, and the scientific-claim-verification framing that labels evidence as supporting, contradicting or not-enough-information. The directional, weighted coverage measure is a lightweight, retrieval-style stand-in for that judgement: weighting named entities and numbers reflects the information-retrieval result that distinctive terms carry the discriminating signal, while normalising to the claim rather than to the union of claim and abstract follows the documented failure of symmetric overlap on long documents. The division of labour with the rest of the citation stack mirrors recent multi-stage verification pipelines, which separate retrieving the cited source from judging whether it supports the claim; G4 takes the support step, deterministically and without a language model, and leaves existence to L3 and internal screening to G1.

References

  1. Shi K, Sun W, Zhang Z, Sun L, Chawla NV, Ye Y. CiteAudit: you cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452. 2026. https://arxiv.org/abs/2602.23452
  2. Haan S. SemanticCite: citation verification with AI-powered full-text analysis and evidence-based reasoning. arXiv preprint arXiv:2511.16198. 2025. https://arxiv.org/abs/2511.16198
  3. Onweller H, Lumer E, Huber A, Ramchandani P, Subbiah VK, Feld C. Cited but not verified: parsing and evaluating source attribution in LLM deep research agents. arXiv preprint arXiv:2605.06635. 2026. https://arxiv.org/abs/2605.06635
  4. Wadden D, Lin S, Lo K, Wang LL, van Zuylen M, Cohan A, Hajishirzi H. Fact or fiction: verifying scientific claims. Proceedings of EMNLP 2020. 2020. https://arxiv.org/abs/2004.14974
  5. Nicholson JM, Mordaunt M, Lopez P, Uppala A, Rosati D, Rodrigues NP, Grabitz P, Rife SC. scite: a smart citation index that displays the context of citations and classifies their intent using deep learning. Quantitative Science Studies. 2021;2(3):882-898. DOI: 10.1162/qss_a_00146