ResAIKit
Research Integrity Toolkit
L3 | Text analysis | Verification | Layer 3

Citation Verify

Confirms that the references a document cites exist and are correctly identified, by looking each one up across the major scholarly databases and matching the cited details against the record, and additionally flags any cited paper recorded as retracted.

Technical description

L3 is the existence-and-identity layer of the citation stack. It confirms not only that a reference can be found but also that the cited title, authors and year match the record found, which separates a reference that merely exists from one that is what it claims to be. It runs on text of at least 200 words and checks up to twelve references. When a bibliography is present it parses the reference list and routes each entry through the shared verification engine; when no bibliography is found it falls back to verifying the identifiers and author-year citations in the running text. Located references carrying an identifier are additionally checked for retraction. Each problem reference is labelled with a fabrication category, and the per-reference contributions are summed into a score on a 0 to 5 scale, capped at 5.

How it works

The implementation runs at Layer 3 against external services, with a global timeout on the verification pass.

Reference isolation and parsing. The bibliography is located by a heading heuristic (References, Bibliography, Works Cited, and Romanian equivalents) and split into individual references, each parsed for authors, year, title and identifier. Up to twelve references are verified.
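The isolation step can be sketched as follows. This is a minimal illustration, not the toolkit's code: the function names, the entry-splitting rule and the Romanian heading forms (Bibliografie, Referințe) are assumptions, and only the year and a DOI-style identifier are extracted, whereas the real parser also recovers authors and title.

```python
import re

# Heading names follow the description above; the Romanian forms are assumed examples.
HEADING_RE = re.compile(
    r"^[ \t]*(references|bibliography|works cited|bibliografie|referin[tțţ]e)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
MAX_REFERENCES = 12  # the cap stated in the text


def isolate_references(text):
    """Return up to MAX_REFERENCES raw entries found after the last bibliography heading."""
    headings = list(HEADING_RE.finditer(text))
    if not headings:
        return []
    tail = text[headings[-1].end():]
    # Assumption: entries are separated by blank lines or begin with "1." or "[1]".
    parts = re.split(r"\n\s*\n|\n(?=\s*(?:\[\d+\]|\d+\.)\s)", tail)
    entries = [p.strip() for p in parts if p.strip()]
    return entries[:MAX_REFERENCES]


def parse_reference(entry):
    """Extract a year and a DOI where present; the raw text is kept for later matching."""
    doi = DOI_RE.search(entry)
    year = YEAR_RE.search(entry)
    return {
        "raw": entry,
        "year": int(year.group(0)) if year else None,
        "doi": doi.group(0).rstrip(".,;") if doi else None,
    }
```

An empty list from isolate_references is the signal that would trigger the in-text fallback.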

Verification engine. Each parsed reference is sent through the shared engine, which queries CrossRef, OpenAlex, PubMed and Semantic Scholar and computes a match quality between the cited reference and each candidate record from title, author and year agreement. The outcome is one of three states: found and matching (verified), found but with details that do not match (partial), or not found (unverified).
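One way to picture the match-quality step is a weighted blend of title similarity, first-author agreement and year agreement, thresholded into the three states. The weights, thresholds, field names and the use of difflib are all assumptions made for illustration; the source does not state how the engine combines the fields, and the network queries are left out entirely.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the real engine's cut-offs are not given in the text.
VERIFIED_THRESHOLD = 0.9
PARTIAL_THRESHOLD = 0.5


def _norm(s):
    """Lower-case and collapse whitespace so formatting differences do not count."""
    return " ".join(s.lower().split())


def match_quality(cited, record):
    """Blend title, first-author and year agreement into a value between 0 and 1."""
    title = SequenceMatcher(None, _norm(cited["title"]), _norm(record["title"])).ratio()
    author = 1.0 if _norm(cited["first_author"]) == _norm(record["first_author"]) else 0.0
    year = 1.0 if cited["year"] == record["year"] else 0.0
    # Assumed weights: the title dominates, author and year act as tie-breakers.
    return 0.6 * title + 0.25 * author + 0.15 * year


def classify(cited, candidates):
    """Return 'verified', 'partial' or 'unverified' based on the best candidate record."""
    if not candidates:
        return "unverified"
    best = max(match_quality(cited, c) for c in candidates)
    if best >= VERIFIED_THRESHOLD:
        return "verified"
    if best >= PARTIAL_THRESHOLD:
        return "partial"
    # Assumption: candidates that match this weakly are treated as not found.
    return "unverified"
```

With these assumed weights, a record that agrees on title and first author but not on year scores 0.85 and lands in the partial state, which is the behaviour the three-state outcome describes.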

Fabrication categories and scoring. Each verification carries a fabrication category derived from the match: a reference found nowhere is Total Fabrication, an identifier resolving to a different paper is Identifier Hijacking, a record whose details are partly wrong is Partial Attribute Corruption, a plausible entry matching nothing closely is Semantic Hallucination, and a stub is Placeholder Hallucination. Total Fabrication, Identifier Hijacking and Placeholder Hallucination each add 1.0; Partial Attribute Corruption and Semantic Hallucination each add 0.6. The finding quotes the engine's field-level warning, such as a year or first-author mismatch or that the identifier resolves to a different paper.
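The per-category weights above translate directly into a lookup table. In this sketch the weights come from the text, while the dictionary keys and the function name are hypothetical labels chosen for illustration.

```python
# Weights as stated in the text; the key spellings are assumed.
CATEGORY_WEIGHTS = {
    "total_fabrication": 1.0,
    "identifier_hijacking": 1.0,
    "placeholder_hallucination": 1.0,
    "partial_attribute_corruption": 0.6,
    "semantic_hallucination": 0.6,
}


def fabrication_contribution(categories):
    """Sum the weights of the problem references; None marks a clean reference."""
    return sum(CATEGORY_WEIGHTS[c] for c in categories if c is not None)
```

For example, a document with one hijacked identifier and one partly corrupted record among otherwise clean references contributes 1.0 + 0.6 before the retraction check and the cap are applied.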

Retraction check. For each located reference carrying an identifier, the identifier is checked against the research-integrity database; an exact match adds 0.7 and is reported as a retracted source, since citing retracted work as valid is its own integrity failure.
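A rough sketch of the exact-match step, assuming the identifiers are DOIs and that the retraction data is available as a local collection of DOIs; the real check consults the research-integrity database, and the function names here are hypothetical. DOIs are case-insensitive, so both sides are normalised before comparison.

```python
RETRACTION_WEIGHT = 0.7  # per-match contribution stated in the text

_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lower-case a DOI and strip a leading resolver URL or 'doi:' prefix."""
    d = doi.strip().lower()
    for prefix in _PREFIXES:
        if d.startswith(prefix):
            d = d[len(prefix):]
            break
    return d.strip()


def retraction_contribution(located_dois, retracted_dois):
    """Return the cited DOIs found in the retracted set and their score contribution."""
    retracted = {normalize_doi(d) for d in retracted_dois}
    # References without an identifier (None) are skipped, as in the description.
    hits = [d for d in located_dois if d and normalize_doi(d) in retracted]
    return hits, RETRACTION_WEIGHT * len(hits)
```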

In-text fallback. When no bibliography is found, identifiers and author-year citations in the text are verified through an existence cascade (CrossRef and Semantic Scholar by identifier, then PubMed, CrossRef and Semantic Scholar by search); a citation found nowhere is counted as Total Fabrication, and the score is the unverified fraction times five, plus the retraction contribution.
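The fallback arithmetic can be illustrated as below. The lookups are hypothetical callables standing in for the identifier and search queries, tried in order until one finds the citation. Applying the 5.0 cap here and returning 0.0 when no citations were extracted are assumptions, since the text does not address either case for this mode.

```python
def exists_anywhere(citation, lookups):
    """Try each lookup in order and stop at the first service that finds the citation."""
    return any(lookup(citation) for lookup in lookups)


def in_text_score(found_flags, retraction_contribution=0.0):
    """Score the fallback mode: unverified fraction times five, plus retractions.

    found_flags holds one bool per in-text citation (True means found somewhere).
    """
    if not found_flags:
        return 0.0  # assumption: nothing to verify yields no contribution
    unverified = sum(1 for found in found_flags if not found)
    return min(5.0, 5.0 * unverified / len(found_flags) + retraction_contribution)
```

With four in-text citations of which three are found nowhere, the unverified fraction is 0.75 and the score before any retraction contribution is 3.75.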

Aggregation. Contributions are summed and reported as min(5.0, total). The metadata returns the verified, partial, unverified and retracted counts, the total extracted and checked, the fabrication_categories map and the mode (bibliography or in-text).
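A minimal sketch of the aggregation step, assuming each checked reference is represented as a dictionary. Apart from fabrication_categories, which the text names, the field and key names are illustrative guesses rather than the toolkit's actual schema.

```python
from collections import Counter


def aggregate(results, total_extracted, mode="bibliography"):
    """Combine per-reference results into a capped score plus summary metadata.

    Each result is a dict with a 'state' and optional 'category',
    'contribution' and 'retracted' entries (hypothetical field names).
    """
    total = sum(r.get("contribution", 0.0) for r in results)
    states = Counter(r["state"] for r in results)
    categories = Counter(r["category"] for r in results if r.get("category"))
    return {
        "score": min(5.0, total),  # the cap described above
        "verified": states["verified"],
        "partial": states["partial"],
        "unverified": states["unverified"],
        "retracted": sum(1 for r in results if r.get("retracted")),
        "total_extracted": total_extracted,
        "total_checked": len(results),
        "fabrication_categories": dict(categories),
        "mode": mode,
    }
```

Because of the cap, six references that each contribute 1.0 already saturate the score, so the counts in the metadata carry the detail that the score alone cannot.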

Score thresholds

Score     Meaning
0 to 1    Every checked reference was found and matched its cited details, with no retractions.
2 to 3    A minority of references could not be confirmed, did not match the record, or were retracted.
4 to 5    A large share of references could not be found or were misidentified, a strong sign of fabricated or mishandled citations.

Why this matters

A fabricated citation is dangerous precisely because nothing on the page gives it away: it can carry plausible authors, a real-sounding journal and a well-formed identifier, and only a lookup reveals the problem. The most deceptive case is the reference that resolves to the wrong paper. When a citation carries a valid identifier pointing to a real but unrelated work, a reviewer who clicks it sees a genuine paper and assumes the citation is sound. A study that coded one hundred fabricated citations from accepted conference papers found that a substantial share of fabrications carried valid identifiers pointing to real papers, exactly the case a plain existence check waves through; matching the cited details against the record is what catches these. The same lookup makes a second problem visible: a reference can be real, correctly cited and retracted, and continuing to cite retracted work as though it still stands is its own integrity failure. L3 confirms the internal red flags against the record, while the citation-support indicator goes one step further and asks whether a real cited paper actually supports the claim.

Limitations

L3 depends on external services, so it is slower than the Layer 1 checks and is affected by database downtime and rate limits. It verifies only the first twelve references per document. Coverage is not total: a genuine reference that is very new, published in a niche venue, or indexed under a different form can fail to match and be reported as unconfirmed, so an unmatched reference is a prompt to check by hand rather than proof of fabrication. The matching is deliberately strict, favouring the catching of fabrications over the passing of borderline real entries. The retraction check reflects what is recorded in the research-integrity database at the time of lookup, so a very recent retraction may not yet appear. Reference isolation and parsing recognise common heading and citation conventions only, so unusual formats reduce coverage.

Theoretical background

L3 operationalises the metadata-matching approach that contemporary citation-checking tools converge on: a reference is verified not when some record is returned but when the cited title, authors and year match the record at its identifier. This catches the chimera reference, a valid identifier paired with mismatched metadata, which the NeurIPS-2025 fabricated-citation study found to be a substantial fraction of fabrications and which field-level checkers flag as an identifier conflict. The three-state outcome, verified, partial, unverified, follows the field-level adjudication framing of recent multi-agent citation detectors, which separate found-and-matching from found-but-mismatched from not-found. The multi-source design, querying CrossRef, OpenAlex, PubMed and Semantic Scholar, reflects the multi-database practice of open citation-validation tools. The retraction layer adds an integrity dimension absent from existence checking, drawing on the curated research-integrity record to flag real but withdrawn work.

References

  1. Ansari MS. Compound deception in elite peer review: a failure mode taxonomy of 100 fabricated citations at NeurIPS 2025. arXiv preprint arXiv:2602.05930. 2026. https://arxiv.org/abs/2602.05930
  2. Abbonato D. CheckIfExist: detecting citation hallucinations in the era of AI-generated content. arXiv preprint arXiv:2602.15871. 2026. https://arxiv.org/abs/2602.15871
  3. Li M, Lin Z, Ma S. Source or it didn't happen: a multi-agent framework for citation hallucination detection. arXiv preprint arXiv:2605.08583. 2026. https://arxiv.org/abs/2605.08583
  4. Russinovich M. RefChecker: a tool that validates academic paper references. 2025. https://github.com/markrussinovich/refchecker
  5. The Center for Scientific Integrity. The Retraction Watch Database. 2018 onward. https://retractionwatch.com/