ResAIKit
Research Integrity Toolkit
L4-FactCheck · Text analysis · LLM · Layer 4

Fact Check

Uses a language model to weigh the factual claims in a text, scoring internal contradiction, numerical and temporal plausibility, and conflict with established knowledge, and combining them into a weighted score. It works from the model's own knowledge and is built to abstain rather than guess.

Technical description

L4-FactCheck is the semantic, model-based counterpart to the deterministic fact-hallucination check. It works from the model's knowledge without browsing, and that constraint shapes the design. A model reliably judges whether a text contradicts itself and whether its numbers and dates hang together, but it judges specialised or very recent facts much less reliably, so the rubric weights and the abstention rule are built around that split. The check runs at Layer 4 when a model is configured and the text is at least 100 words. It sends the first 8000 characters of the document together with the current date and a three-dimension rubric, then aggregates the dimension scores into a weighted score from 0 to 5.
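The run conditions and truncation described above can be sketched as follows. This is a minimal illustration, not the toolkit's code: the function name build_request, the payload keys, and counting words by whitespace splitting are all assumptions made for the example. Only the 100-word minimum, the 8000-character limit, the current date, and the three dimensions come from the description.

```python
from datetime import date

MIN_WORDS = 100   # from the description: the text must be at least 100 words
MAX_CHARS = 8000  # from the description: only the first 8000 characters are sent


def build_request(text, model_configured, today=None):
    """Return a payload for the model, or None when the check should not run.

    Hypothetical helper: the name, the payload keys, and whitespace-based
    word counting are assumptions for illustration.
    """
    if not model_configured:
        return None
    if len(text.split()) < MIN_WORDS:
        return None
    today = today or date.today()
    return {
        "document": text[:MAX_CHARS],          # a long document is cut to its opening
        "current_date": today.isoformat(),     # supplied for temporal reasoning
        "dimensions": ["contradiction", "plausibility", "consensus"],
    }
```

Returning None for a short text or a missing model keeps the skip decision in one place, so the caller can treat "did not run" separately from "ran and found nothing".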

How it works

The model is given the text, the current date for temporal reasoning, and a rubric of three dimensions scored 0 to 5 independently.

Internal contradiction asks whether statements in the document conflict; this is grounded entirely in the text, so a model judges it well. Plausibility asks whether numerical claims are reasonable for what they describe (prevalence rates, effect sizes) and whether dated claims are anachronistic relative to the supplied date. Consensus asks whether a claim conflicts with well-established science; this is the dimension a model is least sure of from memory, so it is instructed to flag a claim here only when confident and to stay silent otherwise.
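One way to picture the rubric is as a small mapping from dimension to the question the model is asked, rendered into a prompt alongside the date and the text. The sketch below is an assumption throughout: the names RUBRIC and render_prompt and the prompt wording are illustrative paraphrases of the description above, not the prompt the toolkit actually sends.

```python
# Hypothetical rubric: the keys and question wording paraphrase the description.
RUBRIC = {
    "contradiction": "Do statements in the document conflict with one another?",
    "plausibility": (
        "Are numerical claims reasonable for what they describe, and are "
        "dated claims anachronistic relative to the current date?"
    ),
    "consensus": (
        "Does a claim conflict with well-established science? "
        "Flag a claim only when confident; otherwise stay silent."
    ),
}


def render_prompt(document, current_date):
    """Assemble an illustrative prompt from the date, the rubric, and the text."""
    lines = [
        f"Current date: {current_date}",
        "Score each dimension independently from 0 to 5.",
        "",
    ]
    for key, question in RUBRIC.items():
        lines.append(f"- {key}: {question}")
    lines += ["", "Document:", document]
    return "\n".join(lines)
```

Keeping the confidence instruction inside the consensus question mirrors the design choice that this dimension, unlike the other two, is gated on the model's certainty.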

Abstention. The prompt instructs the model to list a finding only when it is reasonably confident that the finding is a real problem. As a code-level backstop, any finding the model marks low-confidence is dropped, which removes the false positives that arise when a model questions a true claim.
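The code-level backstop could look like the filter below. It is a sketch under stated assumptions: findings are taken to be dictionaries with a "confidence" field whose low value is the string "low", and the function name drop_low_confidence is hypothetical. The description only says that findings marked low-confidence are dropped, so findings with no confidence mark are kept here.

```python
def drop_low_confidence(findings):
    """Keep every finding the model did not mark as low-confidence.

    Assumed shape: each finding is a dict with an optional "confidence"
    field; the field name and the value "low" are illustrative.
    """
    kept = []
    for finding in findings:
        confidence = str(finding.get("confidence", "")).strip().lower()
        if confidence != "low":
            kept.append(finding)
    return kept
```

For example, given one finding marked "high", one marked "low", and one with no mark, the filter returns the first and the third.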

Aggregation. The dimension scores sᵢ are combined by a weighted mean over the dimensions the model returned, with weights wᵢ of 0.35 for contradiction, 0.30 for plausibility, and 0.35 for consensus: score = Σ wᵢ sᵢ / Σ wᵢ, clamped to the range 0 to 5. When the model returns a single overall score instead of the rubric, that value is used directly. Each surviving finding is labelled with its dimension (Internal contradiction, Plausibility, Consensus conflict), and the per-dimension sub-scores and a summary are returned in the metadata.
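The aggregation rule can be written out in a few lines. The weights and the formula are taken from the paragraph above; the function name aggregate, the dictionary keys, and the choice to return None when the model supplies neither dimension scores nor an overall score are assumptions for illustration.

```python
# Weights as stated in the description.
WEIGHTS = {"contradiction": 0.35, "plausibility": 0.30, "consensus": 0.35}


def aggregate(dimension_scores, overall=None):
    """Weighted mean over the dimensions returned, clamped to 0..5.

    Falls back to the model's single overall score when no rubric
    dimension is present. Returning None when both are missing is an
    assumption, not documented behaviour.
    """
    present = {
        name: score
        for name, score in dimension_scores.items()
        if name in WEIGHTS and score is not None
    }
    if present:
        total_weight = sum(WEIGHTS[name] for name in present)
        mean = sum(WEIGHTS[name] * score for name, score in present.items()) / total_weight
        return max(0.0, min(5.0, mean))
    if overall is not None:
        return float(overall)  # used directly, per the description
    return None
```

With scores of 4 for contradiction, 2 for plausibility, and 1 for consensus, the weighted sum is 1.40 + 0.60 + 0.35 = 2.35 over a total weight of 1.00. If only contradiction and plausibility are returned, the same two terms are divided by 0.65 instead, so a missing dimension renormalises the weights rather than counting as zero.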

Score thresholds

Score     Meaning
0 to 1    Claims are consistent with one another and with established knowledge.
2 to 3    A few doubtful, implausible, or internally inconsistent claims.
4 to 5    Multiple claims that are contradictory or conflict with well-established facts.

Why this matters

Confident factual invention is the signature failure of language models, and the costliest case in a manuscript is the claim that is plausible, well-phrased, and wrong. A surface check can flag a claim that lacks a citation, but it cannot tell whether the claim is false or whether two statements quietly contradict each other; that requires reading for meaning. A model supplies that reading without any lookup, and recent work shows that its parametric knowledge carries real signal: internal certainty paired with consistency separates sound claims from confident inventions. The same work is equally clear about the limits. Models are poorly calibrated and tend to overconfidence, their accuracy falls as the text grows longer, and the reliable move when knowledge runs out is to abstain rather than guess. L4-FactCheck is built to that shape: it puts most of its weight on the contradictions and implausibilities a model catches well, treats settled-science conflicts as a confidence-gated signal, and concentrates a reviewer's attention rather than delivering a verdict.

Limitations

L4-FactCheck judges facts with a language model and does not browse, so its judgements come from the model's internal knowledge, with all of that knowledge's gaps. It can miss false claims and question true ones, most of all on very recent or highly specialised material, which is why the consensus dimension is confidence-gated and the indicator is tuned to abstain rather than over-flag. Its reliability is best on internal contradictions and numerical plausibility, which are grounded in the text, and weakest on external truth. Accuracy also falls on longer passages, and the text is truncated to 8000 characters, so a long document is judged on its opening. It is slower and costlier than the deterministic checks, results vary between runs, and it is a prompt for human verification rather than a final ruling.

Theoretical background

L4-FactCheck rests on the literature on fact-checking with language models without retrieval, which verifies a claim against the model's intrinsic knowledge and reports both its promise and its limits. A central result is that pairing internal certainty with reasoning consistency separates sound claims from overconfident hallucination; the rubric's emphasis on internal contradiction operationalises the consistency half of that pairing, and the confidence-gating operationalises the calibration half. The abstention rule follows the benchmarking work on knowledge reliability, which rewards a model for declining to answer when uncertain rather than guessing. The decision to give contradiction and plausibility together the majority of the weight (0.65 combined, against 0.35 for consensus) follows the finding that parametric verification accuracy is high on text-grounded judgements and falls on specialised external facts, especially as context length grows.

References

  1. Vazhentsev A, Marina M, Moskovskiy D, et al. Leveraging LLM parametric knowledge for fact checking without retrieval. arXiv preprint arXiv:2603.05471. 2026. https://arxiv.org/abs/2603.05471
  2. Wang H, Khalid M, Wu Q, Gao J, Cao C. Fact-checking with large language models via probabilistic certainty and consistency. arXiv preprint arXiv:2601.02574. 2026. https://arxiv.org/abs/2601.02574
  3. Mündler N, He J, Jenko S, Vechev M. Self-contradictory hallucinations of large language models: evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852. 2023. https://arxiv.org/abs/2305.15852
  4. Jackson D, Keating W, Cameron G, Hill-Smith M. AA-Omniscience: evaluating cross-domain knowledge reliability in large language models. arXiv preprint arXiv:2511.13029. 2025. https://arxiv.org/abs/2511.13029
  5. Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, Ishii E, Bang Y, Madotto A, Fung P. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1-38. DOI: 10.1145/3571730