Anchoring Lack
Detects claims that lack concrete factual support in their immediate context: citations, data references, figure/table anchors, or traceable Results antecedents.
Technical description
C4 operationalises the "anchor density" dimension of the Anti-AI Vibe Review spec (A1, A2, A4, B3). It asks three questions of every claim-bearing sentence: (1) does the sentence carry a support marker (citation, numeric data, figure/table reference) in its immediate neighbourhood, (2) does the sentence pack citations onto trivially short content, and (3) when the document has identifiable Results and Discussion sections, does each Discussion claim trace back to a concrete finding reported in Results. The three sub-checks operate independently, and their contributions are summed and clamped into a single 0 to 5 score. Sub-checks 1 and 3 add 0.5 per incident up to their own per-sub-check caps (3.0 and 2.0 respectively); sub-check 2 adds 0.3 per incident without a separate cap, because overcitation is rare enough that it is unlikely to saturate the score on its own. The indicator runs at Layer 1 using only pattern-matching dictionaries, citation patterns, and the shared IMRaD (Introduction, Methods, Results and Discussion) section detector; it does not call external application programming interfaces (APIs) or language models.
How it works
The implementation is deterministic and runs at Layer 1. Sentence segmentation and word counting are provided by the shared preprocessor. The IMRaD section partition used by sub-check 3 comes from app.engine.section_detector.classify_imrad_sections(). The claim-verb dictionary is loaded per-language from data/dictionaries/claim_verbs_*.json.
Sub-check 1, unsupported claim detection. The text is sentence-split and each sentence is tested for the presence of a claim phrase ("studies show", "research demonstrates", "it was found", "results suggest", and their translations into the effective language). When a claim phrase is found, the indicator builds an adjacency window spanning the claim sentence plus one sentence before and one after. If none of the support patterns fire in that window, the claim is flagged as unsupported. The support patterns are: Harvard-style citations ("(Author et al., 2024)"), numeric citations ("[1]"), sample-size declarations ("N = 120"), p-values ("p < 0.05"), percentages ("35%"), and figure/table references ("Fig. 1", "Table 2"). Each unsupported claim contributes +0.5 to the score, capped at a maximum contribution of 3.0 from this sub-check alone. The cap ensures that a document consisting entirely of unsupported claims does not consume the full score range, leaving headroom for the other sub-checks.
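The adjacency-window logic described above can be sketched as follows. This is a minimal illustration, not the shipped implementation: the regexes, the abbreviated `CLAIM_PHRASES` tuple, and the function names are all assumptions made for the example, and the input is assumed to be already sentence-split by the preprocessor.

```python
import re

# Hypothetical subset of claim phrases; the real dictionary is loaded per language.
CLAIM_PHRASES = ("studies show", "research demonstrates", "it was found", "results suggest")

# Illustrative regexes, one per support marker named in the text (assumed patterns).
SUPPORT_PATTERNS = [
    re.compile(r"\([A-Z][^()]*,\s*\d{4}[a-z]?\)"),        # Harvard: (Author et al., 2024)
    re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]"),             # numeric: [1], [1, 2], [3-5]
    re.compile(r"\bN\s*=\s*\d+", re.IGNORECASE),          # sample size: N = 120
    re.compile(r"\bp\s*[<=>]\s*0?\.\d+", re.IGNORECASE),  # p-value: p < 0.05
    re.compile(r"\d+(?:\.\d+)?\s*%"),                     # percentage: 35%
    re.compile(r"\b(?:Fig(?:ure)?\.?|Table)\s*\d+"),      # Fig. 1, Table 2
]


def has_support(text: str) -> bool:
    """True if any support pattern fires anywhere in the text."""
    return any(p.search(text) for p in SUPPORT_PATTERNS)


def unsupported_claims(sentences: list[str]) -> list[int]:
    """Return indices of claim sentences with no support within +/-1 sentence."""
    flagged = []
    for i, sentence in enumerate(sentences):
        if not any(phrase in sentence.lower() for phrase in CLAIM_PHRASES):
            continue  # not a claim-bearing sentence
        window = " ".join(sentences[max(0, i - 1): i + 2])
        if not has_support(window):
            flagged.append(i)
    return flagged


def subcheck1_score(sentences: list[str]) -> float:
    """+0.5 per unsupported claim, capped at 3.0."""
    return min(0.5 * len(unsupported_claims(sentences)), 3.0)
```

In this sketch a claim followed two sentences later by "See Table 2." is still flagged, because the table reference falls outside the one-sentence radius.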
Sub-check 2, overcitation detection. Each sentence is scanned for citations using the Harvard and numeric citation patterns. When a sentence contains three or more citations and fewer than 15 content words after the citation strings are stripped out, it is flagged as potential overcitation: citation padding that signals AI-typical "cite everything" behaviour in lieu of genuine scholarly engagement. Each incident contributes +0.3 to the score.
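A minimal sketch of the overcitation test follows. Several details are assumptions made for illustration: the citation regex, the treatment of each regex match as one citation (so a grouped "[1, 2, 3]" counts once here), and the small hypothetical `STOPWORDS` set used to approximate "content words".

```python
import re

# Assumed citation pattern: Harvard-style or bracketed numeric.
CITATION = re.compile(
    r"\([A-Z][^()]*,\s*\d{4}[a-z]?\)"   # (Author et al., 2024)
    r"|\[\d+(?:\s*[,-]\s*\d+)*\]"       # [1], [1, 2], [3-5]
)
WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")

# Hypothetical stopword list; the real definition of a content word may differ.
STOPWORDS = frozenset({
    "a", "an", "the", "of", "and", "or", "in", "on", "to", "by", "for",
    "with", "is", "are", "has", "have", "been",
})


def content_word_count(sentence: str) -> int:
    """Count non-stopword tokens after citation strings are stripped out."""
    stripped = CITATION.sub(" ", sentence)
    return sum(1 for w in WORD.findall(stripped) if w.lower() not in STOPWORDS)


def is_overcited(sentence: str, min_citations: int = 3, max_content_words: int = 15) -> bool:
    """Three or more citations on fewer than 15 content words."""
    n_citations = len(CITATION.findall(sentence))
    return n_citations >= min_citations and content_word_count(sentence) < max_content_words


def subcheck2_score(sentences: list[str]) -> float:
    """+0.3 per overcited sentence, with no separate cap."""
    return 0.3 * sum(is_overcited(s) for s in sentences)
```

With these assumptions, "Deep learning is widely used [1] [2] [3]." is flagged, while the same sentence with a single citation is not.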
Sub-check 3, internal traceability (Discussion → Results). When the document contains both a Results section and a Discussion and/or Conclusions section recognised by the shared IMRaD classifier, the indicator extracts a set of numeric anchors from the Results body: p-values, percentages, sample sizes, effect sizes, figure/table references, confidence intervals, and specific decimal values. Each claim-bearing sentence in the merged Discussion+Conclusions body is then checked for at least one concrete tether to Results: either (a) a citation, (b) a figure/table reference that also appears in Results, or (c) any other numeric anchor whose normalised form appears in the Results anchor set. A Discussion claim that lacks all three signals is flagged as untethered, contributing +0.5 to the score, capped at 2.0 from this sub-check. The cap prevents overly long Discussion sections with many generic sentences from dominating the score; the cap is lower than sub-check 1's because a Discussion that paraphrases Results in natural language without restating the raw numbers can still be academically legitimate, and the indicator should not penalise that as severely as the absence of any local support.
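The Discussion-to-Results tethering described above can be sketched as follows. The regexes cover only a subset of the anchor types listed (confidence intervals and effect sizes are omitted), and the normalisation rule (strip whitespace, lowercase) is an assumption, as are all names. Section bodies are assumed to have been separated already by the IMRaD classifier.

```python
import re

# Hypothetical subset of claim phrases and an assumed citation pattern.
CLAIM_PHRASES = ("studies show", "research demonstrates", "it was found", "results suggest")
CITATION = re.compile(r"\([A-Z][^()]*,\s*\d{4}[a-z]?\)|\[\d+(?:\s*[,-]\s*\d+)*\]")

# Assumed anchor patterns; alternation order matters (percentages before bare decimals).
ANCHOR = re.compile(
    r"\bp\s*[<=>]\s*0?\.\d+"                # p-values
    r"|\bN\s*=\s*\d+"                       # sample sizes
    r"|\b(?:Fig(?:ure)?\.?|Table)\s*\d+"    # figure/table references
    r"|\d+(?:\.\d+)?\s*%"                   # percentages
    r"|\d+\.\d+",                           # specific decimal values
    re.IGNORECASE,
)


def normalise(anchor: str) -> str:
    """Assumed normal form: no whitespace, lowercase."""
    return re.sub(r"\s+", "", anchor).lower()


def anchors(text: str) -> set[str]:
    """Collect the normalised anchors found in a block of text."""
    return {normalise(m) for m in ANCHOR.findall(text)}


def untethered_claims(results: str, discussion_sentences: list[str]) -> list[str]:
    """Return Discussion claims with no citation and no anchor shared with Results."""
    results_anchors = anchors(results)
    flagged = []
    for sentence in discussion_sentences:
        if not any(phrase in sentence.lower() for phrase in CLAIM_PHRASES):
            continue  # not a claim-bearing sentence
        if CITATION.search(sentence):
            continue  # signal (a): a citation
        if anchors(sentence) & results_anchors:
            continue  # signals (b) and (c): shared figure/table or numeric anchor
        flagged.append(sentence)
    return flagged


def subcheck3_score(results: str, discussion_sentences: list[str]) -> float:
    """+0.5 per untethered claim, capped at 2.0."""
    return min(0.5 * len(untethered_claims(results, discussion_sentences)), 2.0)
```

A Discussion sentence that repeats "35%" from Results passes in this sketch, while a claim phrased entirely in general terms is flagged.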
The three contributions are summed and clamped at 5.0.
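The aggregation can be written out as a short function. The function and parameter names are hypothetical; the increments and caps are the ones stated above.

```python
def c4_score(unsupported: int, overcited: int, untethered: int) -> float:
    """Combine incident counts from the three sub-checks into a 0 to 5 score."""
    sub1 = min(0.5 * unsupported, 3.0)   # unsupported claims, capped at 3.0
    sub2 = 0.3 * overcited               # overcitation, no separate cap
    sub3 = min(0.5 * untethered, 2.0)    # untethered Discussion claims, capped at 2.0
    return min(sub1 + sub2 + sub3, 5.0)  # final clamp
```

Because the two caps already add up to 5.0, overcitation incidents only change the final score when at least one of the other sub-checks is below its cap.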
Why this matters
Generative language models produce text that reads fluently without being tethered to specific facts. Lund and colleagues documented that academic text generated by large language models (LLMs) sounds plausible but systematically lacks the granular detail (sample sizes, effect magnitudes, instrument names, exclusion criteria) that characterises genuine scientific writing [1]. Fleckenstein and colleagues showed in a controlled experiment that even experienced teachers cannot reliably distinguish AI-generated essays from student writing, precisely because the surface fluency of machine-generated prose masks the absence of underlying evidence [2]. The anchoring problem is therefore a detection vulnerability: a reviewer who does not notice the missing anchors may rate an AI-generated manuscript as highly as a human-written one that includes them.
Sub-check 1 (unsupported local claims) targets the most direct manifestation: a sentence that asserts "studies show X" without naming a study, a year, or a finding. Sub-check 2 (overcitation) targets the compensatory behaviour: models that learn to cite extensively in the hope that sheer volume will substitute for genuine engagement. Sub-check 3 (traceability) targets the structural manifestation across IMRaD sections: a Discussion that draws conclusions without pointing to the specific Results that justify them.
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Claims are well-anchored: most claim-bearing sentences cite sources or data, overcitation is absent, and Discussion claims trace back to specific Results findings. Typical of original scientific prose with proper referencing. |
| 2 to 3 | Moderate anchoring gaps: several unsupported claims detected, possibly one overcitation incident, or Discussion claims that paraphrase Results without citing specific figures or values. Common in first drafts and review articles that summarise broad literatures. |
| 4 to 5 | Severe anchoring deficit: the majority of claims float without any local support, the text relies on generic authority phrases ("it is well established", "prior research suggests") without citing specific work, and Discussion claims are disconnected from the actual Results. Highly consistent with AI-generated text and with pseudo-academic writing that lacks scholarly infrastructure. |
Limitations
The claim-verb dictionary is finite and static. A text that expresses claims through novel constructions ("We provide evidence that...", "The present data indicate...") will not match and will not be checked for support, passing through undetected. The adjacency-window approach in sub-check 1 uses a fixed ±1 sentence radius; a claim whose support appears two sentences away is incorrectly flagged, and a claim adjacent to a false-positive support match (a date in a different context, e.g. "the year 2024 saw…") is incorrectly passed.
Overcitation detection depends on accurate citation counting and a static 15-word content threshold. Short sentences that legitimately need three citations (e.g. "Prior studies using cognitive-behavioural therapy [1], mindfulness-based stress reduction [2], and pharmacological intervention [3] have all reported moderate effect sizes") may be caught by this heuristic whenever their content-word count falls below 15. The threshold is deliberately set low to favour recall over precision on this sub-check, since overcitation is rare and a false positive here has low severity ("info").
Sub-check 3 requires the document to have both a Results and a Discussion/Conclusions section identified by the shared IMRaD classifier. Documents without an explicit Results heading, or whose headings do not match the classifier's pattern matching (e.g. "Empirical Findings", "Data Analysis"), will skip the traceability check entirely. The anchor-matching logic uses exact string matching on normalised forms, so a Discussion that reports "a 25% reduction" while Results reports "25.0%" will not register a match. The numerical comparison is deliberately conservative to avoid false-positive anchor matches on coincidental numeric similarity; the cost is lower recall on legitimately tethered claims.
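The exact-match limitation can be illustrated with a few lines. The normalisation used here (strip whitespace, lowercase) is an assumption; the point is only that string equality on normal forms does not treat numerically equal values as the same anchor.

```python
import re


def normalise(anchor: str) -> str:
    """Assumed normal form: no whitespace, lowercase."""
    return re.sub(r"\s+", "", anchor).lower()


def is_exact_match(discussion_anchor: str, results_anchors: list[str]) -> bool:
    """String equality on normalised forms, with no numeric comparison."""
    return normalise(discussion_anchor) in {normalise(a) for a in results_anchors}


print(is_exact_match("25%", ["25.0%"]))     # False: same value, different string
print(is_exact_match("25.0 %", ["25.0%"]))  # True: only whitespace differs
```

A numeric comparison would raise recall on cases like this, at the cost of the coincidental matches the conservative design is meant to avoid.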
Theoretical background
The anchoring concept in C4 derives from two independent traditions. In the scientific-writing assessment literature, "anchoring" refers to the density of verifiable references (citations, data points, instrument names, protocol identifiers) that tie a manuscript to the empirical world. A text with low anchor density makes claims that cannot be checked, replicated, or traced to evidence. This is distinct from the psychological "anchoring bias" (Tversky and Kahneman, 1974); the shared term is coincidental.
In the AI-detection literature, the anchoring gap is one of the most robust single discriminators between human and machine-generated academic prose. Liang and colleagues, in a corpus-level analysis of 1.12 million scientific papers, showed that LLM usage has measurably altered the word-frequency and phrasing distributions of post-2022 academic writing [3]. The shift is most pronounced in the parts of a paper that carry the least concrete information (introductions and discussions), precisely where anchoring is weakest. C4 operationalises this observation at the sentence level rather than the corpus level: where Liang et al. measure aggregate vocabulary shift, C4 flags individual sentences whose sparse anchor profile is consistent with the aggregate pattern.
References
1. Lund BD, Wang T, Mannuru NR, Nie B, Shimray S, Wang Z. ChatGPT and a new academic reality: artificial intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology. 2023;74(5):570-581. DOI: 10.1002/asi.24750 https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.24750
2. Fleckenstein J, Meyer J, Jansen T, Keller SD, Köller O, Möller J. Do teachers spot AI? Evaluating the detectability of AI-generated texts among experts and novices. Computers and Education: Artificial Intelligence. 2024;6:100209. DOI: 10.1016/j.caeai.2024.100209 https://www.sciencedirect.com/science/article/pii/S2666920X24000013
3. Liang W, Zhang Y, Wu Z, Lepp H, Ji W, Zhao X, Cao H, Liu S, He S, Huang Z, Yang D, Potts C, Manning CD, Zou J. Quantifying large language model usage in scientific papers. Nature Human Behaviour. 2025. DOI: 10.1038/s41562-025-02273-8 https://www.nature.com/articles/s41562-025-02273-8