ResAIKit
Research Integrity Toolkit
C1 | Text analysis | Stylistic | Layer 1 (Deterministic)

Generality

Detects text that sounds authoritative but says nothing checkable: claims with no quantitative anchors, abstract noun-heavy phrasing, and confident assertions backed by no data.

Technical description

C1 measures how much verifiable substance a passage carries, across three independent signals: the share of sentences that carry an empirical anchor, the ratio of abstract nominalizations to concrete verbs, and the number of strong epistemic claims left without any supporting anchor. It does not penalise generality of topic; it asks whether each assertion is tied to something a reader could check. The three sub-checks operate independently and their contributions sum into a single score on a 0 to 5 scale. The maximum contributions are 1.5, 1.0 and 1.5, so the raw total cannot exceed 4.0 and the final clamp at 5.0 acts only as a safeguard. Texts below 100 words are not scored, because anchor density and the nominalization ratio both require enough material to stabilise. The anchor set defined here is shared with C4.

How it works

The implementation is deterministic and runs at Layer 1. The text is split into sentences by a language-aware splitter, lower-cased and tokenised once, and the three sub-checks reuse that shared preprocessing.

The backbone of the indicator is the notion of an anchor: a span that ties a claim to something verifiable. A sentence counts as anchored if it contains at least one match for any of eight regular expressions: sample sizes (N\s*=\s*\d+), p-values (p\s*[<=]\s*0\.\d+), percentages (\d+\.?\d*\s*%), confidence intervals (CI\s*[\[(]), measurements with units (\d+\.?\d*\s*(mg|kg|ml|mm|cm|m|Hz|kHz|°C|dB)), author-year citations (\([A-Z][a-z]+(?:\s+et\s+al\.?)?,?\s*\d{4}\)), numeric citations (\[\d+(?:[,;]\s*\d+)*\]), and figure or table references ((?:Fig\.|Figure|Table|Tabel)\s*\d+).
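The eight patterns can be transcribed into Python as a minimal sketch. The names ANCHOR_PATTERNS, count_anchors and is_anchored are hypothetical, and the sketch assumes the patterns are applied case-sensitively to the original (not lower-cased) sentence, which the description does not state.

```python
import re

# The eight anchor patterns, copied from the description above.
ANCHOR_PATTERNS = [
    r"N\s*=\s*\d+",                                  # sample sizes
    r"p\s*[<=]\s*0\.\d+",                            # p-values
    r"\d+\.?\d*\s*%",                                # percentages
    r"CI\s*[\[(]",                                   # confidence intervals
    r"\d+\.?\d*\s*(mg|kg|ml|mm|cm|m|Hz|kHz|°C|dB)",  # measurements with units
    r"\([A-Z][a-z]+(?:\s+et\s+al\.?)?,?\s*\d{4}\)",  # author-year citations
    r"\[\d+(?:[,;]\s*\d+)*\]",                       # numeric citations
    r"(?:Fig\.|Figure|Table|Tabel)\s*\d+",           # figure or table references
]
ANCHOR_RES = [re.compile(p) for p in ANCHOR_PATTERNS]


def count_anchors(sentence: str) -> int:
    """Count non-overlapping matches of every anchor pattern in one sentence."""
    return sum(1 for rx in ANCHOR_RES for _ in rx.finditer(sentence))


def is_anchored(sentence: str) -> bool:
    """A sentence is anchored if at least one pattern matches."""
    return count_anchors(sentence) > 0
```

Under these assumptions, "We recruited N = 120 participants (Smith et al., 2020)." yields two matches (sample size and author-year citation), while "This approach demonstrates clear benefits." yields none. As written, the unit pattern has no word boundary, so "It took 5 minutes." also matches through the bare m alternative.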

Sub-check 1, anchor density. For each sentence the number of anchor matches is counted, and the anchor density is the number of sentences with at least one anchor divided by the total number of sentences (the denominator floored at 1). When this density falls below 0.08, meaning fewer than about one sentence in twelve is tied to anything concrete, the sub-check fires and contributes +1.5 to the score, attaching a warning to the first anchorless sentence. The 0.08 floor is deliberately conservative; results-and-methods writing carries far higher anchor densities, so a value this low across a whole document is a strong generality signal rather than a section-specific artefact.
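The density rule can be sketched as follows, taking the per-sentence anchor counts as input. The function name and the shape of the returned dictionary are assumptions made for illustration; only the 0.08 floor, the +1.5 weight and the floored denominator come from the description.

```python
def density_subcheck(anchor_counts, floor=0.08, weight=1.5):
    """Sub-check 1 sketch: anchor_counts holds one match count per sentence."""
    anchored = sum(1 for c in anchor_counts if c > 0)
    density = anchored / max(len(anchor_counts), 1)  # denominator floored at 1
    fired = density < floor
    # The warning attaches to the first sentence without any anchor.
    first_anchorless = next(
        (i for i, c in enumerate(anchor_counts) if c == 0), None
    )
    return {
        "anchor_density": density,
        "contribution": weight if fired else 0.0,
        "warn_sentence_index": first_anchorless if fired else None,
    }
```

One anchored sentence out of twelve gives a density of about 0.083 and does not fire; one out of thirteen gives about 0.077 and does.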

Sub-check 2, nominalization-to-verb ratio. Abstract prose tends to convert processes into nouns ("implementation", "utilization", "consideration") rather than stating them as verbs. The lower-cased tokens are matched against a per-language dictionary of nominalizations to obtain the nominalization count, and the verb count is approximated by a per-language suffix regular expression (for English, endings such as -ed, -ing, -ize, -ate, -ify, -ies and -en), discarding any match that is itself a nominalization. The ratio is the nominalization count divided by the verb count, with the denominator floored at 1. When the ratio exceeds 3.0 and at least one nominalization is present, the sub-check fires and contributes +1.0. Suffix-based verb detection is approximate by design: it over-counts gerunds and participles and under-counts irregular verbs, which is why the threshold is set high enough to tolerate that noise.
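A minimal sketch of the ratio for English follows. The three-word NOMINALIZATIONS set stands in for the real per-language dictionary, the suffix expression is built only from the endings listed above, and the sketch assumes the floor of 1 applies to the verb count in the denominator.

```python
import re

# Hypothetical stand-in for the curated per-language dictionary.
NOMINALIZATIONS = {"implementation", "utilization", "consideration"}

# English verb approximation built from the endings named in the text.
VERB_SUFFIX_RE = re.compile(r"(?:ed|ing|ize|ate|ify|ies|en)$")


def nominalization_subcheck(tokens, threshold=3.0, weight=1.0):
    """Sub-check 2 sketch: tokens are assumed to be lower-cased already."""
    nominalizations = sum(1 for t in tokens if t in NOMINALIZATIONS)
    # Suffix matches that are themselves nominalizations are discarded.
    verbs = sum(
        1 for t in tokens
        if VERB_SUFFIX_RE.search(t) and t not in NOMINALIZATIONS
    )
    ratio = nominalizations / max(verbs, 1)  # assumed: denominator floored at 1
    fired = ratio > threshold and nominalizations > 0
    return nominalizations, verbs, ratio, (weight if fired else 0.0)
```

For the tokens of "we tested the implementation" the sketch finds one nominalization and one suffix-matched verb ("tested"), a ratio of 1.0, and no contribution.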

Sub-check 3, empty strong claims. A sentence that asserts knowledge but cites nothing is the clearest mark of confident emptiness. A sentence triggers this sub-check when it contains an epistemic verb, matched as a substring against a per-language dictionary of epistemic verbs (for example "demonstrates", "proves", "establishes"), yet contains none of the eight anchors. Each such sentence contributes the smaller of 0.3 and the remaining headroom (1.5 minus the running sub-check total), so the sub-check saturates at +1.5 however many empty claims appear, and each flagged sentence is reported at informational severity.
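The saturation rule can be sketched like this. EPISTEMIC_VERBS contains only the three examples quoted above rather than the full dictionary, and the caller is assumed to supply a parallel list of booleans saying whether each sentence is anchored; both the function name and that interface are hypothetical.

```python
# Hypothetical subset of the per-language epistemic-verb dictionary.
EPISTEMIC_VERBS = ("demonstrates", "proves", "establishes")


def empty_claim_subcheck(sentences, anchored, step=0.3, cap=1.5):
    """Sub-check 3 sketch: flag unanchored sentences with an epistemic verb."""
    total = 0.0
    flagged = []  # indices reported at informational severity
    for i, (sentence, has_anchor) in enumerate(zip(sentences, anchored)):
        lowered = sentence.lower()
        # Substring matching, as described in the text.
        if not has_anchor and any(v in lowered for v in EPISTEMIC_VERBS):
            headroom = max(cap - total, 0.0)
            total += min(step, headroom)  # saturates at the cap
            flagged.append(i)
    return total, flagged
```

With a step of 0.3 and a cap of 1.5, the contribution stops growing after five flagged sentences, although later empty claims are still recorded.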

The three contributions are summed and clamped at 5.0. The returned metadata exposes the anchor density, the nominalization count, the verb count, their ratio, the empty-claim count and the sentence count, which together let the score be recomputed from the text.
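As an illustration of how the score follows from the returned metadata, the sketch below recomputes it from the thresholds and weights given above. The dictionary keys are hypothetical names; the actual field names in the metadata are not specified here.

```python
def recompute_c1(meta):
    """Recompute the C1 score from metadata (key names are assumptions)."""
    score = 0.0
    # Sub-check 1: anchor density below the 0.08 floor.
    if meta["anchor_density"] < 0.08:
        score += 1.5
    # Sub-check 2: ratio above 3.0 with at least one nominalization.
    if meta["nominalization_ratio"] > 3.0 and meta["nominalization_count"] > 0:
        score += 1.0
    # Sub-check 3: 0.3 per empty claim, saturating at 1.5.
    score += min(0.3 * meta["empty_claim_count"], 1.5)
    # Final clamp; with these weights the raw total never exceeds 4.0.
    return min(score, 5.0)
```

A document with density 0.05, ratio 4.0 over eight nominalizations, and seven empty claims recomputes to 1.5 + 1.0 + 1.5 = 4.0, the highest value the three sub-checks can reach together.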

Why this matters

Generality is the easiest register for a language model to produce and the hardest for a reader to falsify. Trained to maximise the likelihood of plausible continuations, a model defaults to safe, widely-applicable phrasing that fits almost any context, which is precisely the prose that carries no anchors. Population-scale analysis of scientific writing confirms the drift: Liang and colleagues, studying 1.12 million papers from 2020 to 2024, find that machine-assisted writing shifts measurably toward longer, lower-frequency and more generic word choices, detectable from the lexical surface alone. The nominalization signal has an independent linguistic basis. Halliday's account of grammatical metaphor describes how packing processes into abstract nouns raises the technicality and the abstraction of a text at the same time; an unusually high nominalization-to-verb ratio is a quantitative proxy for that abstraction, and it rises when concrete agents and actions are removed from the prose. The empty-claim signal targets a third pattern documented in the work on stance and hedging: confident epistemic verbs that promise evidence are cheap to generate but expensive to back, and a passage that repeatedly asserts that something is "demonstrated" or "established" without attaching any data is signalling certainty it has not earned.

Score thresholds

Score | Meaning
0 to 1 | Claims are routinely tied to data, citations or figures; concrete verbs carry the prose. Typical of empirical results and methods writing.
2 to 3 | Patches of unsupported generality mixed with anchored passages, or a noun-heavy abstract that states few processes directly.
4 to 5 | Pervasively unanchored, abstract and assertion-heavy prose. Compatible with generic AI drafts, padded introductions, or text that paraphrases a topic without engaging any specific evidence.

Limitations

The anchor regular expressions recognise Latin-script conventions and a fixed set of unit symbols; quantitative claims written out in words, in non-Latin scripts, or with units outside the list are not counted, so a genuinely empirical passage written discursively can read as unanchored. Verb detection is suffix-based and therefore approximate: it over-counts gerunds and participles and under-counts irregular verbs, and the per-language suffix lists cover the platform languages unevenly, so the nominalization ratio is a directional signal rather than a precise part-of-speech count. The nominalization and epistemic-verb dictionaries are curated per language and will miss terms not in the list, while substring matching for epistemic verbs can over-fire on unrelated words that contain a listed form (for example, "improves" contains "proves"). The 100-word gate suppresses scoring on short abstracts, where a single anchorless sentence would otherwise dominate the density. Because all three sub-checks are deterministic surface measures, a writer who sprinkles perfunctory numbers and citations through otherwise empty prose can defeat the anchor-density check without adding real substance.

Theoretical background

C1 draws on three established strands. The first is the dimension of information packaging in register analysis: Biber's multidimensional work showed that texts vary along measurable axes of involvement versus information density, and that markers such as nominalizations and the presence or absence of concrete reference separate informational from generalised prose. The second is Halliday and Martin's theory of grammatical metaphor, which gives the nominalization-to-verb ratio its interpretation as a measure of abstraction rather than of mere style. The third is Hyland's framework of stance and engagement, in which epistemic verbs and boosters are the devices by which a writer claims certainty; C1 treats a booster with no anchor as an unredeemed promise of evidence. Layered on top is the recent quantitative work on how language models shift scientific vocabulary toward the generic, which supplies the contemporary motivation for measuring anchor density at scale.

References

  1. Biber D. Variation across Speech and Writing. Cambridge: Cambridge University Press; 1988. (Multidimensional analysis separating informational from involved/generalised registers, including the role of nominalization and concrete reference.)
  2. Halliday MAK, Martin JR. Writing Science: Literacy and Discursive Power. London: Falmer Press; 1993. (Theory of grammatical metaphor and nominalization as markers of abstraction and technicality.)
  3. Hyland K. Stance and engagement: a model of interaction in academic discourse. Discourse Studies. 2005;7(2):173-192. DOI: 10.1177/1461445605050365
  4. Liang W, Zhang Y, Wu Z, Lepp H, Ji W, Zhao X, Cao H, Liu S, He S, Huang Z, Yang D, Potts C, Manning CD, Zou J. Quantifying large language model usage in scientific papers. Nature Human Behaviour. 2025. DOI: 10.1038/s41562-025-02273-8 https://www.nature.com/articles/s41562-025-02273-8
  5. Peters U, Chin-Yee B. Generalization bias in large language model summarization of scientific research. arXiv preprint arXiv:2504.00025. 2025. https://arxiv.org/abs/2504.00025