ResAIKit
Research Integrity Toolkit
C2 | Text analysis | Stylistic | Layer 1 (Deterministic)

Verbosity

Detects circular paraphrasing, repeated phrasal templates and low lexical variety that signal padded or AI-generated prose.

Technical description

C2 measures redundancy at four lexical scales: token, multi-token sequence, sentence pair and multi-dimensional diversity (moving-average type-token ratio jointly with local content-token recurrence). The indicator does not equate redundancy with raw repetition; it asks whether the text introduces new lexical material as it advances, or whether it cycles through the same surface forms in different grammatical positions. The four sub-checks operate independently and sum into a single 0 to 5 score (the raw sum can reach 5.5 from sub-checks 1, 2, 3 and 4 together; the final score is clamped). Texts below 100 words are not scored, because both type-token estimates and adjacent-sentence overlap require enough material to stabilise.

How it works

The implementation is deterministic and runs at Layer 1. All four sub-checks share the same tokenisation, sentence segmentation and language-specific stopword list, so the marginal cost of running the additional checks after the first is small.

Sub-check 1, mean type-token ratio (MTTTR) on fixed windows. Tokens are lower-cased and the document is sliced into non-overlapping windows of 100 tokens. For each window the ratio of unique tokens to total tokens is computed, and the mean across windows is reported as MTTTR. Windowed type-token ratios were introduced because the raw type-token ratio falls as text length grows, which biases comparisons between short and long texts; averaging over fixed-size windows removes that dependency [1]. The sub-check fires when the document exceeds 300 words and MTTTR drops below 0.35, contributing +1.5 to the score. The 300-word floor avoids triggering on short abstracts where a single repeated technical term can drag the average down.
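A minimal Python sketch of sub-check 1 follows. It is an illustration under stated assumptions, not the toolkit's code: the function names are hypothetical, the word count is taken to be the token count, and a trailing partial window is dropped because the description does not say how the remainder is handled.

```python
import re

# Word-boundary regex quoted in the Limitations section.
TOKEN_RE = re.compile(r"\b\w+\b")


def tokenize(text):
    """Lower-case the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def mtttr(tokens, window=100):
    """Mean type-token ratio over non-overlapping windows of `window` tokens.

    Assumption: a trailing partial window is dropped.
    Returns None when the text does not fill a single window.
    """
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(0, len(tokens) - window + 1, window)
    ]
    return sum(ratios) / len(ratios) if ratios else None


def subcheck_1(text, min_words=300, threshold=0.35, weight=1.5):
    """Return +1.5 when the text exceeds 300 words and MTTTR is below 0.35."""
    tokens = tokenize(text)
    value = mtttr(tokens)
    # Assumption: the word count is approximated by the token count.
    fired = value is not None and len(tokens) > min_words and value < threshold
    return weight if fired else 0.0
```

Dropping the partial window keeps every ratio on the same denominator; an implementation that scores the remainder separately would give slightly different means on short texts.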

Sub-check 2, n-gram repetition after stopword removal. Tokens are filtered against a language-specific stopword list covering the twelve languages supported by the platform; high-frequency function words such as English the, Spanish el, Japanese の or Russian и are removed so that repeated grammatical scaffolding does not dominate the signal. The repetition rate is then computed as the fraction of distinct n-gram types that appear more than once, for n = 3 and n = 4. Trigram rates above 5% contribute +1.0; four-gram rates above 3% contribute an additional +0.5. The asymmetric thresholds reflect that four-gram collisions are rarer in natural prose and therefore more diagnostic of a verbatim or near-verbatim repetition.
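The repetition rate can be sketched in Python as below. The stopword set is a hypothetical, heavily truncated stand-in for the platform's per-language lists, the function names are illustrative, and the sketch assumes the trigram and four-gram thresholds are tested independently of each other, which the description implies but does not state outright.

```python
from collections import Counter

# Hypothetical, truncated stopword list; the platform's real lists are not
# reproduced in the source.
STOPWORDS_EN = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}
)


def ngram_repetition_rate(tokens, n):
    """Fraction of distinct n-gram types that occur more than once."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not counts:
        return 0.0
    repeated = sum(1 for c in counts.values() if c > 1)
    return repeated / len(counts)


def subcheck_2(tokens, stopwords=STOPWORDS_EN):
    """Return +1.0 for a trigram rate above 5% and +0.5 for a 4-gram rate above 3%."""
    content = [t for t in tokens if t not in stopwords]
    score = 0.0
    if ngram_repetition_rate(content, 3) > 0.05:
        score += 1.0
    # Assumption: the four-gram check does not require the trigram check to fire.
    if ngram_repetition_rate(content, 4) > 0.03:
        score += 0.5
    return score
```

Because the rate is measured over distinct n-gram types rather than occurrences, one phrase repeated fifty times counts the same as one repeated twice.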

Sub-check 3, adjacent-sentence content overlap. The text is split into sentences by a regular expression that fires on ., ! or ? followed by whitespace, with an abbreviation guard list for English and a Romanian fallback list for other Latin-script languages. Each adjacent pair (s_i, s_{i+1}) is then compared on the set of non-stopword tokens. Overlap is computed as |A ∩ B| / min(|A|, |B|), a ratio normalised by the smaller of the two sets, so a short sentence wrapped inside a longer paraphrase scores as high as an exact repeat. A pair with overlap above 0.60 is flagged as a redundant restatement; the score contribution is (high_overlap_pairs / max(sentence_count − 1, 1)) × 2.0, so a document where every sentence echoes its predecessor accrues the full +2.0, while one where only every other sentence does so accrues about +1.0.
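The pairwise comparison can be sketched in Python as follows. This is a simplified illustration: the splitter omits the abbreviation guard lists, the stopword set is a hypothetical truncated stand-in, and pairs in which either sentence has no content tokens are assumed to score zero.

```python
import re

# Hypothetical, truncated stopword list used only for this illustration.
STOPWORDS_EN = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}
)

# Split after ., ! or ? followed by whitespace (no abbreviation guard here).
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")
TOKEN_RE = re.compile(r"\b\w+\b")


def content_set(sentence, stopwords):
    """Set of lower-cased non-stopword tokens in a sentence."""
    return {t for t in TOKEN_RE.findall(sentence.lower()) if t not in stopwords}


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|); assumed 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def subcheck_3(text, stopwords=STOPWORDS_EN, threshold=0.60, weight=2.0):
    """Share of adjacent sentence pairs above the threshold, scaled to 0..2.0."""
    sentences = [s for s in SENT_SPLIT_RE.split(text.strip()) if s]
    sets = [content_set(s, stopwords) for s in sentences]
    high = sum(1 for a, b in zip(sets, sets[1:]) if overlap(a, b) > threshold)
    return high / max(len(sentences) - 1, 1) * weight
```

For example, a three-sentence text whose second sentence repeats the first with one added word, followed by an unrelated sentence, has one flagged pair out of two.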

Sub-check 4, multi-dimensional diversity ensemble (MATTR + local content-token dispersion). The same lower-cased token sequence is used to compute the moving-average type-token ratio (MATTR) on overlapping windows of fifty tokens, and the local recurrence rate on the content-token sequence (stopwords removed). A content token is counted as a local repeat if the same surface form appears within the twenty preceding or twenty following content positions. Dispersion is the proportion of content tokens that are local repeats. Sub-check 4 fires only when all three conditions hold: the document exceeds 300 words, MATTR is below 0.55, and dispersion is above 0.30. The conjunction is intentional, because either signal alone is too easy to trip on technical writing (low MATTR alone) or on routine grammatical scaffolding (high dispersion alone). When the conditions hold together, the document combines low overall lexical variety with tight local recurrence of content words, the pattern that the 2025 diversity-detection literature reports as the strongest single AI-versus-human stylometric signal [9]. A firing contributes +0.5 to the score.
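A minimal Python sketch of the two measures and their conjunction is given below. It assumes the MATTR window advances one token at a time (the usual definition, not confirmed by the description) and that the word count equals the token count; the function names are hypothetical.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio over overlapping windows.

    Assumption: the window advances one token at a time.
    Returns None when the text is shorter than one window.
    """
    if len(tokens) < window:
        return None
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def local_dispersion(content_tokens, radius=20):
    """Proportion of content tokens whose surface form recurs within
    `radius` content positions before or after them."""
    if not content_tokens:
        return 0.0
    repeats = 0
    for i, tok in enumerate(content_tokens):
        before = content_tokens[max(0, i - radius):i]
        after = content_tokens[i + 1:i + 1 + radius]
        if tok in before or tok in after:
            repeats += 1
    return repeats / len(content_tokens)


def subcheck_4(tokens, stopwords, min_words=300):
    """Return +0.5 only when all three conditions hold together."""
    content = [t for t in tokens if t not in stopwords]
    m = mattr(tokens)
    fired = (
        len(tokens) > min_words
        and m is not None
        and m < 0.55
        and local_dispersion(content) > 0.30
    )
    return 0.5 if fired else 0.0
```

The window scan is quadratic in the worst case, which is acceptable for a sketch; a production version would more likely maintain a running type count as the window slides.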

The four contributions are summed and clamped at 5.0; the raw maximum from all four sub-checks together is 5.5 (1.5 + 1.0 + 0.5 + 2.0 + 0.5), so the clamp can engage when sub-check 3 contributes close to its full +2.0 alongside the other firings. Findings include character offsets for each flagged passage so the user interface can highlight either the first 200 characters of the document (for the global MTTTR, n-gram and diversity-ensemble flags) or the specific restating sentence (for the overlap flag).
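The aggregation step can be sketched in a few lines of Python. The function name and the use of None for unscored texts are assumptions made for illustration; only the sum, the 5.0 clamp and the 100-word floor come from the description.

```python
def c2_score(sub1, sub2, sub3, sub4, word_count, min_words=100, cap=5.0):
    """Sum the four sub-check contributions and clamp the result at `cap`.

    Returns None for texts below the scoring floor (assumed representation
    of "not scored").
    """
    if word_count < min_words:
        return None
    return min(sub1 + sub2 + sub3 + sub4, cap)


# All four sub-checks at their maximum: 1.5 + (1.0 + 0.5) + 2.0 + 0.5 = 5.5,
# which the clamp reduces to 5.0.
full = c2_score(1.5, 1.5, 2.0, 0.5, word_count=1000)
```

The clamp only changes the result when sub-check 3 contributes more than 1.5 while the other three sub-checks all fire.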

Why this matters

Generative models repeat themselves for architectural reasons documented in the decoding literature. Under low-temperature or restrictive top-p sampling, the model collapses onto high-probability continuations and produces what Holtzman and colleagues called neural text degeneration: long stretches of fluent but redundant prose that loop back on the same phrasal templates [2]. Even when sampling is open enough to avoid outright degeneration, transformer-based language models exhibit a typewise repetition bias because their training objective averages over many similar contexts, smoothing the output distribution and reducing the chance of low-frequency word choices appearing in expected positions [3]. The downstream effect on academic prose is exactly the pattern C2 targets: paragraphs that restate the heading, sentences that paraphrase their predecessor, and sequences of trigrams that recur with minor permutation across an otherwise long manuscript.

Population-level analyses confirm that the lexical surface of scientific writing modified by large language models (LLMs) drifts toward a narrower vocabulary band. Liang and colleagues, in a corpus of 1.12 million arXiv, bioRxiv and Nature-portfolio papers between January 2020 and September 2024, report a consistent shift toward longer, lower-frequency and more generic word choices, with the modification detectable from word-frequency distributions alone and estimated at 22.5% of computer-science abstracts and 19.5% of computer-science introductions by September 2024 [4]. Finlayson and colleagues, in a 2024 follow-up to Holtzman, isolate the specific decoding mechanism responsible for the residual repetition that survives nucleus sampling, showing that the sampling tail truncation does not eliminate the typewise bias of the underlying model [7]. In a controlled experiment on hotel reviews, Markowitz and colleagues found that machine-generated prose was more analytic and less readable than human-written prose; the discriminating signal was a combination of style features rather than any single marker, with classification accuracies exceeding 80% on style-only models [5]. C2 contributes the lexical-diversity slice of that combination.

Score thresholds

Score: Meaning
0 to 1: High lexical variety, varied phrasing, no obvious paraphrastic loops. Typical of original prose that earns its length with new content.
2 to 3: Patches of redundancy mixed with substantive passages. Common in undergraduate writing and in introductions that recapitulate the abstract.
4 to 5: Pervasive lexical reuse and adjacent-sentence echoes. Compatible with LLM-generated drafts produced at low temperature, with text that was paraphrased to defeat duplicate detection, or with manuscripts inflated to meet a length requirement.

Limitations

Texts under 100 words are not scored, and texts between 100 and 300 words bypass the MTTTR check because at most three 100-token windows are too few for the mean to converge. Highly technical writing legitimately reuses precise terminology and can produce moderate MTTTR readings without any padding; the n-gram and overlap checks compensate by ignoring exact-term repetition outside of stopword-filtered phrasal sequences. The adjacent-sentence overlap heuristic is symmetric within a pair but does not detect long-range echoes, so a paragraph that restates the abstract several pages later will be missed. The stopword lists are curated by language family rather than by domain; a domain-specific filler word such as "patient" in medical writing is treated as content and can inflate the overlap ratio in clinical reports. Sentence segmentation recognises only Latin-script sentence terminators (., ! or ? followed by whitespace) and uses English abbreviation guards, with a Romanian fallback list for the remaining Latin-script languages. It has no specific handling for Chinese, Japanese or Korean (CJK) full-width punctuation (。, !, ?), so text punctuated exclusively with CJK marks can collapse into a single pseudo-sentence and silence the overlap sub-check. The tokeniser, based on the Unicode word-boundary regex \b\w+\b, also produces very few tokens on space-less CJK scripts, which makes the MTTTR estimate unreliable for Chinese and Japanese manuscripts even though the word count, computed separately as a CJK character count, can exceed the 300-word floor.
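The CJK limitation is easy to reproduce with Python's re module. The snippet below is a sketch: the token regex is the one quoted above, while the sentence splitter is a simplified stand-in without the abbreviation guards, and both sample strings are invented for illustration.

```python
import re

TOKEN_RE = re.compile(r"\b\w+\b")
# Simplified splitter: ., ! or ? followed by whitespace, no abbreviation guard.
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")

english = "This is one sentence. This is another one."
chinese = "这是第一句话。这是第二句话。"  # two sentences, twelve characters


def count_tokens(text):
    """Number of matches of the word-boundary regex."""
    return len(TOKEN_RE.findall(text))


def count_sentences(text):
    """Number of non-empty segments produced by the Latin-script splitter."""
    return len([s for s in SENT_SPLIT_RE.split(text.strip()) if s])


# Each unbroken run of CJK characters matches \w+ as a single token, and the
# full-width stop never triggers the splitter.
summary = {
    "english": (count_tokens(english), count_sentences(english)),
    "chinese": (count_tokens(chinese), count_sentences(chinese)),
}
```

The English sample yields eight tokens in two sentences, while the Chinese sample yields two tokens in one pseudo-sentence, which is why both the windowed ratios and the overlap check degrade on such text.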

Sub-check 3 is vulnerable to paraphrase attacks. An adversary who runs the document through a second LLM with a paraphrasing prompt can preserve meaning while rewriting the surface form of each sentence, which drops the lexical overlap below the 0.60 threshold even when the underlying restatement pattern is intact. The recent literature on paraphrase as an evasion technique against AI-text detectors confirms that lexical-overlap signals degrade quickly once a paraphraser is interposed [8]; a semantic-embedding variant of sub-check 3 would be needed to catch paraphrased redundancy at Layer 4. Sub-check 2 is partly resistant to nucleus sampling: although top-p sampling truncates the unreliable tail of the next-token distribution and reduces the most pathological forms of repetition, the typewise bias of the underlying model remains, and trigram/4-gram collisions still accumulate at rates above the 5% and 3% thresholds in long generations [7]. The score thresholds were calibrated against 2024-2025 LLM output; as generative models continue to converge on human-style lexical variation, the fixed thresholds will yield lower true-positive rates and will need periodic recalibration.

Theoretical background

Lexical diversity has been a stylometric primitive since Mosteller and Wallace used Bayesian inference on function-word frequencies to settle the authorship of the disputed Federalist Papers [6]. Their finding that authorial fingerprints survive paraphrase and topic shift, but show up reliably in low-content tokens, motivated the use of stopword filtering in modern repetition metrics. McCarthy and Jarvis later refined the type-token ratio into the Measure of Textual Lexical Diversity (MTLD), a length-invariant lexical-diversity index that converges to a stable value rather than decaying with text length; MTTTR sits between the raw type-token ratio and MTLD on the simplicity-versus-stability axis and is preferred here for its transparent computation [1]. The third strand comes from Holtzman and colleagues, whose demonstration that neural language models suffer from degeneration under restrictive sampling provided the mechanistic explanation for why LLM output looks redundant in the first place [2], and from the 2024 follow-up by Finlayson and colleagues that closes the original case by identifying the decoding-side mechanism responsible for the residual typewise bias [7]. The fourth strand is the 2025 multi-dimensional diversity work that benchmarks support vector machine (SVM) classifiers on six diversity dimensions (volume, abundance, MATTR, evenness, WordNet-based disparity and local repetition dispersion) and reports above-97% accuracy in discriminating LLM from human text using these features [9]; sub-check 4 takes the two simplest of those dimensions (MATTR and dispersion) and operationalises them at Layer 1 with the same hard-threshold conjunction used by sub-check 1 on MTTTR. 
C2 combines these threads: a stable lexical-diversity estimator (sub-check 1), a sequence-level repetition counter (sub-check 2), a sentence-pair overlap measure (sub-check 3) and a multi-dimensional diversity ensemble (sub-check 4), each one targeting a different scale at which the degeneration manifests.

References

  1. McCarthy PM, Jarvis S. MTLD, vocd-D, and HD-D: a validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods. 2010;42(2):381-392. DOI: 10.3758/BRM.42.2.381 https://link.springer.com/article/10.3758/BRM.42.2.381
  2. Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). 2020. https://arxiv.org/abs/1904.09751
  3. Welleck S, Kulikov I, Roller S, Dinan E, Cho K, Weston J. Neural text generation with unlikelihood training. International Conference on Learning Representations (ICLR). 2020. https://arxiv.org/abs/1908.04319
  4. Liang W, Zhang Y, Wu Z, Lepp H, Ji W, Zhao X, Cao H, Liu S, He S, Huang Z, Yang D, Potts C, Manning CD, Zou J. Quantifying large language model usage in scientific papers. Nature Human Behaviour. 2025. DOI: 10.1038/s41562-025-02273-8 https://www.nature.com/articles/s41562-025-02273-8
  5. Markowitz DM, Hancock JT, Bailenson JN. Linguistic markers of inherently false AI communication and intentionally false human communication: evidence from hotel reviews. Journal of Language and Social Psychology. 2024;43(1):63-82. DOI: 10.1177/0261927X231200201 https://journals.sagepub.com/doi/10.1177/0261927X231200201
  6. Mosteller F, Wallace DL. Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley; 1964. (Foundational application of stylometric analysis to authorship attribution using function-word frequencies.)
  7. Finlayson M, Hewitt J, Koller A, Swayamdipta S, Sabharwal A. Closing the curious case of neural text degeneration. International Conference on Learning Representations (ICLR). 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/34899013589ef41aea4d7b2f0ef310c1-Paper-Conference.pdf. (Identifies the decoding-side mechanism behind the residual typewise repetition that survives nucleus sampling.)
  8. Krishna K, Song Y, Karpinska M, Wieting J, Iyyer M. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems (NeurIPS). 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/575c450013d0e99e4b0ecf82bd1afaa4-Abstract-Conference.html. (Empirical evidence that surface-level lexical-overlap detectors fail under paraphrase attack.)
  9. Anonymous. Diversity boosts AI-generated text detection. arXiv preprint arXiv:2509.18880. 2025. https://arxiv.org/abs/2509.18880. (Six-dimensional lexical-diversity benchmark including MATTR and local repetition dispersion as the basis of an SVM classifier above 97% accuracy.)