Artificial Structure
Detects unnaturally rigid formal organisation: uniform paragraph lengths, uniform sentence lengths, stereotypical list sizes and recurring generic section headings.
Technical description
C3 quantifies formal symmetry at four structural scales (paragraph, sentence, list, section heading) and adds two section-conditioned variants that test the same paragraph and sentence symmetry inside individual IMRaD (Introduction, Methods, Results and Discussion) sections. The indicator does not penalise structure as such; it asks whether the surface organisation of the document varies in the way human prose normally varies, or whether it has collapsed onto a template. The six sub-checks operate independently and sum into a single 0 to 5 score. Four of the six require enough material to produce a meaningful estimate, so they are gated on minimum counts (three paragraphs, five sentences, three lists, and the same minima reapplied inside each IMRaD section); the heading check has no such floor because it counts discrete dictionary matches rather than estimating a distribution.
How it works
The implementation is deterministic and runs at Layer 1. Tokenisation, sentence segmentation and paragraph splitting are shared with the other C-series indicators, so the cost is dominated by the regex passes used to detect bullets, numbered items and Markdown headings.
Sub-check 1, coefficient of variation of paragraph lengths. Paragraphs are split on blank-line boundaries by the shared preprocessor; paragraphs with zero word count are discarded. When at least three non-empty paragraphs remain, the per-paragraph word count is collected and the coefficient of variation CV = std / mean is computed using the population estimator for the mean and the sample estimator for the standard deviation. A CV below 0.20 fires the sub-check and contributes +1.5 to the score. The 0.20 threshold corresponds to paragraphs whose lengths cluster within roughly ±20% of the mean, a tightness that is unusual outside boilerplate reports and template-driven writing.
Sub-check 2, coefficient of variation of sentence lengths. Sentences are produced by the shared splitter, which fires on ., ! or ? followed by whitespace, with an abbreviation guard list for English and a Romanian fallback list for the other Latin-script languages. When at least five non-empty sentences remain, the per-sentence word count is collected and the same CV estimator is applied. A CV below 0.25 fires the sub-check and contributes +1.0 to the score. The threshold is looser than for paragraphs because sentence lengths in natural prose vary more than paragraph lengths; collapsing below 0.25 is a stronger signal than the same value would be on a longer unit.
Sub-check 3, stereotypical list templates. The text is split into raw paragraphs on \n\s*\n and each paragraph is scanned with two multi-line regular expressions: ^[\s]*[-*•]\s for bullet items (hyphen, asterisk or Unicode bullet •) and ^[\s]*\d+[.)]\s for numbered items (a digit run followed by . or )). Any paragraph that contains at least one bullet or numbered item counts as one list, and the number of items in that paragraph is recorded. When more than two such lists are present and every single one of them has exactly three items, or every single one has exactly five items, the sub-check fires and contributes +1.0 to the score. The 3-and-5 specialisation reflects an empirical observation about generative output: the top-of-mind list lengths that recur in unprompted large language model (LLM) responses cluster strongly on 3, less strongly on 5, and rarely on other small integers without an explicit numerical prompt.
Sub-check 4, generic section headings. Each non-empty line of the text is stripped, freed of any leading Markdown heading marker (# through ######), and tested for two conditions: a length below ten words and a case-insensitive exact match against a per-language dictionary of generic section titles (representative English entries: introduction, background, methods, methodology, results, discussion, conclusion, conclusions, limitations, implications, future directions, recommendations, summary, overview, key findings, literature review). The match is anchored at the full line, not at any substring, so a sentence that happens to contain the word introduction is not flagged. When four such headings are found in the document, a single finding is emitted on the fourth match and +0.5 is added to the score. Additional generic headings beyond the fourth are counted into metadata but do not contribute further points, which keeps the heading slice bounded and prevents the indicator from saturating on legitimately long, well-sectioned reviews.
Sub-check 5, paragraph CV inside an IMRaD section. The text is partitioned into IMRaD sections using the section-header regex shared with the exclusion-zone detector; recognised section names cover Abstract, Introduction, Background, Methods (including the variants Materials and Methods and Study Design), Results (including Findings), Discussion and Conclusions. For each section that contains at least three non-empty paragraphs, the same per-paragraph word count and CV estimator from sub-check 1 are applied to the section body alone. The first section whose paragraph CV falls below 0.15 fires the sub-check and contributes +0.5 to the score. The threshold is tighter than the 0.20 global threshold because section-level content is naturally more homogeneous than the document as a whole, so the same CV value carries a stronger signal when computed inside a single section. Subsequent uniform sections are recorded in metadata but do not add further points, which keeps the section slice bounded.
Sub-check 6, sentence CV inside an IMRaD section. The same partitioning is reused, and for each section with at least five non-empty sentences the per-sentence word count and CV estimator from sub-check 2 are applied to the section body. The first section whose sentence CV falls below 0.20 fires the sub-check and contributes +0.5 to the score. The 0.20 threshold is tighter than the global 0.25 by the same logic as sub-check 5. The two section-conditioned sub-checks operationalise the IMRaD-section stylistic clustering proposed by Yin and Wang [7], who show that section-conditioned stylistic statistics amplify human–AI separability and reduce topic-dependence relative to document-level statistics.
The six contributions are summed and clamped at 5.0. The maximum reachable from the six sub-checks together is exactly 5.0 (1.5 + 1.0 + 1.0 + 0.5 + 0.5 + 0.5). Each fired sub-check emits a finding with character offsets so the user interface can highlight the offending span: the first 200 characters of the document for the global paragraph and sentence flags, the first list item for the list flag, the fourth generic heading itself for the heading flag, and the first 200 characters of the offending section body for the two section-conditioned flags.
Why this matters
The distribution of sentence lengths and paragraph lengths discriminates between authors, and within a single author it remains stable across topics, registers and translations. Mendenhall's nineteenth-century work on the "characteristic curves of composition" showed that the histogram of word-lengths within an author's prose is stable across works and discriminates one author from another with high reliability; later, the same logic was extended to sentence-length distributions, which proved equally idiosyncratic and equally stable across topics [1]. The diagnostic value of this older stylometric tradition is that uniformity is not the default state of human writing: even within a single paragraph, a writer who is not consciously imitating a template will lengthen and shorten clauses to manage emphasis, pacing and reader attention. When the surface of a manuscript becomes flat in any of these dimensions, the most parsimonious explanation is that the text was produced under a constraint that suppresses variation, whether that constraint is a strict house style, a fillable template or a generative model trained to produce orderly responses.
Generative language models tend toward formal symmetry for reasons that are mechanical rather than aesthetic. Under restrictive decoding, transformer-based models collapse onto high-probability continuations and produce what Holtzman and colleagues called neural text degeneration: stretches of fluent prose whose surface statistics drift away from natural language [2]. Reinforcement learning from human feedback amplifies the effect at the structural level: instruction-tuned models are rewarded for producing clear, well-organised answers, and the cheapest way for a policy to look organised is to default to a template. Population-level analyses of scientific writing make the drift visible: Liang and colleagues, in a corpus of 1.12 million arXiv, bioRxiv and Nature-portfolio papers between January 2020 and September 2024, report a consistent shift in the word-frequency distribution toward LLM-typical vocabulary, with the modification estimated at 22.5% of computer-science abstracts and 19.5% of computer-science introductions by September 2024; mathematics shows the smallest effect at 7.7% and 4.1%, electrical engineering tracks close to computer science at 18.0% and 18.4%, and the Nature-portfolio journals together come in at 8.9% and 9.4% [3]. In a controlled comparison on hotel reviews, Markowitz and colleagues found that machine-generated text was measurably more analytic and less readable than human-written text, with classification accuracies on style-only features exceeding 80%; the discriminating signal was a combination of style features rather than any single marker [4]. C3 contributes the structural-uniformity slice of that combination, complementing C1 (anchor density) and C2 (lexical reuse).
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Natural variation in paragraph and sentence lengths, irregular list sizes, content-driven section titles. Typical of original prose that earns its structure from the underlying argument. |
| 2 to 3 | One or two surface dimensions collapse onto a template, often paragraphs or sentence lengths in a section that has been heavily reworked. Common in well-edited review articles and in introductions that follow a fixed disciplinary pattern. |
| 4 to 5 | Multiple surface dimensions collapse simultaneously: uniform paragraphs, uniform sentences and a fully generic section skeleton. Compatible with LLM-generated drafts produced from a template prompt, with manuscripts assembled by mechanical paraphrase of an outline, or with boilerplate reports that were never meant to read as original prose. |
Limitations
The paragraph-CV sub-check requires at least three non-empty paragraphs and the sentence-CV sub-check requires at least five non-empty sentences; documents shorter than these thresholds are returned with the corresponding metadata fields set to null and no contribution to the score. The CV estimator is itself unstable on small samples: a document with exactly three short paragraphs of similar length can fire sub-check 1 spuriously, and the threshold should be read as a directional signal rather than as a precise variance test. The list-detection regex recognises only the three bullet glyphs -, *, • and only digit runs followed by . or ); alphabetic enumerators such as a. or i., Roman numerals, and other Unicode bullet glyphs (▪, ▶, ◦) are silently ignored, so a document that mixes enumerator styles can defeat the uniformity check even when the underlying templating is severe. The 3-and-5 specialisation in the list sub-check is also genre-sensitive: clinical-trial inclusion and exclusion criteria, executive-summary bullet decks, and recipe-style technical instructions often produce uniform list lengths for reasons unrelated to AI authorship. The generic-heading sub-check uses exact-match dictionary lookup at the line level; translations not present in the dictionary, paraphrased headings ("Materials and Procedure" instead of "Methods"), and headings combined with subtitle text ("Discussion: clinical implications") do not match and are not counted. Sentence segmentation inherits the same limitations as in C1 and C2: it only recognises Latin-script sentence terminators with English and Romanian abbreviation guards, with no specific handling for Chinese, Japanese or Korean punctuation (。, !, ?); Chinese, Japanese and Korean (CJK) manuscripts can therefore collapse into a single pseudo-sentence and silence sub-check 2 entirely, while paragraph and list detection continue to operate normally on the whitespace structure of the document. The score thresholds in this indicator were calibrated against 2024-2025 LLM output; as generative models produce more variable paragraph and sentence lengths over time, the fixed thresholds will yield lower true-positive rates and will need periodic recalibration.
Theoretical background
C3 sits at the intersection of three traditions. The first is classical stylometry, in which Mendenhall's characteristic-curves work and Yule's later statistical study of literary vocabulary established that the distribution of structural units (words, sentences, paragraphs) carries authorial information that survives translation and topic shift [1, 5]. The coefficient of variation is the simplest summary statistic of that distribution, and its use as a stylistic marker predates computational linguistics by several decades; Yule treated it as the natural measure of "evenness" against which authentic prose could be distinguished from boilerplate. The second tradition is Mosteller and Wallace's work on disputed authorship in the Federalist Papers, which demonstrated that low-content features (function words, sentence-length distribution) discriminate authors more reliably than topical content does [6]; their methodology motivates the choice in C3 to measure surface-level uniformity rather than thematic similarity. The third tradition is the contemporary literature on neural text generation. Holtzman and colleagues identified the mechanism by which transformer models collapse onto repetitive patterns under restrictive decoding [2], and Markowitz and colleagues showed empirically that the resulting prose is distinguishable from human prose at the structural level with classification accuracies above 80% [4]. C3 operationalises the structural slice of that finding: where C2 measures lexical reuse and C1 measures anchor density, C3 measures the geometric regularity of the document itself.
References
- Mendenhall TC. The characteristic curves of composition. Science. 1887;9(214S):237-246. DOI: 10.1126/science.ns-9.214S.237 https://www.science.org/doi/10.1126/science.ns-9.214S.237
- Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). 2020. https://arxiv.org/abs/1904.09751
- Liang W, Zhang Y, Wu Z, Lepp H, Ji W, Zhao X, Cao H, Liu S, He S, Huang Z, Yang D, Potts C, Manning CD, Zou J. Quantifying large language model usage in scientific papers. Nature Human Behaviour. 2025. DOI: 10.1038/s41562-025-02273-8 https://www.nature.com/articles/s41562-025-02273-8
- Markowitz DM, Hancock JT, Bailenson JN. Linguistic markers of inherently false AI communication and intentionally false human communication: evidence from hotel reviews. Journal of Language and Social Psychology. 2024;43(1):63-82. DOI: 10.1177/0261927X231200201 https://journals.sagepub.com/doi/10.1177/0261927X231200201
- Yule GU. The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University Press; 1944. (Foundational treatment of sentence-length and vocabulary distributions as authorship-discriminating features.)
- Mosteller F, Wallace DL. Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley; 1964. (Bayesian analysis of function-word frequencies establishing low-content features as the most reliable stylometric markers.)
- Yin Z, Wang S. Span-level detection of AI-generated scientific text via contrastive learning and structural calibration. arXiv preprint arXiv:2510.00890. 2025. https://arxiv.org/abs/2510.00890