ResAIKit
Research Integrity Toolkit
C5 | Text analysis | Stylistic | Layer 2 (Contextual)

Voice Variation

Detects unnaturally uniform authorial tone across the document. Human writers shift voice between sections naturally; AI-generated text maintains a flat, consistent stylistic temperature throughout.

Technical description

C5 operationalises the "voice variation" dimension from the Anti-AI Vibe Review spec. It measures four aspects of tonal consistency: (1) the across-window standard deviation of evaluative adjective density, (2) the presence or absence of first-person authorial stance markers, (3) part-of-speech register dominance, and (4) voice uniformity across IMRaD (Introduction, Methods, Results, and Discussion) sections. The four sub-checks operate independently, and their contributions sum to a single score from 0 to 5. The indicator runs at Layer 2 because it requires sentence segmentation, part-of-speech tagging, and lemmatisation.

How it works

Sub-check 1, tone variation per window. The text is segmented into windows of three consecutive sentences. For each window, the evaluative adjective ratio is computed as the count of adjectives whose lemma appears in the per-language evaluative adjectives dictionary divided by the total token count. The standard deviation of these ratios across all windows is then calculated. When the standard deviation falls below 0.01, the indicator fires: every window has essentially the same density of evaluative language, which is consistent with a model that applies a fixed stylistic temperature to the entire document. Contributes +2.0 to the score.
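A minimal sketch of sub-check 1, assuming the text has already been segmented and tagged into sentences of (lemma, POS) pairs. The function names, the `ADJ` tag, the sliding step of one sentence, and the use of the population standard deviation are assumptions for illustration; the source does not specify them. The adjective set is an illustrative subset, not the toolkit's actual dictionary.

```python
from statistics import pstdev

# Illustrative subset only; the real per-language dictionary is not reproduced here.
EVALUATIVE_ADJECTIVES = {
    "good", "important", "essential", "crucial", "vital", "significant",
    "major", "key", "critical", "fundamental", "central", "primary",
}


def evaluative_ratio(tokens):
    """Evaluative adjectives divided by total tokens; tokens are (lemma, pos) pairs."""
    if not tokens:
        return 0.0
    hits = sum(
        1 for lemma, pos in tokens
        if pos == "ADJ" and lemma.lower() in EVALUATIVE_ADJECTIVES
    )
    return hits / len(tokens)


def window_ratios(sentences, size=3, step=1):
    """Ratio per window of `size` consecutive sentences (step=1 is an assumption)."""
    ratios = []
    for start in range(0, len(sentences) - size + 1, step):
        window = [tok for sent in sentences[start:start + size] for tok in sent]
        ratios.append(evaluative_ratio(window))
    return ratios


def tone_variation_fires(sentences, threshold=0.01):
    """True when the spread of window ratios falls below the threshold."""
    ratios = window_ratios(sentences)
    if len(ratios) < 2:
        return False  # assumption: too few windows to measure spread
    return pstdev(ratios) < threshold
```

When this returns True, the sub-check would contribute its +2.0 to the score. Setting `step=3` gives non-overlapping windows instead, if that reading of "windows of three consecutive sentences" is the intended one.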

Sub-check 2, authorial voice markers. The document is scanned for first-person pronouns (English: I, we, my, our, me, us; Romanian: eu, noi, meu, mea, noastra, nostru) and authorial verb lemmas (English: observe, argue, propose, believe, consider, suggest, contend, maintain; Romanian: observ, argumentez, propun, consider, sustin, mentin). On documents longer than 1000 words, the complete absence of both pronoun and verb markers triggers the sub-check. The 1000-word gate exists because short documents can legitimately lack authorial stance without raising suspicion. Contributes +1.0 to the score.
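The gate-and-scan logic of sub-check 2 can be sketched as follows. The marker lists are copied from the description above; the function name, the (surface form, lemma) token shape, and the choice to count every token as a word are assumptions, not the toolkit's confirmed behaviour.

```python
# Marker lists as given in the description; spellings kept as listed there.
PRONOUNS = {
    "en": {"i", "we", "my", "our", "me", "us"},
    "ro": {"eu", "noi", "meu", "mea", "noastra", "nostru"},
}
AUTHORIAL_VERB_LEMMAS = {
    "en": {"observe", "argue", "propose", "believe", "consider",
           "suggest", "contend", "maintain"},
    "ro": {"observ", "argumentez", "propun", "consider", "sustin", "mentin"},
}


def authorial_voice_absent(tokens, lang="en", min_words=1000):
    """tokens: list of (surface_form, lemma) pairs, assumed to be words only.

    Returns True only when the document is longer than `min_words` and
    contains neither a first-person pronoun nor an authorial verb lemma.
    """
    if len(tokens) <= min_words:
        return False  # short documents are exempt from this sub-check
    has_pronoun = any(form.lower() in PRONOUNS[lang] for form, _ in tokens)
    has_verb = any(
        lemma.lower() in AUTHORIAL_VERB_LEMMAS[lang] for _, lemma in tokens
    )
    return not has_pronoun and not has_verb
```

Note that lowercasing surface forms means an acronym such as "US" would match the pronoun "us" in this sketch; a real implementation might restrict the match to tokens tagged as pronouns.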

Sub-check 3, register diversity. The global part-of-speech (POS) distribution is computed from the token stream. The ratio of nouns and the combined ratio of adjectives plus adverbs are compared. When one category exceeds 50% of tokens while the other falls below 5%, the text is flagged as mono-register: either noun-heavy (characteristic of template-generated descriptive prose) or adjective/adverb-heavy (characteristic of promotional or public-relations style text). Contributes +1.0 to the score.
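As an illustration of the rule above, the following sketch classifies a flat list of POS tags. The tag names (`NOUN`, `ADJ`, `ADV`) follow the Universal POS convention, and treating proper nouns as outside the noun category is an assumption; the function name and return labels are hypothetical.

```python
from collections import Counter


def register_dominance(pos_tags, high=0.50, low=0.05):
    """Return 'noun-heavy', 'modifier-heavy', or None for a list of POS tags."""
    total = len(pos_tags)
    if total == 0:
        return None
    counts = Counter(pos_tags)
    noun_ratio = counts["NOUN"] / total
    modifier_ratio = (counts["ADJ"] + counts["ADV"]) / total
    # Mono-register: one category dominates while the other nearly vanishes.
    if noun_ratio > high and modifier_ratio < low:
        return "noun-heavy"
    if modifier_ratio > high and noun_ratio < low:
        return "modifier-heavy"
    return None
```

Any non-None result would correspond to the sub-check firing and contributing +1.0.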

Sub-check 4, cross-section voice uniformity. The text is partitioned into IMRaD sections using the shared section classifier. For each section with content, the evaluative adjective ratio is computed independently (same formula as sub-check 1 but applied per section body rather than per sliding window). When at least three sections are present and the standard deviation of their evaluative ratios falls below 0.008, the indicator fires. Human authors naturally use more evaluative language in Discussion than in Methods; a near-zero standard deviation across sections with different rhetorical purposes is a strong signal of template-driven generation. Contributes +1.0 to the score.
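A sketch of the cross-section comparison, assuming the section classifier and the evaluative-adjective counting have already run and produced, for each section, a pair (evaluative adjective count, total token count). That input shape, the function name, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import pstdev


def cross_section_uniform(section_counts, threshold=0.008, min_sections=3):
    """section_counts maps section name -> (evaluative_adj_count, token_count).

    Returns True when at least `min_sections` sections have content and the
    standard deviation of their evaluative ratios is below `threshold`.
    """
    ratios = [
        hits / total
        for hits, total in section_counts.values()
        if total > 0  # only sections with content take part
    ]
    if len(ratios) < min_sections:
        return False
    return pstdev(ratios) < threshold
```

For example, four sections with ratios of 0.012, 0.011, 0.012, and 0.013 have a spread well under 0.008 and would fire, whereas a document whose Discussion ratio is many times its Methods ratio would not.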

The four contributions sum to a theoretical maximum of 5.0 (2.0 + 1.0 + 1.0 + 1.0), with a hard clamp at 5.0.
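The aggregation can be sketched as a weighted sum of four boolean outcomes. The weights and the clamp come from the description above; the function and parameter names are hypothetical.

```python
def c5_score(tone_flat, no_authorial_voice, mono_register, sections_uniform):
    """Combine the four sub-check outcomes (booleans) into the 0 to 5 score."""
    score = 0.0
    if tone_flat:
        score += 2.0  # sub-check 1
    if no_authorial_voice:
        score += 1.0  # sub-check 2
    if mono_register:
        score += 1.0  # sub-check 3
    if sections_uniform:
        score += 1.0  # sub-check 4
    return min(score, 5.0)  # hard clamp at the maximum
```

With these weights the score takes only whole-number values, and a score of 4 or more is reachable only when sub-check 1 fires.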

Why this matters

One of the most reliable stylistic markers of AI-generated academic text is its tonal flatness. Where a human author writes Methods in a clipped, technical register, shifts to a more evaluative tone in the Discussion, and allows personal stance to surface in the Introduction, a language model applies a near-constant mixture of hedging, evaluative vocabulary, and syntactic complexity across the entire document. Markowitz and colleagues demonstrated that machine-generated text was measurably more analytic and less readable than human-written text, with the discriminating signal emerging from a combination of style features rather than any single marker [1]. C5 contributes the voice-uniformity slice of that combination.

The evaluative adjective dictionary used in sub-checks 1 and 4 targets words that signal authorial judgment, such as good, important, essential, crucial, vital, significant, major, key, critical, fundamental, central, and primary. AI models overuse these words because they signal engagement without requiring a specific factual commitment. A document that deploys these adjectives at the same density in Methods, Results, and Discussion is not varying its authorial distance across sections, a pattern that is statistically aberrant in human academic prose.

The cross-section voice check (sub-check 4) extends the logic introduced in C3's IMRaD section-conditioning: the same principle that makes per-section paragraph-length coefficient of variation informative makes per-section evaluative density informative. The sub-check was calibrated with reference to Yin and Wang's observation that section-conditioned stylistic statistics reduce topic-dependence and amplify human-AI separability [2].

Score thresholds

Score: Meaning
0 to 1: Natural variation in evaluative language density across windows and IMRaD sections. The author modulates tone to match the rhetorical purpose of each section.
2 to 3: Some voice dimensions show AI-typical uniformity: the window-level temperature is flat (worth 2.0 on its own), possibly alongside one further sub-check, or two or three of the 1.0-point sub-checks fire together. Common in well-structured but authorship-ambiguous documents.
4 to 5: Multiple voice dimensions collapse simultaneously. A score of 4 requires flat window temperature plus two of the remaining sub-checks (absent authorial stance, mono-register POS distribution, cross-section voice uniformity); a score of 5 requires all four. Highly consistent with AI-generated text and with heavily templated human writing.

Limitations

The evaluative adjective dictionary contains 15 entries per language, which is deliberately narrow to avoid capturing domain-specific technical adjectives that legitimately appear at uniform density across a paper. A document that uniformly uses evaluative adjectives outside this list (e.g., remarkable, noteworthy, compelling) will not fire sub-check 1 or 4 even if the underlying voice pattern is AI-typical.

Romanian-language documents use equivalent NLP processing where available.

The 0.008 cross-section threshold for sub-check 4 was calibrated on the standard IMRaD structure (Introduction, Methods, Results, Discussion). Documents with atypical section structures, or whose section headers do not match the classifier's pattern matching, will not fire this sub-check even if voice uniformity is present.

References

  1. Markowitz DM, Hancock JT, Bailenson JN. Linguistic markers of inherently false AI communication and intentionally false human communication: evidence from hotel reviews. Journal of Language and Social Psychology. 2024;43(1):63-82. DOI: 10.1177/0261927X231200201
  2. Yin Z, Wang S. Span-level detection of AI-generated scientific text via contrastive learning and structural calibration. arXiv preprint arXiv:2510.00890. 2025.