Perplexity CV (LLM)
Measures how uniform a text's predictability is across sentences. A language model rates each sentence for predictability, and the indicator scores the coefficient of variation of those ratings. Uniform predictability (low variation) is the machine signal, whereas human writing alternates in bursts between the surprising and the routine.
Technical description
L4-Perplexity-CV is the model-judged counterpart to the deterministic structural-variation check: where that measures variation in sentence and paragraph length, this measures variation in predictability. It asks a language model to rate each sentence from 1 to 10 for how likely a generic model would be to produce it in context, computes the coefficient of variation of those ratings, and maps a low coefficient of variation to a high score. The absolute level of predictability is recorded in the metadata but does not drive the score, because some human writing is genuinely predictable end to end; it is the absence of variation, not the height of the average, that is diagnostic. A complementary tail signal adds to the score when the text shows no genuinely surprising sentence at all, the machine signature of decoding that excludes low-probability choices. It runs at Layer 4 when a model is configured and the text has at least five sentences, capping the analysis at fifty sentences as a token-budget guard.
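The run conditions described above can be sketched as a small gate. This is a minimal illustration, not the indicator's actual code: the function names and the regex sentence splitter are assumptions, and only the five-sentence minimum, the fifty-sentence cap, and the requirement for a configured model come from the description.

```python
import re
from typing import List, Optional

MIN_SENTENCES = 5   # from the text: fewer sentences and the check does not run
MAX_SENTENCES = 50  # from the text: token-budget cap on the analysis

def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def sentences_to_rate(text: str, model_configured: bool) -> Optional[List[str]]:
    """Return the sentences to send to the model, or None when the check is skipped."""
    if not model_configured:
        return None
    sentences = split_sentences(text)
    if len(sentences) < MIN_SENTENCES:
        return None
    # A long document is judged on its first MAX_SENTENCES sentences.
    return sentences[:MAX_SENTENCES]
```

A production splitter would need to handle abbreviations, quotations, and decimals, which this regex does not.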
How it works
The sentences are extracted and sent to the model with a rubric that defines the predictability scale: 1 for highly unpredictable text (surprising word choices, idiosyncratic phrasing); 5 for moderately predictable; 10 for exactly what a generic model would output (standard phrasing, no surprise). The model returns a predictability score for each sentence.
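One way the rubric and the per-sentence reply might be handled is sketched below. The prompt wording, the JSON reply shape (`scores`, `rationale`), and both function names are assumptions made for illustration; the description only specifies the 1 to 10 scale and its anchor points.

```python
import json
from typing import List, Tuple

# Rubric text paraphrased from the scale described above; the JSON reply
# format is a hypothetical convention, not the indicator's documented one.
RUBRIC = (
    "Rate each sentence from 1 to 10 for how likely a generic language model "
    "would be to produce it in context.\n"
    "1 = highly unpredictable: surprising word choices, idiosyncratic phrasing.\n"
    "5 = moderately predictable.\n"
    "10 = exactly what a generic model would output: standard phrasing, no surprise.\n"
    'Reply with JSON only: {"scores": [one number per sentence], "rationale": "short text"}'
)

def build_prompt(sentences: List[str]) -> str:
    """Number the sentences so the reply can be aligned with them."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))
    return f"{RUBRIC}\n\nSentences:\n{numbered}"

def parse_ratings(reply: str, n_sentences: int) -> Tuple[List[float], str]:
    """Validate the reply: one rating per sentence, each within 1 to 10."""
    data = json.loads(reply)
    scores = [float(x) for x in data["scores"]]
    if len(scores) != n_sentences:
        raise ValueError("expected one rating per sentence")
    if any(not 1 <= s <= 10 for s in scores):
        raise ValueError("ratings must lie between 1 and 10")
    return scores, str(data.get("rationale", ""))
```

Validating the count and range before any statistics are computed keeps a malformed reply from silently distorting the coefficient of variation.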
Coefficient of variation. From the per-sentence scores s, the mean and the sample standard deviation are computed, and the coefficient of variation is CV = standard deviation / mean. A high coefficient of variation means the text alternates between surprising and routine sentences, the burstiness of human writing; a low coefficient of variation means every sentence is about as predictable as every other, the uniformity of machine writing.
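A minimal sketch of the statistic, using Python's `statistics.stdev` (which is the sample standard deviation). The function name and the two example rating lists are illustrative assumptions.

```python
from statistics import mean, stdev
from typing import List

def coefficient_of_variation(scores: List[float]) -> float:
    """CV = sample standard deviation / mean of the per-sentence ratings."""
    if len(scores) < 2:
        raise ValueError("need at least two ratings for a sample standard deviation")
    # Ratings lie on a 1 to 10 scale, so the mean is never zero.
    return stdev(scores) / mean(scores)

# Illustrative ratings (made up for this example).
bursty = [2, 9, 3, 8, 2, 9, 4, 10]   # alternates between surprising and routine
flat = [8, 8, 9, 8, 8, 9, 8, 8]      # every sentence about equally predictable

print(round(coefficient_of_variation(bursty), 3))  # about 0.586
print(round(coefficient_of_variation(flat), 3))    # about 0.056
```

The bursty list has a CV well above the top breakpoint, while the flat list falls below the lowest one.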
Score curve. The coefficient of variation is mapped to the 0 to 5 score by a step function, with breakpoints relaxed slightly from the deterministic structural check because model-judged predictability is noisier than direct token counts. The breakpoints are evaluated from the top down and the first match applies: CV ≥ 0.40 scores 0, CV ≥ 0.30 scores 1.0, CV ≥ 0.20 scores 2.5, CV ≥ 0.10 scores 4.0, and below 0.10 scores 5.0.
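The step function can be written directly from the breakpoints listed above. The function name is an assumption; the thresholds and scores are those given in the text.

```python
def cv_to_score(cv: float) -> float:
    """Map a coefficient of variation to the 0 to 5 score; first match wins."""
    if cv >= 0.40:
        return 0.0
    if cv >= 0.30:
        return 1.0
    if cv >= 0.20:
        return 2.5
    if cv >= 0.10:
        return 4.0
    return 5.0  # below 0.10: almost no variation in predictability
```

For example, a CV of 0.35 maps to 1.0 and a CV of 0.09 maps to 5.0.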
Tail signal. Uniformity is not the only machine signature. Human writing keeps a low-predictability tail, a few genuinely surprising sentences, while a model that excludes low-probability choices produces none. When the document has at least eight sentences and not one is rated 3 or below on the predictability scale, the score is increased by 1.0, clamped to 5. This is distinct from the average predictability the indicator otherwise ignores: a text can be predictable on average yet still carry a surprising sentence, and it is the complete absence of that tail, not the height of the mean, that this signal captures.
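The tail adjustment can be sketched as a small function applied to the score from the step curve. The function and constant names are assumptions; the eight-sentence minimum, the rating cutoff of 3, the 1.0 increment, and the clamp to 5 come from the description.

```python
from typing import List

TAIL_MIN_SENTENCES = 8   # from the text: minimum document length for the signal
TAIL_RATING_CUTOFF = 3   # from the text: ratings at or below this count as surprising

def apply_tail_signal(base_score: float, ratings: List[float]) -> float:
    """Add 1.0 (clamped to 5) when no sentence is rated as surprising."""
    has_surprising_sentence = any(r <= TAIL_RATING_CUTOFF for r in ratings)
    if len(ratings) >= TAIL_MIN_SENTENCES and not has_surprising_sentence:
        return min(5.0, base_score + 1.0)
    return base_score
```

A single sentence rated 3 or below is enough to suppress the increment, regardless of how predictable the rest of the text is.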
Findings. When the score is non-zero, the three most predictable sentences are surfaced as findings. Each is reported with its predictability rating alongside the cohort mean and the coefficient of variation, at warning severity when the score is 2.5 or higher and at informational severity below that. The metadata returns the mean and standard deviation of predictability, the coefficient of variation, the sentence counts, and the model's rationale.
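The selection of findings might look like the following sketch. The dictionary field names, the severity labels, and the function name are hypothetical; the rule of surfacing the three most predictable sentences and the 2.5 severity threshold follow the description.

```python
from statistics import mean
from typing import Dict, List

def build_findings(sentences: List[str], ratings: List[float],
                   score: float, cv: float, top_n: int = 3) -> List[Dict]:
    """Surface the most predictable sentences when the score is non-zero."""
    if score == 0:
        return []
    severity = "warning" if score >= 2.5 else "info"
    cohort_mean = mean(ratings)
    # Highest rating first; ties keep document order because the sort is stable.
    ranked = sorted(range(len(ratings)), key=lambda i: ratings[i], reverse=True)
    return [
        {
            "sentence": sentences[i],
            "rating": ratings[i],
            "cohort_mean": cohort_mean,
            "cv": cv,
            "severity": severity,
        }
        for i in ranked[:top_n]
    ]
```

Reporting each rating next to the cohort mean lets a reader see how far the flagged sentences sit above the text's typical predictability.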
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Predictability varies markedly between sentences, the bursty alternation of surprising and routine phrasing typical of human writing. |
| 2 to 3.5 | Moderately uniform predictability; some bursting, but flatter than usual. |
| 4 to 5 | Nearly uniform predictability across every sentence, the flat signature of machine-generated text. |
Why this matters
Human writing is bursty: an author alternates between sentences that are surprising in their phrasing and sentences that are routine, so the predictability of the text varies from sentence to sentence. Machine writing flattens that variation, producing sentences that are each about as predictable as the next, because the model samples from a smoothed distribution that disfavours the low-probability choices a human makes for emphasis or idiosyncrasy. Detection systems have long exploited this by pairing the average predictability of a text with its variation. The variation (the burstiness) often separates human from machine more reliably than the average alone, because a skilled writer can be predictable on average while still bursting, and a machine can hit any average while staying flat. L4-Perplexity-CV isolates that variation: it scores the uniformity of predictability and ignores its level, so a genuinely plain human text is not penalised for being plain, only for being uniformly plain in the way machine text is.
Limitations
L4-Perplexity-CV depends on a language model to rate predictability, so the ratings carry the model's own biases and noise, and a different model may rate the same sentences differently. Models also show a self-preference bias, rating text in their own low-perplexity style as more predictable than human raters would, which can inflate the uniformity signal. The coefficient-of-variation breakpoints are relaxed to absorb some of that noise but cannot remove it. The model is asked to judge how likely a generic model would be to produce each sentence, which is an introspective task it performs imperfectly. The analysis is capped at fifty sentences, so a long document is judged on a prefix. Some human writing is genuinely uniform in predictability, such as formulaic reports, boilerplate, and highly templated sections, and will score high without any machine involvement, which is why the indicator is one signal among many rather than a verdict. The indicator is slower and costlier than the deterministic structural-variation check, and its results vary between runs.
Theoretical background
L4-Perplexity-CV operationalises the burstiness half of perplexity-based detection. Early detectors visualised and scored generated text through the predictability of each token under a reference model, on the observation that machine text occupies the high-probability region more uniformly than human text; the degeneration literature gave the mechanism, showing that models under realistic decoding collapse toward high-probability continuations and away from the low-probability choices that create surprise. Where the average predictability is one axis, its variation across the text is the other, and the homogenization and diversity literature confirms that reduced variation is among the most reliable single discriminators of machine text. The indicator is deliberately the model-judged complement to the deterministic structural-variation check: that one measures the coefficient of variation of sentence and paragraph lengths from token counts at Layer 1, while this one measures the coefficient of variation of judged predictability at Layer 4, the same statistic applied to a property only a model can estimate.
References
- Gehrmann S, Strobelt H, Rush AM. GLTR: statistical detection and visualization of generated text. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations. 2019. https://aclanthology.org/P19-3019/
- Holtzman A, Buys J, Du L, Forbes M, Choi Y. The curious case of neural text degeneration. International Conference on Learning Representations (ICLR). 2020. https://arxiv.org/abs/1904.09751
- Wu C, Cheung YM, Han B, Lian D. Hidden human-like nature of machine-generated texts: theory and detection enhancement. arXiv preprint arXiv:2605.23190. 2026. https://arxiv.org/abs/2605.23190
- Garces Arias E, Sapargali N, Heumann C, Aßenmacher M. The truncation blind spot: how decoding strategies systematically exclude human-like token choices. arXiv preprint arXiv:2603.18482. 2026. https://arxiv.org/abs/2603.18482
- Wataoka K, Takahashi T, Ri R. Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819. 2024. https://arxiv.org/abs/2410.21819
- Basani AR, Chen PY. Diversity boosts AI-generated text detection. arXiv preprint arXiv:2509.18880. 2025. https://arxiv.org/abs/2509.18880