ResAIKit
Research Integrity Toolkit
L4-Perplexity-CV | Text analysis | Stylistic | Layer 4

Perplexity CV (LLM)

Measures how uniform a text's predictability is across sentences. A language model rates each sentence for predictability; uniform ratings (low variation) are treated as the machine signal, whereas human writing tends to alternate between surprising and routine sentences.

Technical description

Asks a language model to rate each sentence from 1 to 10 for how likely a generic model would be to produce it, then computes the coefficient of variation of those ratings. A low coefficient of variation (uniform predictability) maps to a high score; the absolute predictability level is recorded but does not drive the score, since some human writing is genuinely predictable end to end. A complementary tail signal adds to the score when the text contains no genuinely surprising sentence at all, the machine signature of decoding that excludes low-probability choices.
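
As a sketch of the statistic described above, write r_i for the rating of sentence i out of n rated sentences. The population form of the standard deviation is assumed here; the source does not say whether the population or the sample form is used.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} r_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (r_i - \mu)^2}, \qquad
\mathrm{CV} = \frac{\sigma}{\mu}
```

Under that assumption, the flat ratings (7, 7, 8, 7, 8) give a mean of 7.4 and CV ≈ 0.066, while the bursty ratings (2, 9, 4, 8, 7) give a mean of 6 and CV ≈ 0.435.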

How it works

Layer 4 (LLM-powered): Rates each sentence 1 to 10 for predictability, computes the coefficient of variation (standard deviation divided by mean) of the ratings, and maps it to the score by a step curve (CV >= 0.40 scores 0, >= 0.30 scores 1, >= 0.20 scores 2.5, >= 0.10 scores 4, below 0.10 scores 5). A tail signal adds 1.0 when the document has at least eight sentences and none of them is rated 3 or below. The check runs only when a model is configured and the text has at least five sentences, and it analyzes at most the first fifty sentences. The three most predictable sentences are surfaced as findings.

Why this matters

Human writing is bursty: an author alternates surprising and routine phrasing, so predictability varies sentence to sentence, while machine writing flattens that variation by sampling from a smoothed distribution. Recent work finds that the variation in predictability, and the presence of a low-predictability tail, separates human from machine text more reliably than the average predictability alone, because a skilled writer can be predictable on average while still bursting.

Score thresholds

0-1: Predictability varies markedly between sentences, the bursty pattern of human writing
2-3: Moderately uniform predictability, flatter than usual
4-5: Nearly uniform predictability with no surprising sentence, the flat signature of machine text

Limitations

Depends on a language model to rate predictability, so the ratings carry the model's biases and noise, and models show a self-preference bias toward their own low-perplexity style. The analysis is capped at fifty sentences, so a long document is judged on a prefix. Some human writing is genuinely uniform in predictability and scores high without machine involvement, so the indicator is one signal among many. It is slower and costlier than the deterministic checks, and results vary between runs.

References

  1. Gehrmann S, Strobelt H, Rush AM. (2019). GLTR: statistical detection and visualization of generated text. Proceedings of ACL 2019 (System Demonstrations)
  2. Holtzman A, Buys J, Du L, Forbes M, Choi Y. (2020). The curious case of neural text degeneration. International Conference on Learning Representations (ICLR)
  3. Wu C, Cheung YM, Han B, Lian D. (2026). Hidden human-like nature of machine-generated texts: theory and detection enhancement. arXiv preprint arXiv:2605.23190
  4. Garces Arias E, Sapargali N, Heumann C, Aßenmacher M. (2026). The truncation blind spot: how decoding strategies systematically exclude human-like token choices. arXiv preprint arXiv:2603.18482
  5. Wataoka K, Takahashi T, Ri R. (2024). Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819
  6. Basani AR, Chen PY. (2025). Diversity boosts AI-generated text detection. arXiv preprint arXiv:2509.18880