ResAIKit
Research Integrity Toolkit
E3 | Text analysis | Forensic | Layer 1 (Deterministic)

Punctuation Entropy

Measures the variety and unpredictability of punctuation usage. AI-generated text tends to use punctuation in more regular, predictable patterns than human writing.

Technical description

Computes Shannon entropy over the distribution of punctuation marks (periods, commas, semicolons, colons, dashes, parentheses, exclamation/question marks). Builds a punctuation n-gram model to measure sequence predictability. Compares the observed punctuation distribution against the expected distribution for academic text. Low entropy indicates overly regular punctuation patterns typical of AI generation.
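As a rough illustration of the entropy computation described above, here is a minimal Python sketch. The mark set `PUNCTUATION` and the function name `punctuation_entropy` are assumptions made for this example, not names taken from the toolkit. The ASCII hyphen is counted as a dash, which also picks up hyphenated words; a real implementation would probably treat those separately.

```python
import math
from collections import Counter

# Assumed mark set, based on the marks listed in the description.
# "\u2013" and "\u2014" are the en dash and em dash.
PUNCTUATION = set(".,;:-()!?") | {"\u2013", "\u2014"}


def punctuation_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the punctuation-mark distribution."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    total = sum(counts.values())
    if total == 0:
        # No punctuation at all: report zero entropy rather than fail.
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A text that uses only commas and periods in equal numbers scores exactly 1 bit, and a text that uses a single mark scores 0 bits. The maximum for k distinct marks is log2(k) bits, reached when all marks are equally frequent.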

How it works

Layer 1 (deterministic): Extracts all punctuation marks from the text. Computes Shannon entropy of the punctuation distribution. Builds bigram frequency tables for punctuation sequences. Compares against expected entropy range for human academic writing. Flags texts with entropy below the human baseline threshold.
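The Layer 1 steps above could be sketched roughly as follows. This is an illustrative outline only: the function names, the mark set, and especially the value of `HUMAN_MIN_ENTROPY` are hypothetical placeholders, since the actual baseline threshold is not stated here.

```python
import math
from collections import Counter

PUNCTUATION = set(".,;:-()!?")  # assumed mark set
HUMAN_MIN_ENTROPY = 1.5  # hypothetical placeholder, not a documented value


def extract_marks(text):
    """Step 1: pull out punctuation marks in reading order."""
    return [ch for ch in text if ch in PUNCTUATION]


def entropy(counts):
    """Shannon entropy in bits of a Counter of observations."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bigram_table(marks):
    """Step 3: frequency table of adjacent punctuation pairs."""
    return Counter(zip(marks, marks[1:]))


def analyze(text, threshold=HUMAN_MIN_ENTROPY):
    marks = extract_marks(text)
    unigram_h = entropy(Counter(marks))   # Step 2: distribution entropy
    bigrams = bigram_table(marks)
    return {
        "n_marks": len(marks),
        "entropy_bits": unigram_h,
        "bigram_entropy_bits": entropy(bigrams),
        # Steps 4 and 5: flag when entropy falls below the assumed baseline.
        "flagged": len(marks) > 0 and unigram_h < threshold,
    }
```

For example, `analyze("One, two, three. Four, five, six.")` finds six marks (four commas, two periods), an entropy of about 0.92 bits, and flags the text under the placeholder threshold. On a sample this small the flag is not meaningful, so a practical version would likely require a minimum number of marks before reporting anything.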

Why this matters

Human writers use punctuation with natural variation: some sentences chain multiple clauses with semicolons, others are short and punchy. AI models tend to produce more consistent punctuation patterns, favoring commas and periods while underusing semicolons, colons, dashes, and parentheses. This regularity is measurable through information-theoretic metrics such as Shannon entropy.

Score thresholds

0-1: Rich, varied punctuation usage matching human patterns
2-3: Somewhat regular punctuation with reduced variety
4-5: Highly predictable punctuation patterns with minimal variety

Limitations

Very short texts yield unreliable entropy measurements because they contain too few punctuation marks to estimate the distribution. Some academic styles (especially in STEM fields) naturally use simpler punctuation. Technical writing with many equations may have unusual punctuation distributions.