E2 | Text analysis | Forensic | Layer 1 (Deterministic)

Encoding Artifacts

Detects encoding corruption patterns (mojibake), replacement characters, broken ligatures, and whitespace anomalies. AI-generated text is paradoxically clean; human text carries the scars of multiple tools, encoding conversions, and imperfect typing.

Technical description

E2 scans for nine categories of encoding and whitespace artifacts that distinguish human from machine-generated text. The core insight is asymmetric: human text accumulates encoding artifacts through multi-tool pipelines (word processor -> Portable Document Format (PDF) -> copy-paste -> plain text), while large language model (LLM) output is generated directly in clean UTF-8. Absence of artifacts is itself a signal.
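The pipeline effect described above can be reproduced in a few lines. This is an illustrative sketch rather than E2's own code: it assumes a single hop in which UTF-8 bytes are wrongly decoded as Windows-1252 (CP1252), and the sample string is made up for the demonstration.

```python
# Hypothetical sample: "café" followed by a word in curly single quotes.
original = "caf\u00e9 \u2019quoted\u2019"

# Hop 1: one tool saves the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Hop 2: a later tool wrongly decodes those bytes as CP1252.
corrupted = utf8_bytes.decode("cp1252")

# Undoing the wrong decode recovers the original text.
repaired = corrupted.encode("cp1252").decode("utf-8")

print(corrupted)
print(repaired == original)
```

Text generated directly as UTF-8 never passes through the second hop, which is why its absence of such sequences is treated as informative.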

How it works

1. Mixed line endings. Counts carriage-return-plus-line-feed (CRLF), line feed (LF), and bare carriage return (CR). Mixed endings (+0.5) suggest text assembled from multiple sources. Uniform endings are expected from LLM output.

2. Double spaces. The "paradoxical cleanliness" sub-check. Human writers routinely insert accidental double spaces during typing and editing. Zero double spaces in a long text (>2000 words: +1.5; >500 words: +0.5) is statistically aberrant and strongly associated with machine-generated text.

3. Trailing whitespace. A human signal that reduces the score by 0.3. Trailing spaces at line ends are a typing and editing artifact that LLM output rarely contains.

4. Tab vs space indentation uniformity. Perfect consistency (all spaces or all tabs across >10 indented lines) contributes +0.3.

5. Indentation width uniformity. A coefficient of variation (CV) of exactly 0 in indentation width across >10 indented paragraphs contributes +0.5.

6. Mojibake patterns. Detects 20 common UTF-8/CP1252 corruption pairs: Ã© → é, â€™ → ', â€œ → ", â€“ → -- (en dash), â€” → -- (em dash), â€¢ → *, â€¦ → ..., etc. Each match contributes +0.5, capped at +2.0. Mojibake is definitive evidence of encoding pipeline corruption.

7. Replacement character U+FFFD. The Unicode replacement character (�) indicates encoding failure during text conversion. Each occurrence contributes +0.3, capped at +1.5.

8. Unicode ligatures. Detects 9 common ligatures (ﬁ, ﬂ, ﬀ, ﬃ, ﬄ, Œ, œ, Æ, æ). Ligatures originate from PDF or word-processor typesetting engines, not from LLM token generation. Their presence is a human signal and reduces the score by 0.5 on texts over 500 words.

9. Latin-1 high-byte corruption scan. Detects high density of Latin-1 high-byte sequences (0xC0-0xFF followed by 0x80-0xBF). Density above 2% of word count contributes +1.0, indicating double-encoding or incorrect character set conversion.
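Several of the sub-checks above can be sketched as small scoring functions. This is a minimal sketch under stated assumptions, not E2's implementation: the function names are hypothetical, only seven of the 20 mojibake pairs are included (the full list is not given here), "each match" is read as each occurrence, and check 9 is applied to decoded code points rather than raw bytes.

```python
import re

# Mojibake forms are generated rather than typed, so the table cannot be
# damaged by an editor. Assumption: 7 sample characters, not the full 20.
_SOURCE_CHARS = "\u00e9\u2019\u201c\u2013\u2014\u2022\u2026"
MOJIBAKE = {c.encode("utf-8").decode("cp1252"): c for c in _SOURCE_CHARS}

# A character in U+00C0-U+00FF followed by one in U+0080-U+00BF: the
# footprint of a UTF-8 two-byte sequence decoded as Latin-1.
HIGH_BYTE_PAIR = re.compile("[\u00c0-\u00ff][\u0080-\u00bf]")


def line_ending_score(text: str) -> float:
    """Check 1: +0.5 when more than one line-ending style is present."""
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf  # LF not part of a CRLF
    cr = text.count("\r") - crlf  # CR not part of a CRLF
    styles = sum(1 for n in (crlf, lf, cr) if n > 0)
    return 0.5 if styles > 1 else 0.0


def double_space_score(text: str) -> float:
    """Check 2: a long text that contains no double space at all."""
    if "  " in text:
        return 0.0
    words = len(text.split())
    if words > 2000:
        return 1.5
    return 0.5 if words > 500 else 0.0


def mojibake_score(text: str) -> float:
    """Check 6: +0.5 per occurrence (assumed), capped at +2.0."""
    matches = sum(text.count(bad) for bad in MOJIBAKE)
    return min(0.5 * matches, 2.0)


def replacement_char_score(text: str) -> float:
    """Check 7: +0.3 per U+FFFD, capped at +1.5."""
    return min(0.3 * text.count("\ufffd"), 1.5)


def high_byte_score(text: str) -> float:
    """Check 9: +1.0 when pair density exceeds 2% of the word count."""
    words = len(text.split())
    if words == 0:
        return 0.0
    pairs = len(HIGH_BYTE_PAIR.findall(text))
    return 1.0 if pairs / words > 0.02 else 0.0
```

Building the pattern table with `encode("utf-8").decode("cp1252")` is a design choice for the sketch: it keeps the source file pure ASCII and makes the relationship between each corrupted form and its intended character explicit.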

Score thresholds

Score	Meaning
0 to 1	Normal encoding profile with typical human artifacts: occasional double spaces, trailing whitespace, maybe a ligature.
2 to 3	Suspiciously clean (no double spaces in long text) or moderate encoding issues detected.
4 to 5	Multiple encoding corruption signals: mojibake patterns, replacement characters, high Latin-1 corruption density. Strong evidence of multi-tool pipeline corruption or deliberate encoding manipulation.
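How the individual contributions combine into the banded score is not spelled out above, so the following is only one plausible reading. It assumes the signed contributions are summed and clamped to the 0 to 5 range, and that fractional totals falling between the listed bands (for example 1.7) belong to the lower band. The function names and band labels are hypothetical.

```python
from typing import Iterable


def total_score(contributions: Iterable[float]) -> float:
    """Sum signed sub-check contributions and clamp to the 0-5 scale (assumed)."""
    return max(0.0, min(5.0, sum(contributions)))


def band(score: float) -> str:
    """Map a score to a band; the cut points at 2 and 4 are an assumption."""
    if score < 2.0:
        return "normal encoding profile"
    if score < 4.0:
        return "suspiciously clean or moderate issues"
    return "multiple corruption signals"


# A long, perfectly clean text: no double spaces (+1.5), uniform
# indentation style (+0.3), uniform indentation width (+0.5).
clean_text_score = total_score([1.5, 0.3, 0.5])
print(band(clean_text_score))
```

Under these assumptions a long text with no artifacts at all lands in the middle band without any corruption signal firing, which matches the "suspiciously clean" row of the table.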

Limitations

The mojibake pattern list covers the 20 most common UTF-8/CP1252 corruption pairs but is not exhaustive. Less common corruption patterns (UTF-8 via ISO-8859-1, Shift-JIS mojibake, GBK corruption) are not detected. The paradoxically-clean double-space check is a statistical heuristic: a well-edited human document can legitimately have zero double spaces, and an LLM prompted to include typos can fake them.
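The first limitation can be seen by decoding the same UTF-8 bytes with two different wrong decoders. The sketch below is illustrative only; `has_cp1252_pattern` is a hypothetical stand-in for a single entry of the pattern list.

```python
smart_quote = "\u2019"             # right single quotation mark
raw = smart_quote.encode("utf-8")  # b'\xe2\x80\x99'

via_cp1252 = raw.decode("cp1252")   # the form a CP1252 pattern list targets
via_latin1 = raw.decode("latin-1")  # same bytes, ISO-8859-1 instead


def has_cp1252_pattern(text: str) -> bool:
    """Hypothetical single-pattern check for the CP1252 form of U+2019."""
    return "\u00e2\u20ac\u2122" in text


print(has_cp1252_pattern(via_cp1252))
print(has_cp1252_pattern(via_latin1))
```

The ISO-8859-1 route turns the second and third bytes into invisible C1 control characters (U+0080 and U+0099) instead of the euro and trademark signs, so a list built from CP1252 pairs does not match it.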

The ligature list covers 9 common Unicode ligatures but not all typographic variants (e.g., st, ct ligatures from older fonts).
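One possible way to widen ligature coverage, offered as a sketch and not as something E2 is stated to do, is to test against the Unicode Alphabetic Presentation Forms range U+FB00 to U+FB06 instead of a fixed list. That range includes the st ligature; characters such as œ and æ lie outside it and would still need to be listed separately. The function names are hypothetical.

```python
import unicodedata

# Latin ligatures in Alphabetic Presentation Forms: U+FB00 (ff) to U+FB06 (st).
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def find_ligatures(text: str) -> list:
    """Return every presentation-form ligature character found in text."""
    return [ch for ch in text if ord(ch) in LIGATURE_RANGE]


def expand_ligatures(text: str) -> str:
    """Replace each ligature with its component letters via NFKC."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LIGATURE_RANGE else ch
        for ch in text
    )


sample = "e\ufb03cient \ufb06yle"  # contains the ffi and st ligatures
print(find_ligatures(sample))
print(expand_ligatures(sample))
```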

References

The mojibake patterns and encoding corruption heuristics are derived from standard Unicode/character-set conversion documentation and empirical analysis of text extraction pipelines. No specific 2024-2026 academic literature was found that directly addresses encoding artifacts as AI detection signals; this indicator relies on first-principles encoding analysis rather than published detection research.