Diacritics Forensics
Measures how diacritics are encoded rather than whether they are linguistically correct (E7) or which anomalous codepoints appear (E1). Flags decomposed (NFD) Unicode, stacked Zalgo-style combining marks, and orphan combining marks that signal machine processing, corruption, or filter-evasion obfuscation.
Technical description
Language-agnostic scan over Unicode character categories. Identifies nonspacing combining marks (category Mn) and computes three signals: (1) decomposition density -- the count of characters that NFC normalization would recompose, divided by text length, scored up to +3.0; (2) stacked combining runs -- sequences of two or more consecutive nonspacing marks on a single base, indicating Zalgo-style corruption or obfuscation, scored up to +3.0; (3) orphan combining marks -- nonspacing marks preceded by whitespace, a control character, or another mark rather than a base letter, indicating broken encoding, scored up to +2.0. The three sub-scores can sum to 8.0, so the raw total is capped at 5.0. Distinct from E1, which only flags a binary NFC mismatch, and from E7, which checks per-language diacritic correctness.
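The three signals can be sketched in Python with the standard unicodedata module. This is a minimal illustration, not the detector's actual code. The description gives only the per-signal caps, so the scaling constants (DENSITY_SCALE, PER_STACKED_RUN, PER_ORPHAN) and all function names are hypothetical. Two further assumptions: the number of recomposable characters is approximated by the length a string loses under NFC, and a run of marks counts as orphaned when it starts the text or follows whitespace or a control character.

```python
import unicodedata
from typing import Optional, Tuple

# Assumed scaling constants: the description states only the caps
# (+3.0, +3.0, +2.0, and 5.0 overall), not how fast each signal grows.
DENSITY_SCALE = 60.0    # hypothetical: 5% recomposable characters reaches +3.0
PER_STACKED_RUN = 1.0   # hypothetical points per stacked run
PER_ORPHAN = 1.0        # hypothetical points per orphan mark


def is_mark(ch: str) -> bool:
    """True for nonspacing combining marks (Unicode category Mn)."""
    return unicodedata.category(ch) == "Mn"


def lacks_base(prev: Optional[str]) -> bool:
    """True when the character before a mark run cannot act as a base."""
    return prev is None or prev.isspace() or unicodedata.category(prev) == "Cc"


def diacritic_signals(text: str) -> Tuple[float, int, int]:
    """Return (decomposition density, stacked runs, orphan marks)."""
    # Approximate "characters NFC would recompose" by the length lost to NFC.
    nfc = unicodedata.normalize("NFC", text)
    density = max(0, len(text) - len(nfc)) / len(text) if text else 0.0

    stacked_runs = orphans = 0
    i = 0
    while i < len(text):
        if not is_mark(text[i]):
            i += 1
            continue
        # Find the end of this run of consecutive marks.
        j = i
        while j < len(text) and is_mark(text[j]):
            j += 1
        prev = text[i - 1] if i > 0 else None
        if lacks_base(prev):
            orphans += j - i          # every mark in a baseless run is an orphan
        elif j - i >= 2:
            stacked_runs += 1         # two or more marks on one base
        i = j
    return density, stacked_runs, orphans


def raw_score(text: str) -> float:
    """Sum the three capped sub-scores and cap the total at 5.0."""
    density, stacked_runs, orphans = diacritic_signals(text)
    total = (
        min(3.0, density * DENSITY_SCALE)
        + min(3.0, stacked_runs * PER_STACKED_RUN)
        + min(2.0, orphans * PER_ORPHAN)
    )
    return min(5.0, total)
```

Under these assumptions, fully precomposed text such as "caf\u00e9" scores 0.0, while a string made of repeated Zalgo-style clusters reaches the 5.0 cap.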
How it works
Layer 1 (deterministic), language-agnostic. Identifies nonspacing combining marks (Unicode category Mn) and runs three checks: (1) decomposition density -- how many characters NFC normalization would recompose, scored by density; (2) stacked combining marks -- base characters carrying two or more nonspacing marks (Zalgo / obfuscation); (3) orphan combining marks -- marks with no valid base character. Scores are summed and capped at 5.0.
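Check (1) relies on the fact that the same visible character can be stored either as one precomposed code point or as a base letter followed by a combining mark. The short Python sketch below shows that difference using the standard unicodedata module. The helper names is_nfc and codepoints are illustrative only and are not part of the detector.

```python
import unicodedata
from typing import List

precomposed = "\u00e9"   # "é" as a single code point (NFC form)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (NFD form)


def is_nfc(text: str) -> bool:
    """True when NFC normalization would leave the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


def codepoints(text: str) -> List[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(precomposed), is_nfc(precomposed))
print(codepoints(decomposed), is_nfc(decomposed))
# The two strings render identically but compare unequal until normalized.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

Both strings display the same glyph, which is why the check inspects code points and categories rather than rendered text.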
Why this matters
Clean human-authored documents typically use precomposed (NFC) diacritics. Text assembled by machines or passed through several tools often arrives decomposed, and adversarial actors stack or detach combining marks to corrupt text or evade filters. Because E1 already covers anomalous codepoints and E7 covers per-language diacritic correctness, E8 isolates the orthogonal signal of diacritic encoding, measured as a density rather than a binary flag.
Score thresholds
- 0-1: Precomposed (NFC) text; no abnormal combining marks.
- 2-3: Some decomposed characters or isolated combining-mark anomalies.
- 4-5: Heavy decomposition, stacked Zalgo-style marks, or orphan combining marks (machine assembly, corruption, or obfuscation).
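A raw score can be mapped to these bands with a small lookup, sketched here in Python. The thresholds table lists integer ranges, while the raw total is a capped sum that may be fractional. The cut points at 2.0 and 4.0 are therefore an assumption about how values between bands are resolved, and the function name and band labels are hypothetical.

```python
def score_band(score: float) -> str:
    """Map a raw score in [0, 5] to one of the three documented bands.

    Assumption: fractional scores fall into the band below the next
    integer boundary (for example 1.5 stays in the 0-1 band).
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError("score must lie between 0 and 5")
    if score < 2.0:
        return "0-1: precomposed, no abnormal combining marks"
    if score < 4.0:
        return "2-3: some decomposition or isolated anomalies"
    return "4-5: heavy decomposition, stacked or orphan marks"
```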
Limitations
E8 deliberately ignores the linguistic correctness of diacritics (handled per-language by E7) and anomalous codepoints such as homoglyphs and zero-width characters (handled by E1). Some scripts and transliteration schemes legitimately use combining marks (for example IPA or romanized Sanskrit), so a few decomposed characters alone are weak evidence and the density threshold is kept conservative. Texts shorter than 20 words are skipped.
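The Python sketch below illustrates two of these points under stated assumptions. First, an IPA symbol such as syllabic n (n followed by U+0329) has no precomposed form, so it carries a combining mark yet loses nothing under NFC. A density based on what NFC would recompose would therefore not count it, whereas a naive count of Mn characters would. Second, the 20-word minimum is shown as a simple gate. Splitting words on whitespace is an assumption, and all names are illustrative.

```python
import unicodedata

MIN_WORDS = 20  # from the text: shorter inputs are skipped


def should_scan(text: str) -> bool:
    """Gate on length. Assumption: words are whitespace-separated tokens."""
    return len(text.split()) >= MIN_WORDS


def mark_count(text: str) -> int:
    """Naive count of nonspacing combining marks (category Mn)."""
    return sum(unicodedata.category(ch) == "Mn" for ch in text)


def recomposable_count(text: str) -> int:
    """Approximate count of characters NFC would absorb (length lost to NFC)."""
    return max(0, len(text) - len(unicodedata.normalize("NFC", text)))


ipa_syllabic_n = "n\u0329"   # legitimate combining mark, no precomposed form
decomposed_e = "e\u0301"     # has the precomposed equivalent U+00E9

print(mark_count(ipa_syllabic_n), recomposable_count(ipa_syllabic_n))
print(mark_count(decomposed_e), recomposable_count(decomposed_e))
```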
References
- Creo A, Pudasaini S. (2025). SilverSpeak: evading AI-generated text detectors using homoglyphs. Workshop on Detecting AI-Generated Content (COLING 2025)
- Hellmeier M. (2025). Security and detectability analysis of Unicode text watermarking methods against large language models. arXiv preprint arXiv:2512.13325
- Unicode Consortium. (2024). UTS #39: Unicode security mechanisms