Diacritics Forensics
Diacritics Forensics (E8) looks at how diacritics are encoded rather than whether they are spelled correctly. It flags text whose accents are broken apart into separate marks, stacked into Zalgo-style clutter, or left floating with no letter to attach to, all of which point to machine processing, corruption, or deliberate obfuscation.
Technical description
E8 runs on every document regardless of language and fills a narrow, deliberate niche: the encoding of diacritics. Two neighbouring indicators cover the ground on either side. The Unicode Anomalies indicator (E1) flags unusual characters such as look-alike letters, zero-width characters, and variation selectors, while the Locale Specific Signals indicator (E7) checks whether each language's accents are spelled correctly, for example Romanian comma-below against cedilla, or Hungarian double-acute against tilde. E8 asks only one further question: are the accents that are present stored as clean single characters, or as separate combining marks that are decomposed, stacked, or orphaned?
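The distinction between a single character and a letter plus a separate combining mark can be seen with Python's standard `unicodedata` module. This is only a small illustration of the two storage forms, not part of the indicator itself.

```python
import unicodedata

composed = "\u00e9"      # é stored as one precomposed code point
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but are different code point sequences.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# NFC normalisation folds the letter and its mark back into one character.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# A non-zero canonical combining class identifies U+0301 as a combining mark.
print(unicodedata.combining("\u0301"))  # 230
```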
How it works
E8 looks for combining marks, the small accent pieces that attach to a base letter, and runs three checks.
1. Decomposition density. Clean, human-written text stores an accented letter as a single character. Text assembled by machine or passed through several tools often arrives with the letter and its accent split into two pieces. E8 counts how many characters would collapse back into a single character if the text were normalised, then scores the result by how dense those split characters are. Unlike a simple yes-or-no normalisation flag, this reports a graded density, which separates a single stray split character from text that is split apart throughout.
2. Stacked marks (Zalgo). Two or more accents piled onto a single letter, the effect known as "Zalgo" text, do not occur in clean academic prose. Stacking is used to corrupt text or to slip past keyword and moderation filters by spreading a word across many characters. Each stacked run adds to the score.
3. Orphan marks. An accent with no letter to attach to, appearing at the very start of the text or after a space or a line break, points to broken encoding or corruption. Each orphan mark adds to the score.
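
The three checks above might be sketched as follows. This is a minimal illustration under stated assumptions, not the indicator's actual code: the function names are hypothetical, "combining mark" is taken to mean Unicode categories Mn and Me, and the length difference under NFC normalisation is used as a proxy for the number of split characters.

```python
import unicodedata

def is_mark(ch: str) -> bool:
    # Assumption: nonspacing (Mn) and enclosing (Me) marks count as combining marks.
    return unicodedata.category(ch) in ("Mn", "Me")

def decomposition_density(text: str) -> float:
    """Share of code points that NFC would fold into a precomposed character."""
    if not text:
        return 0.0
    composed = unicodedata.normalize("NFC", text)
    # Each mark that composes with its base shortens the string by one.
    collapsed = max(0, len(text) - len(composed))
    return collapsed / len(text)

def stacked_runs(text: str, min_marks: int = 2) -> int:
    """Number of runs of at least `min_marks` consecutive combining marks."""
    runs, current = 0, 0
    for ch in text:
        if is_mark(ch):
            current += 1
            if current == min_marks:  # count each run once, when it qualifies
                runs += 1
        else:
            current = 0
    return runs

def orphan_marks(text: str) -> int:
    """Marks at the very start of the text or directly after whitespace."""
    count, prev = 0, None
    for ch in text:
        if is_mark(ch) and (prev is None or prev.isspace()):
            count += 1
        prev = ch
    return count

print(decomposition_density("e\u0301"))      # 0.5
print(stacked_runs("a\u0301\u0302\u0303 b")) # 1
print(orphan_marks("\u0301a \u0302b"))       # 2
```

In this sketch the stacking and orphan checks run on the text as stored, so a decomposed letter that legitimately carries two marks would register as both split and stacked; a production version might handle that case differently.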
The three results are added together and capped at 5, the top of the score scale. Very short passages are skipped.
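
One way the combination step could look is sketched below. The minimum length, the density cut-offs, and the function names are all hypothetical placeholders, since the source does not state them; only the cap at the top of the score table comes from the document.

```python
from typing import Optional

MIN_CHARS = 40   # hypothetical: passages shorter than this are skipped
MAX_SCORE = 5    # top of the score table

def density_points(density: float) -> int:
    # Hypothetical bands; the indicator's real density cut-offs are not stated.
    if density >= 0.05:
        return 3
    if density >= 0.01:
        return 2
    if density > 0.0:
        return 1
    return 0

def combine(n_chars: int, density: float, stacked: int, orphans: int) -> Optional[int]:
    """Add the three check results and cap the total; None means 'skipped'."""
    if n_chars < MIN_CHARS:
        return None
    raw = density_points(density) + stacked + orphans
    return min(MAX_SCORE, raw)

print(combine(10, 0.5, 3, 3))     # None (too short to score)
print(combine(1000, 0.02, 1, 0))  # 3
print(combine(1000, 0.2, 4, 2))   # 5 (capped)
```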
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Accents are stored as clean single characters, with no abnormal marks. |
| 2 to 3 | Some split characters or isolated mark anomalies, which can come from assembling text out of several tools. |
| 4 to 5 | Heavy splitting, stacked Zalgo-style marks, or orphan marks, all strong evidence of machine assembly, corruption, or obfuscation. |
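
For readers who consume the score programmatically, the table can be expressed as a small lookup. The short band labels here are hypothetical names chosen for illustration; the ranges are the ones in the table.

```python
def band(score: int) -> str:
    """Map a 0 to 5 score onto the three bands of the table."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if score <= 1:
        return "clean"            # single characters, no abnormal marks
    if score <= 3:
        return "some anomalies"   # some splitting or isolated mark anomalies
    return "heavy anomalies"      # heavy splitting, stacking, or orphan marks

print([band(s) for s in range(6)])
```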
Why this matters
Character-level tricks are now a documented way to slip AI-written text past detectors. One 2025 study (Creo and Pudasaini) showed that swapping letters for look-alike characters collapsed the performance of seven leading detectors, and recommended cleaning text back to a standard form as the main defence. Manipulating combining marks is the encoding-level companion to that look-alike attack: where look-alikes swap whole letters, which is the territory of E1, splitting and stacking accents exploit the fact that the same visible character can be stored in more than one way. A 2025 analysis of ten text-watermarking methods (Hellmeier) found that whitespace-based and accent-based encodings leave statistical traces that set them apart from clean text, and that catching them depends on measuring how far the text departs from a standard form rather than simply whether it does. The Unicode security guidelines (UTS #39) exist for the same reason: strings that look identical can hide adversarial structure. E8 turns that idea into a graded forensic signal rather than a yes-or-no check. Separating this question from E1 and E7 also stops the same accent, the same hidden character, and the same split letter from being counted by two indicators at once.
Limitations
E8 says nothing about whether accents are spelled correctly, which is the job of E7, nor about unusual characters such as look-alike letters and zero-width characters, which belong to E1. Some writing systems and transliteration schemes legitimately use combining marks, so a few split characters on their own are weak evidence, and the density threshold is kept conservative. Because the check runs on all text regardless of language, it does not try to tell a legitimately decomposed script apart from an adversarial one beyond the density, stacking, and orphan tests. Very short passages are skipped.
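As an illustration of legitimate combining marks, some letter and accent pairs have no precomposed form in Unicode, so they can only be written with a separate mark. The sketch below uses Python's standard `unicodedata` module. It suggests that, if decomposition is counted as described above (characters that would collapse under normalisation), such pairs would not add to the density; that reading is an inference from the description, not a confirmed detail of the indicator.

```python
import unicodedata

# x + U+0304 COMBINING MACRON: Unicode has no precomposed "x with macron".
x_macron = "x\u0304"
# e + U+0301 COMBINING ACUTE ACCENT: a precomposed é (U+00E9) exists.
e_acute = "e\u0301"

nfc_x = unicodedata.normalize("NFC", x_macron)
nfc_e = unicodedata.normalize("NFC", e_acute)

print(len(nfc_x))  # 2, nothing to collapse
print(len(nfc_e))  # 1, folds to U+00E9
```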
References
- Creo A, Pudasaini S. SilverSpeak: evading AI-generated text detectors using homoglyphs. Workshop on Detecting AI-Generated Content (COLING 2025). 2025. https://aclanthology.org/2025.genaidetect-1.1/
- Hellmeier M. Security and detectability analysis of Unicode text watermarking methods against large language models. arXiv preprint arXiv:2512.13325. 2025. https://arxiv.org/abs/2512.13325
- Unicode Consortium. UTS #39: Unicode security mechanisms. https://www.unicode.org/reports/tr39/
- Imperceptible jailbreaking against large language models. arXiv preprint arXiv:2510.05025. 2025. https://arxiv.org/abs/2510.05025