Unicode Anomalies
Detects unusual Unicode characters hidden in text, such as look-alike letters from other alphabets or invisible formatting characters that can indicate copy-pasting from AI tools.
Technical description
Scans text for non-standard Unicode code points, including: homoglyphs (visually identical characters from different scripts, e.g., Cyrillic 'а' U+0430 vs Latin 'a' U+0061), zero-width characters (zero-width joiner, zero-width non-joiner, zero-width space), bidirectional override characters, unusual whitespace variants (thin space, hair space, em space), and control characters. Counts anomalies per 1,000 characters and records the position of each flagged character.
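The per-1,000-character count can be illustrated with a minimal sketch. The table `SUSPECT_CODE_POINTS` and the function `anomaly_density` are hypothetical names chosen for this example, and the table holds only a handful of the code points named above; it is not the detector's actual list.

```python
# Illustrative, non-exhaustive table (an assumption for this sketch):
# a few of the code points mentioned in the description above.
SUSPECT_CODE_POINTS = {
    "\u0430": "homoglyph",           # Cyrillic small a, looks like Latin a
    "\u200b": "zero-width",          # zero-width space
    "\u200c": "zero-width",          # zero-width non-joiner
    "\u200d": "zero-width",          # zero-width joiner
    "\u202d": "bidi-override",       # left-to-right override
    "\u202e": "bidi-override",       # right-to-left override
    "\u2003": "unusual-whitespace",  # em space
    "\u2009": "unusual-whitespace",  # thin space
    "\u200a": "unusual-whitespace",  # hair space
}


def anomaly_density(text):
    """Return the number of flagged characters per 1000 characters."""
    if not text:
        return 0.0  # avoid dividing by zero on empty input
    hits = sum(1 for ch in text if ch in SUSPECT_CODE_POINTS)
    return 1000 * hits / len(text)


# 100 characters in total, one of them a zero-width space.
sample = "Hello\u200b world" + "x" * 88
print(anomaly_density(sample))
```

Normalizing by length keeps short and long documents comparable: one stray character in a 100-character snippet weighs far more than one in a 10,000-character essay.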
How it works
Layer 1 (deterministic): Iterates through each character, checking its Unicode category and code point. Flags homoglyphs by comparing each character against a mapping of look-alike characters across scripts. Detects zero-width characters, bidirectional override characters, unusual whitespace, and control characters. Reports the exact position and Unicode name of each anomaly.
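The scan described above might look roughly like the following sketch, which uses Python's standard `unicodedata` module. The `HOMOGLYPHS` table, the `scan` function, and the category labels are assumptions made for illustration; a production mapping would be far larger, and treating every format character (category Cf) as suspicious is a simplification.

```python
import unicodedata

# Hypothetical, tiny look-alike table: confusable character -> Latin letter it imitates.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def scan(text):
    """Return one finding per anomalous character, in order of position."""
    findings = []
    for pos, ch in enumerate(text):
        category = unicodedata.category(ch)
        if ch in HOMOGLYPHS:
            kind = "homoglyph"
        elif category == "Cf":
            # Format characters: zero-width characters, bidi controls, soft hyphen.
            kind = "invisible-format"
        elif category == "Zs" and ch != " ":
            # Space separators other than the ordinary ASCII space.
            kind = "unusual-whitespace"
        elif category == "Cc" and ch not in "\t\n\r":
            # Control characters other than tab and line breaks.
            kind = "control"
        else:
            continue
        findings.append({
            "position": pos,
            "codepoint": f"U+{ord(ch):04X}",
            # Control characters have no Unicode name, so supply a fallback.
            "name": unicodedata.name(ch, "<unnamed>"),
            "kind": kind,
        })
    return findings


for finding in scan("p\u0430y\u200bpal"):
    print(finding)
```

Because every check is a table lookup or a category test, the same input always yields the same findings, which is what makes this layer deterministic.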
Why this matters
AI-generated text sometimes contains unusual Unicode characters inherited from training data or introduced when copying from web interfaces. Homoglyphs can be used to bypass plagiarism detection, and invisible characters may be watermarking artifacts from AI platforms. These characters are either invisible or visually indistinguishable from ordinary letters, so a reader will not notice them, yet they can reveal the text's digital provenance.
Score thresholds
- 0-1: Standard ASCII/Latin characters only
- 2-3: A few unusual characters, possibly from copy-paste
- 4-5: Multiple homoglyphs or invisible characters indicating manipulation
Limitations
Multilingual documents legitimately contain non-Latin characters, mathematical symbols and special notation rely on extended Unicode, and some word processors insert special whitespace characters during formatting. A flagged character is therefore not, on its own, evidence of manipulation.