ResAIKit
Research Integrity Toolkit
E2 | Text analysis | Forensic | Layer 1 (Deterministic)

Encoding Artifacts

Detects encoding conversion errors such as 'Ã©' in place of 'é' or 'â€™' in place of an apostrophe, which indicate that the text passed through multiple digital systems.

Technical description

Identifies mojibake patterns (UTF-8 bytes misinterpreted as Latin-1 or Windows-1252), HTML entities left undecoded (&amp;, &nbsp;, &#39;), escaped Unicode sequences (\u0027), and double-encoded characters. Uses pattern matching for the most common encoding corruption sequences and reports the likely original character and encoding path.
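To make the "encoding path" idea concrete, here is a minimal Python sketch of how mojibake arises and how one layer of it can be undone. It is an illustration only, not the toolkit's implementation: the function names garble, likely_original, and encoding_path are hypothetical, and the sketch assumes the wrong codec was Windows-1252 (Python's "cp1252").

```python
from typing import List, Optional


def garble(text: str, wrong_codec: str = "cp1252") -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def likely_original(fragment: str, wrong_codec: str = "cp1252") -> Optional[str]:
    """Undo one layer of mojibake.

    Returns None when the round trip fails or changes nothing, which
    suggests the fragment is not mojibake for this codec pair.
    """
    try:
        repaired = fragment.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
    return repaired if repaired != fragment else None


def encoding_path(fragment: str, max_layers: int = 3) -> List[str]:
    """Peel off layers one at a time; more than one step means double encoding."""
    path = [fragment]
    while len(path) <= max_layers:
        step = likely_original(path[-1])
        if step is None:
            break
        path.append(step)
    return path
```

With these definitions, garble("é") yields 'Ã©' and garble("’") yields 'â€™'. Applying garble twice to "é" gives 'ÃƒÂ©', and encoding_path on that string walks back through 'Ã©' to 'é', so a path longer than two entries points to double encoding.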

How it works

Layer 1 (deterministic): Matches text against known mojibake patterns (e.g., 'Ã©' is UTF-8 'é' read as Latin-1), detects undecoded HTML entities, identifies escaped Unicode sequences, and checks for double-encoding artifacts. Reports exact positions and likely original characters.
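The deterministic pass described above could be sketched roughly as follows. This is an assumption-laden illustration rather than the actual detector: the seed character list, the Finding record, and the scan function are all hypothetical, and a real pattern table would presumably be much larger.

```python
import html
import re
from typing import List, NamedTuple

# Hypothetical seed list: e-acute, e-grave, u-umlaut, right single quote,
# left double quote. A real detector would likely cover far more characters.
_SEED_CHARS = "\u00e9\u00e8\u00fc\u2019\u201c"

# Map each corrupted form (UTF-8 bytes read as Windows-1252) to its original.
MOJIBAKE = {c.encode("utf-8").decode("cp1252"): c for c in _SEED_CHARS}

PATTERNS = [
    ("mojibake", re.compile("|".join(re.escape(k) for k in MOJIBAKE))),
    ("html_entity",
     re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")),
    ("escaped_unicode", re.compile(r"\\u([0-9A-Fa-f]{4})")),
]


class Finding(NamedTuple):
    start: int            # index of the first character of the artifact
    end: int              # index one past the last character
    kind: str
    found: str
    likely_original: str


def _decode(kind: str, match: "re.Match") -> str:
    """Best guess at the character the artifact stood for."""
    if kind == "mojibake":
        return MOJIBAKE[match.group(0)]
    if kind == "html_entity":
        # html.unescape leaves unknown entity names unchanged.
        return html.unescape(match.group(0))
    return chr(int(match.group(1), 16))


def scan(text: str) -> List[Finding]:
    """Return all artifacts with exact positions, ordered by start index."""
    findings = [
        Finding(m.start(), m.end(), kind, m.group(0), _decode(kind, m))
        for kind, pattern in PATTERNS
        for m in pattern.finditer(text)
    ]
    return sorted(findings)
```

Calling scan on a string that contains 'cafÃ©', '&amp;', and a literal backslash-u0027 sequence returns one finding per artifact, each with its start and end offsets and the character it most likely stood for. Because the sketch only does table lookups and regular-expression matches, the same input always yields the same findings, which is the sense in which the layer is deterministic.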

Why this matters

Encoding artifacts reveal the digital journey of text: they indicate copy-paste from web pages, chat interfaces, or document format conversions. AI-generated text copied from a chatbot interface and pasted into a document editor often carries encoding artifacts that would not be present in natively authored text.

Score thresholds

0-1: Clean encoding with no artifacts
2-3: Minor encoding issues, possibly from document conversion
4-5: Widespread encoding artifacts indicating AI chatbot copy-paste

Limitations

Document format conversions (PDF to DOCX) commonly introduce encoding artifacts unrelated to AI. International characters may be corrupted by email systems. Older documents may have legitimate encoding issues from outdated software.