Locale Specific Signals
Detects large language model (LLM) artifacts specific to non-English text across 12 languages. AI models trained predominantly on English produce characteristic errors when generating text in other languages: mixed diacritics, English calques translated literally, and unnatural connector patterns.
Technical description
E7 uses a per-language strategy pattern to detect four categories of locale-specific LLM artifacts. Each of the 12 supported languages has its own strategy module implementing encoding checks, calque detection, connector ratio analysis, and translationese pattern detection. English text is skipped because the other indicators (C1-C10, E1-E6) already cover it. The indicator runs at Layer 1 using only pattern matching and word-list matching.
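The strategy-per-language layout described above could be sketched roughly as follows. This is a minimal illustration, not the actual locale_strategies code: the class names, method names, `STRATEGIES` registry, `run_e7` function, and result keys are all hypothetical, and the single calque entry is taken from the Romanian example in this document.

```python
from dataclasses import dataclass, field


@dataclass
class LocaleFindings:
    """Hypothetical container for the four artifact categories."""
    encoding: list = field(default_factory=list)
    calques: list = field(default_factory=list)
    connector_ratio: float = 0.0
    translationese: list = field(default_factory=list)


class BaseLocaleStrategy:
    """Base strategy: every check defaults to a no-op."""
    lang = ""

    def check_encoding(self, text):
        return []

    def find_calques(self, text):
        return []

    def connector_ratio(self, text):
        return 0.0

    def find_translationese(self, text):
        return []

    def analyze(self, text):
        # Run all four checks and collect their results.
        return LocaleFindings(
            encoding=self.check_encoding(text),
            calques=self.find_calques(text),
            connector_ratio=self.connector_ratio(text),
            translationese=self.find_translationese(text),
        )


class RomanianStrategy(BaseLocaleStrategy):
    lang = "ro"
    # Calque -> more natural alternative (one illustrative entry only).
    CALQUES = {"luând în considerare": "ținând cont de"}

    def find_calques(self, text):
        lowered = text.lower()
        return [calque for calque in self.CALQUES if calque in lowered]


class GermanStrategy(BaseLocaleStrategy):
    # Inherits the no-op defaults until its own checks are written.
    lang = "de"


STRATEGIES = {s.lang: s for s in (RomanianStrategy(), GermanStrategy())}


def run_e7(text, lang):
    """Dispatch to the strategy for `lang`; English is skipped."""
    if lang == "en":
        return {"score": 0, "skipped": True, "reason": "english"}
    strategy = STRATEGIES.get(lang)
    if strategy is None:
        # Assumption: unsupported languages are skipped the same way.
        return {"score": 0, "skipped": True, "reason": "unsupported"}
    return {"skipped": False, "findings": strategy.analyze(text)}
```

In this sketch, a new language only needs a subclass that overrides the checks it has data for; everything else falls back to the base class defaults.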
Supported languages
Romanian (ro), German (de), French (fr), Spanish (es), Portuguese (pt), Italian (it), Turkish (tr), Chinese (zh), Japanese (ja), Korean (ko), Russian (ru), Hungarian (hu).
How it works
1. Encoding and diacritic anomalies. Checks for language-specific encoding errors that indicate LLM origin. Examples: Romanian text mixing comma-below and cedilla diacritics (ș/ş, ț/ţ), Chinese text mixing simplified and traditional characters, Japanese text with incorrect character width, Russian text with Latin homoglyphs (Cyrillic letters replaced by visually similar Latin ones).
2. English calque detection. Each language strategy maintains a dictionary of common English expressions that LLMs translate literally into the target language. Examples in Romanian: "taking into account" -> "luând în considerare" (calque) vs. natural "ținând cont de", "in the context of" -> "în contextul" (calque) vs. natural alternatives. These literal translations are grammatically correct but stylistically foreign: they suggest the LLM composed the sentence in English and then translated it.
3. Connector ratio. Compares the frequency of translated/calqued connectors against idiomatic connectors for the target language. A high ratio of translated connectors signals English-first generation. For example, Romanian text overusing "în plus" (calque of "in addition") while underusing "totodată" or "de asemenea".
4. Translationese patterns. Detects structural artifacts of English-first generation that survive translation: passive calques ("este demonstrat" instead of the natural reflexive "se demonstrează"), future calques ("vor fi analizate" instead of "se vor analiza"), English adjective-before-noun word order in Romance languages, and gerund overuse (-ând/-ind forms calqued from English -ing). This check directly addresses Gargova et al.'s (2025) finding that machine translation artifacts are language-specific and do not survive cross-lingual transfer, so detection must target the characteristic errors of the specific source→target language pair. Currently implemented for Romanian; other language strategies use the base class default (no-op) pending per-language calibration.
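Three of the checks listed above can be illustrated with a short sketch. The function names and the tiny word lists here are hypothetical, and the ratio formula (translated connectors as a share of all matched connectors) is an assumption, since the exact formula is not specified in this document.

```python
import re

# Romanian s/t with comma below (correct) vs. cedilla (legacy encoding).
COMMA_BELOW = set("\u0219\u021b\u0218\u021a")
CEDILLA = set("\u015f\u0163\u015e\u0162")

# Illustrative connector lists, taken from the examples in step 3.
TRANSLATED_CONNECTORS = ("în plus",)
IDIOMATIC_CONNECTORS = ("totodată", "de asemenea")


def has_mixed_diacritics(text):
    """True if comma-below and cedilla forms both occur in the text."""
    chars = set(text)
    return bool(chars & COMMA_BELOW) and bool(chars & CEDILLA)


def mixed_script_words(text):
    """Words containing both Latin and Cyrillic letters (homoglyph hint)."""
    words = re.findall(r"\w+", text)
    return [
        w for w in words
        if re.search(r"[A-Za-z]", w) and re.search(r"[\u0400-\u04ff]", w)
    ]


def count_phrases(text, phrases):
    """Count whole-phrase, case-insensitive occurrences."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
        for p in phrases
    )


def connector_ratio(text):
    """Share of translated connectors among all matched connectors.

    Assumed formula; returns 0.0 when no connector is found.
    """
    translated = count_phrases(text, TRANSLATED_CONNECTORS)
    idiomatic = count_phrases(text, IDIOMATIC_CONNECTORS)
    total = translated + idiomatic
    return translated / total if total else 0.0
```

A text that uses "în plus" twice and "de asemenea" once would yield a ratio of 2/3 under this definition; where the cutoff for "high" lies would need per-language calibration.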
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Text uses appropriate diacritics, idiomatic expressions, and natural connector patterns for the target language. |
| 2 to 3 | Some calques or encoding anomalies detected. Possible AI-assisted translation or non-native writing. |
| 4 to 5 | Heavy calque density, mixed encoding, and translated connector dominance. Strong signal of English-first LLM generation. |
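
For downstream code that needs to branch on the bands in the table above, a lookup along these lines may be enough. The function name and the band labels are hypothetical, and the sketch assumes integer scores from 0 to 5.

```python
def score_band(score: int) -> str:
    """Map an E7 score (assumed integer, 0 to 5) to a band label."""
    if not 0 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    if score <= 1:
        return "natural"          # 0 to 1
    if score <= 3:
        return "possible"         # 2 to 3
    return "strong"               # 4 to 5
```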
Why this matters
Detecting AI-generated text in non-English languages presents fundamentally different challenges from English. Al Ali et al. (2025) demonstrated that the well-known English finding of "lower perplexity for AI text" reverses in morphologically rich languages like Czech: non-native human writers produce higher entropy because of grammatical errors, while AI text sits in between. Gargova et al. (2025) showed that machine-translating non-English text to English for detection fails (0.44-0.51 accuracy for Bulgarian), while language-native detectors achieve 0.97 accuracy: the original artifacts do not survive translation.
The HERO system (Wang et al., EMNLP 2025) established that machine-translated and machine-paraphrased text carry statistical fingerprints distinct from both human text and directly generated text. Binary human/AI classifiers conflate these categories, flagging benign translation assistance as AI generation. E7 addresses this with language-native, artifact-specific detection rather than statistical perplexity proxies.
Al-Shaibani and Ahmed (2025/2026) found in the most comprehensive Arabic detection study to date (11.7K samples, 99.9% F1) that stylometric features are domain-specific within a language: formal academic prose and informal social media text require different feature weights. E7's strategy-per-language architecture mirrors this finding by allowing per-language calibration.
Limitations
Each language strategy is maintained independently with its own dictionaries and patterns. Coverage varies by language: Romanian has the most extensive dictionaries (developed first), while some languages have more limited calque lists. The indicator can be extended per language by adding to the locale_strategies module.
The calque detection approach is inherently conservative: it flags expressions known to be LLM-favored translations but cannot detect novel calques. As LLMs improve their multilingual capabilities, the calque signal may weaken for high-resource languages while remaining strong for mid-resource languages.
Chinese, Japanese, and Korean (CJK) text uses adapted word-counting heuristics (character-based rather than space-delimited tokenization), so sensitivity may differ slightly from the Latin-script languages.
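The difference between the two counting modes can be sketched as below. The function name and the exact Unicode ranges are assumptions chosen for illustration (common Han ideographs, Hiragana and Katakana, Hangul syllables); the real heuristics may cover more blocks.

```python
import re

# Common Han ideographs, Hiragana + Katakana, Hangul syllables.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7a3]")
WORD = re.compile(r"\w+")

CJK_LANGS = {"zh", "ja", "ko"}


def count_units(text, lang):
    """Count CJK characters for zh/ja/ko, delimited words otherwise."""
    if lang in CJK_LANGS:
        return len(CJK_CHAR.findall(text))
    return len(WORD.findall(text))
```

Because one CJK character is not equivalent to one space-delimited word, densities computed per unit are not directly comparable across the two modes, which is one source of the sensitivity difference noted above.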
The indicator returns a score of 0 and skipped metadata for English text. English-language documents should rely on E8 (Diacritics Forensics) for diacritic-level analysis.
References
- Wu Z, et al. DetectRL-X: towards reliable multilingual and real-world LLM-generated text detection. arXiv preprint arXiv:2605.15518. 2026. https://arxiv.org/abs/2605.15518
- Al-Shaibani M, Ahmed A. Arabic machine-generated text detection: stylometric analysis and cross-model evaluation. Expert Systems with Applications. 2026. https://www.sciencedirect.com/science/article/abs/pii/S0957417425042599
- Al Ali A, Helcl J, Libovicky J. Different time, different language: entropy vs perplexity in Czech AI text detection. 2025.
- Gargova S, et al. BuST: Bulgarian Siamese Transformer for machine-generated text detection. 2025.
- Wang J, et al. HERO: hierarchical length-robust detector for machine-generated text. Findings of EMNLP. 2025. https://aclanthology.org/2025.findings-emnlp.812/
- Tan Z, et al. DeTinyLLM: efficient detection of machine-generated text via compact paraphrase transformation. Information Fusion. 2025. https://www.sciencedirect.com/science/article/abs/pii/S1566253525007675
- Esperanto benchmark: back-translation evasion detection. 2025.