ChatGPT Fingerprint
Detects vocabulary and phrasing patterns specifically associated with OpenAI's ChatGPT models, such as the overuse of words like 'delve', 'crucial', and 'landscape'.
Technical description
Matches text against a curated lexicon of ~150 ChatGPT-characteristic words and phrases identified through large-scale corpus analysis. Key markers include: overused words ('delve', 'crucial', 'landscape', 'multifaceted', 'nuanced', 'comprehensive'), characteristic sentence starters ('It's important to note', 'It's worth mentioning'), and structural patterns (enumerated lists with 'Firstly/Secondly/Lastly'). Computes a fingerprint density score weighted by marker specificity. Marker weights track quantified 2024-2025 excess-vocabulary studies (Kobak et al. 2025; Liang et al. 2024); the lexicon spans 12 languages and a generic-phrase penalty isolates ChatGPT-specific signal. Model-agnostic sentence-level tells (negative parallelism, rule of three, connective overuse) are handled by the C-series (C10, C3). The lexicon is dated and refreshed periodically because some markers decay after public exposure.
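The specificity-weighted density described above can be sketched as follows. This is a minimal illustration, not the production scorer: the six-entry `LEXICON`, its weight values, and the per-100-token normalization are all assumptions made for the example, and the real lexicon (~150 entries, 12 languages, generic-phrase penalty) is not reproduced here.

```python
import re

# Hypothetical mini-lexicon: marker -> specificity weight.
# Words and weights are illustrative assumptions, not the curated values.
LEXICON = {
    "delve": 3.0,
    "multifaceted": 2.5,
    "nuanced": 1.5,
    "landscape": 1.0,
    "comprehensive": 1.0,
    "crucial": 0.5,  # common before LLMs, so weighted low
}

def fingerprint_density(text, lexicon=LEXICON):
    """Return weighted marker hits per 100 tokens (exact word-form matches only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    weighted_hits = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return 100.0 * weighted_hits / len(tokens)
```

Because matching is on exact word forms, inflections such as 'delves' would need their own entries or a lemmatization step, which this sketch leaves out.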
How it works
Layer 1 (deterministic): Matches text against a ChatGPT-specific vocabulary dictionary. Weights matches by specificity (rare markers weighted more than common ones). Detects characteristic sentence structures and list patterns. Computes a composite fingerprint score from word frequency, phrase matching, and structural patterns.
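One way the three components might be combined is sketched below. The function names, the saturation points, the 0.6/0.25/0.15 blend, and the 0-5 output scale are assumptions chosen for illustration; only the sentence starters and the 'Firstly/Secondly/Lastly' pattern come from the description above.

```python
import re

# Sentence starters and list markers named in the technical description.
STARTERS = ("it's important to note", "it's worth mentioning")
LIST_MARKERS = ("firstly", "secondly", "lastly")

def starter_rate(text):
    """Fraction of sentences that open with a characteristic starter."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [s.strip().lower() for s in parts if s.strip()]
    if not sentences:
        return 0.0
    hits = sum(s.startswith(STARTERS) for s in sentences)
    return hits / len(sentences)

def has_enumerated_list(text):
    """True when Firstly, Secondly and Lastly all appear, in that order."""
    lowered = text.lower()
    pos = 0
    for marker in LIST_MARKERS:
        match = re.search(rf"\b{marker}\b", lowered[pos:])
        if match is None:
            return False
        pos += match.end()
    return True

def composite_score(word_density, text):
    """Blend word, phrase and structure signals into a 0-5 score (assumed weights)."""
    word_part = min(word_density / 10.0, 1.0)         # saturates at 10 weighted hits per 100 tokens
    phrase_part = min(starter_rate(text) * 4.0, 1.0)  # saturates when 25% of sentences match
    struct_part = 1.0 if has_enumerated_list(text) else 0.0
    return round(5.0 * (0.6 * word_part + 0.25 * phrase_part + 0.15 * struct_part), 2)
```

Here `word_density` stands for whatever weighted lexical density the dictionary match produces. The starter check assumes straight apostrophes, so text with typographic apostrophes would need normalizing first.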
Why this matters
Differences in training data, RLHF alignment, and sampling parameters tend to give each major language model a distinctive vocabulary fingerprint. Excess-vocabulary studies report sharp rises in the frequency of certain words in academic text after ChatGPT's release (Kobak et al. 2025; Liang et al. 2024). Model-specific vocabulary patterns can therefore support not only detecting AI text but also estimating which model likely generated it.
Score thresholds
- 0-1: No ChatGPT-specific vocabulary detected
- 2-3: Some ChatGPT-associated words present
- 4-5: Strong ChatGPT vocabulary fingerprint throughout
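The bands above could be applied with a small lookup such as the following sketch. The function name `band` is hypothetical. The treatment of fractional scores (anything below 2 falls in the lowest band, anything below 4 in the middle band) is also an assumption, since the thresholds are listed only as integer ranges.

```python
def band(score):
    """Map a 0-5 fingerprint score to its threshold label."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if score < 2:   # band 0-1
        return "No ChatGPT-specific vocabulary detected"
    if score < 4:   # band 2-3
        return "Some ChatGPT-associated words present"
    return "Strong ChatGPT vocabulary fingerprint throughout"  # band 4-5
```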
Limitations
Some ChatGPT-associated words were common before AI (e.g., 'crucial'). Frequency thresholds must be calibrated to avoid false positives. Models are updated frequently, and vocabulary fingerprints may shift over time. Calibration finding (AAVR controlled triad of source-confirmed samples, 2026): on a finished, cleaned document, vendor attribution from prose is unreliable. ChatGPT, Claude and Gemini produced superimposable stylistic profiles on the same topic and prompt; the shared LLM signature (negative parallelism, rule of three, systematic hedging, rigid structure) fires across all three, and the classic lexical cliches are cross-vendor and now down-weighted for vendor discrimination. This indicator should be read as 'patterns associated with this vendor', and on text without category-F technical artifacts the correct report is 'LLM, vendor uncertain'. The signals that actually separate vendors are leaked output handles and bibliography integrity, not style.
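The reporting rule implied by the calibration finding might be expressed as in the sketch below. The function name, the boolean `has_category_f_artifacts` input, the cut-off of 2, and the wording of the first and last return strings are assumptions; only the 'LLM, vendor uncertain' label and the 'patterns associated with this vendor' reading come from the text above.

```python
def vendor_report(fingerprint_score, has_category_f_artifacts):
    """Choose report wording; style alone never yields a firm vendor attribution."""
    if fingerprint_score < 2:
        # Assumed wording for the lowest band: nothing to attribute.
        return "No ChatGPT-associated patterns"
    if not has_category_f_artifacts:
        # Vocabulary fired, but no technical artifact backs a vendor claim.
        return "LLM, vendor uncertain"
    # Assumed wording when technical artifacts accompany the lexical signal.
    return "Patterns associated with this vendor, supported by technical artifacts"
```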
References
- Kobak D, Gonzalez-Marquez R, Horvat E-A. (2025). Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances
- Liang W, et al. (2024). Monitoring AI-modified content at scale: the impact of ChatGPT on AI conference peer reviews. arXiv:2403.07183
- Tercon L, et al. (2025). Linguistic characteristics of AI-generated text: a survey. arXiv:2510.05136