ResAIKit
Research Integrity Toolkit
F1 · Text analysis · Fingerprint · Layer 1 (Deterministic)

ChatGPT Fingerprint

Reports how strongly a text exhibits the lexical and structural habits associated with ChatGPT (GPT-4 and GPT-4o) output: a characteristic weighted vocabulary together with the formatting tics the model carries over from its chat interface.

Technical description

F1 scores a document on two components, lexical and structural, and normalises their sum to a 0 to 5 scale. The lexical component sums weight × occurrences over every case-insensitive match against a per-language dictionary of ChatGPT-associated phrases. Weights are integers from 1 to 5, rising from mildly elevated words at the low end (crucial, robust, comprehensive) to strongly over-represented phrases at weight 5 (delve, tapestry, in the realm of, treasure trove). When a matched phrase belongs to the shared cross-model generic set, its weight is multiplied by 0.25 before summing, which leaves the ChatGPT-specific residue. The structural component adds fixed points: any bold Markdown span **…** (+3), em-dash density above 0.5 per sentence (+2), a bullet list of exactly three or five items (+3), and an affirmative closing paragraph (+2). The raw total R maps to the reported score as min(5.0, R / 15 × 5), so any R ≥ 15 saturates at 5.0. Twelve language dictionaries exist (English, Romanian, German, Spanish, French, Italian, Portuguese, Russian, Turkish, Chinese, Japanese, Korean); the document language selects which one is used.
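The description above can be written compactly as a formula. The symbols are introduced here for illustration and are not the toolkit's own notation: D is the selected language dictionary, G the shared cross-model set, w_p and n_p the weight and occurrence count of phrase p, and s_k the points from the four structural checks (each either 0 or its fixed value).

```latex
R = \sum_{p \in D} g_p \, w_p \, n_p \;+\; \sum_{k=1}^{4} s_k,
\qquad
g_p =
\begin{cases}
0.25 & \text{if } p \in G \\
1    & \text{otherwise}
\end{cases}
\qquad
S = \min\!\left(5.0,\; \frac{R}{15} \times 5\right)
```

As a worked case under these definitions: a text containing delve twice (weight 5, assumed not to be in G) and one bold span gives R = 5 × 2 + 3 = 13, so S = 13 / 15 × 5 ≈ 4.33.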

How it works

The implementation is deterministic and runs at Layer 1, using only compiled regular expressions over the document text.

Lexical scoring. Each language has a dictionary that maps a phrase to an integer weight from 1 to 5; a higher weight marks a phrase that is more distinctive of ChatGPT output. The weights are calibrated to the measured fold-increase in each term's frequency in post-2022 text relative to a human baseline, so the most heavily over-represented words carry the most weight. Delve, tapestry, in the realm of, treasure trove, unwavering and ever-evolving sit at weight 5; mid-distinctive terms such as multifaceted, leveraging, seamlessly and it is important to note sit at weight 4; and broadly common words that are only mildly elevated, such as crucial, robust and comprehensive, sit at weight 2. For every phrase that matches (case-insensitively), the indicator adds its weight multiplied by the number of occurrences. Phrases that are over-represented across many models, rather than in ChatGPT specifically, are held in a shared cross-model set. When a matched phrase belongs to that set, its weight is multiplied by a generic factor of 0.25 before being added, so that signal shared with Claude, Gemini and others is discounted and the score reflects the ChatGPT-specific residue. A matched phrase of weight 4 or more is reported at warning severity; lighter matches are reported at informational severity.
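A minimal sketch of the lexical pass, not the toolkit's actual code. The dictionary below is a small hypothetical excerpt that uses the weights quoted above. The membership of the cross-model set, the use of word boundaries, and the choice to decide severity from the undiscounted weight are all assumptions made for illustration.

```python
import re

# Hypothetical excerpt of an English dictionary. The weights follow the
# examples in the text; the real dictionary is larger.
WEIGHTS = {
    "delve": 5,
    "tapestry": 5,
    "in the realm of": 5,
    "multifaceted": 4,
    "it is important to note": 4,
    "crucial": 2,
    "robust": 2,
}

# Assumption: the text does not say which phrases are in the shared
# cross-model set, so these two entries are placeholders.
GENERIC = {"crucial", "robust"}
GENERIC_FACTOR = 0.25

# One compiled, case-insensitive pattern per phrase. Word boundaries are an
# assumption; with them, "delve" does not match "delves" or "delving".
PATTERNS = {
    phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    for phrase in WEIGHTS
}


def lexical_score(text):
    """Return (lexical raw sum, list of per-phrase match records)."""
    total = 0.0
    matches = []
    for phrase, weight in WEIGHTS.items():
        count = len(PATTERNS[phrase].findall(text))
        if count == 0:
            continue
        # Discount phrases shared across models before adding them.
        effective = weight * GENERIC_FACTOR if phrase in GENERIC else float(weight)
        total += effective * count
        matches.append({
            "phrase": phrase,
            "count": count,
            "effective_weight": effective,
            # Assumption: severity uses the dictionary weight, not the
            # discounted one.
            "severity": "warning" if weight >= 4 else "info",
        })
    return total, matches
```

With this toy dictionary, the sentence "We delve into a robust, Multifaceted tapestry. Delve again." yields a lexical sum of 19.5: 10 from two matches of delve, 5 from tapestry, 4 from multifaceted, and 0.5 from the discounted robust.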

Structural scoring. Four formatting signatures are checked, each contributing a fixed number of raw points. The presence of any bold Markdown span (text wrapped in double asterisks) contributes 3, since the model habitually emboldens key terms in its chat output. An em-dash density above 0.5 per sentence, counting em-dashes, en-dashes and the double-hyphen surrogate, contributes 2. A bulleted list with exactly three or exactly five items contributes 3, reflecting the model's preference for those list lengths. An upbeat closing paragraph, detected when the final paragraph matches a set of affirmative closing words, contributes 2.
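The four checks could be approximated with a handful of regular expressions, as in the sketch below. The point values come from the text; everything else is an assumption made for illustration, including the sentence splitter, the bullet markers recognised, the rule that a list is a run of consecutive bulleted lines, and above all the closing-word set, which the text does not enumerate.

```python
import re

BOLD = re.compile(r"\*\*[^*\n]+\*\*")
# Em-dash (U+2014), en-dash (U+2013) and the double-hyphen surrogate.
DASH = re.compile(r"\u2014|\u2013|--")
# Assumption: a bullet is "-", "*" or a bullet character followed by a space.
BULLET = re.compile(r"^\s*[-*\u2022]\s+\S")
# Assumption: placeholder closing words, not the toolkit's actual list.
CLOSING = re.compile(r"\b(overall|ultimately|in conclusion|promising)\b",
                     re.IGNORECASE)
# Assumption: a sentence ends at ., ! or ? followed by whitespace or the end.
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def bullet_run_lengths(text):
    """Lengths of each run of consecutive bulleted lines."""
    runs, current = [], 0
    for line in text.splitlines():
        if BULLET.match(line):
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def structural_score(text):
    """Return the four structural contributions as raw points."""
    sentences = max(1, len(SENTENCE_END.findall(text)))
    dash_density = len(DASH.findall(text)) / sentences
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    last = paragraphs[-1] if paragraphs else ""
    return {
        "bold": 3 if BOLD.search(text) else 0,
        "dash_density": 2 if dash_density > 0.5 else 0,
        "list_3_or_5": 3 if any(n in (3, 5) for n in bullet_run_lengths(text)) else 0,
        "upbeat_closing": 2 if CLOSING.search(last) else 0,
    }
```

Returning the contributions separately rather than as one number keeps each signature visible, which fits the document's point that the score should be reconstructible from what was detected.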

Aggregation. The lexical sum and the four structural contributions are added into a single raw score, which is then mapped to the reported scale as the minimum of 5.0 and the raw score divided by 15 and multiplied by 5. The raw score and the full table of detected phrases with their counts and effective weights are returned in the metadata, so the reported score can be reconstructed from the text.
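The mapping and the reconstruction from metadata can be sketched as follows. The formula is the one stated above; the shape of the metadata dictionary (the keys phrases, count, effective_weight and structural) is hypothetical and chosen only for this example.

```python
def normalise(raw):
    """Map a raw score onto the reported 0 to 5 scale; saturates at raw >= 15."""
    return min(5.0, raw / 15 * 5)


def reconstruct(metadata):
    """Rebuild (raw, reported) from a hypothetical metadata record."""
    lexical = sum(m["count"] * m["effective_weight"] for m in metadata["phrases"])
    structural = sum(metadata["structural"].values())
    raw = lexical + structural
    return raw, normalise(raw)


# Hypothetical metadata: two matches of "delve", two of a discounted generic
# phrase, and one structural signature (bold).
example = {
    "phrases": [
        {"phrase": "delve", "count": 2, "effective_weight": 5.0},
        {"phrase": "crucial", "count": 2, "effective_weight": 0.5},
    ],
    "structural": {"bold": 3, "dash_density": 0, "list_3_or_5": 0, "upbeat_closing": 0},
}
raw, score = reconstruct(example)  # raw is 14.0, score is 14 / 3
```

Because the cap is applied last, two documents with raw scores of 15 and 40 both report 5.0; the raw score in the metadata is what distinguishes them.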

Score thresholds

Score | Meaning
0 to 1 | Few or no ChatGPT-associated phrases, no clustering of formatting tics. Typical of writing produced before the model era or edited to remove its surface habits.
2 to 3 | A noticeable concentration of the characteristic vocabulary, or one or two formatting signatures. Common in lightly machine-assisted drafts and in writing by authors who have absorbed the house style.
4 to 5 | The vocabulary and several formatting signatures co-occur densely: high-weight terms, bold key phrases, em-dashes and template-sized lists together. Strongly consistent with unedited ChatGPT output.

Why this matters

When ChatGPT was released, it left an unusually sharp mark on the written record. A study of more than fifteen million biomedical abstracts found an abrupt rise after 2022 in a small set of words, with some term families increasing several-fold, and showed that by 2024 the excess vocabulary was dominated by style words such as delve, intricate and underscore rather than by content. That over-representation is the empirical basis for the lexical weights: the words the model reaches for are precisely the ones whose frequency jumped, and their distinctiveness can be measured rather than guessed. The structural signatures have the same origin in the model's training. Optimised through human feedback to produce clear, well-formatted chat answers, the model emboldens key terms, closes on an upbeat note, and defaults to short, even lists; when that output is pasted into a manuscript without editing, the chat formatting travels with it. F1 isolates the part of this signal that is specific to ChatGPT by discounting the vocabulary it shares with other models, which is what separates a model-specific fingerprint from a generic "sounds like AI" detector.

Limitations

A vocabulary fingerprint ages. As writers and models adapt, the over-represented words shift from year to year, and some terms that were strongly diagnostic in 2024 have already begun to recede; the weights therefore require periodic recalibration against a current baseline, and a fixed dictionary will lose sensitivity over time. The fingerprint is also easy to defeat deliberately: a single pass through a paraphraser, or a manual find-and-replace of the highest-weight words, removes most of the lexical signal while leaving the meaning intact. In the other direction, every marker has legitimate uses, so a human author who favours em-dashes, writes upbeat conclusions, or works in a field where a flagged term is ordinary technical vocabulary can accumulate a moderate score without any machine involvement, which is why the indicator is framed as a fingerprint rather than a classifier. The lexical dictionaries cover twelve languages unevenly, with the English profile by far the most developed, so sensitivity is lower for the other languages and absent outside the twelve. Finally, the structural checks assume Markdown-style formatting; a document whose formatting was stripped during conversion to plain text will not register the bold or list signatures even if it began as chat output.

Theoretical background

F1 belongs to the tradition of lexical fingerprinting, in which an author or a source is identified from the relative frequencies of particular words rather than from content. The contemporary version of this idea is the excess-vocabulary method, which compares word frequencies before and after a known intervention (here, the public release of large language models) and isolates the terms whose usage jumped. The same method that quantifies how much of a corpus was machine-assisted also yields, as a by-product, the ranked list of model-associated words that F1 turns into weights. A second strand asks why these particular words are over-produced, tracing the preference for terms like delve to the data and the feedback process used to align the model; this work supports treating the vocabulary as a stable, mechanistically grounded signature rather than a coincidence. The discounting of cross-model generic phrases reflects a third idea: as different models converge on a shared assistant register, only the residual, model-specific vocabulary remains diagnostic of a particular system, so a fingerprint aimed at one model must actively subtract the signal common to all of them.
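At its simplest, the before-and-after comparison is a ratio of relative word frequencies. The sketch below illustrates that idea only: it uses a plain smoothed ratio over two toy corpora, which is a deliberate simplification and should not be read as the procedure of the published studies or as the way F1's weights were derived. The tokeniser and the add-one smoothing are assumptions.

```python
import re
from collections import Counter

# Assumption: lower-cased alphabetic tokens, hyphenated words kept whole.
TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def word_counts(docs):
    """Token counts over a list of document strings."""
    counts = Counter()
    for doc in docs:
        counts.update(TOKEN.findall(doc.lower()))
    return counts


def fold_increase(before_docs, after_docs, smoothing=1.0):
    """Rank words by smoothed relative frequency after / before."""
    before, after = word_counts(before_docs), word_counts(after_docs)
    n_before, n_after = sum(before.values()), sum(after.values())
    vocab = set(before) | set(after)
    v = len(vocab)
    ratios = {}
    for word in vocab:
        # Additive smoothing keeps the ratio finite for unseen words.
        f_before = (before[word] + smoothing) / (n_before + smoothing * v)
        f_after = (after[word] + smoothing) / (n_after + smoothing * v)
        ratios[word] = f_after / f_before
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


ranked = fold_increase(
    ["we study the cell", "we study the data"],
    ["we delve into the cell", "we delve into the data"],
)
```

In this toy case delve and into rise to the top of the ranking while study falls below 1. A real calibration would need large corpora and a filter that separates style words from ordinary function and content words.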

References

  1. Kobak D, González-Márquez R, Horvát EÁ, Lause J. Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances. 2025. https://arxiv.org/abs/2406.07016
  2. Juzek TS, Ward ZB. Why does ChatGPT delve so much? Exploring the sources of lexical overrepresentation in large language models. Proceedings of COLING 2025. 2024. https://arxiv.org/abs/2412.11385
  3. Liang W, Zhang Y, Wu Z, Lepp H, Ji W, Zhao X, Cao H, Liu S, He S, Huang Z, Yang D, Potts C, Manning CD, Zou J. Quantifying large language model usage in scientific papers. Nature Human Behaviour. 2025. DOI: 10.1038/s41562-025-02273-8 https://www.nature.com/articles/s41562-025-02273-8
  4. Thelwall M, Kousha K. Have LLM-associated terms increased in article full texts in all fields? arXiv preprint arXiv:2604.07565. 2026. https://arxiv.org/abs/2604.07565