Terminology
Detects overuse of vocabulary characteristic of large language models (LLMs), decorative jargon, undefined key terms, section-level terminological imprecision, and low domain-noun richness. AI-generated academic text tends to rely on a finite set of "excess" style words rather than genuine domain-specific terminology.
Technical description
C7 operationalises the "terminology precision" dimension from the Anti-AI Vibe Review spec using a comprehensive dictionary of LLM-excess vocabulary (462 entries derived from Kobak et al. 2025, GPTZero monthly lists, and the Antislop framework) plus four supporting sub-checks. The indicator measures five aspects of terminological quality: (1) the density of LLM-characteristic vocabulary per 1000 tokens, (2) the proportion of technical terms used only once (decorative jargon), (3) whether key recurring terms are introduced with definitional context, (4) whether Methods and Results sections specifically show elevated LLM-vocabulary density, and (5) the type-token ratio on domain-specific nouns. The indicator runs at Layer 2 because it requires lemmatisation and part-of-speech tagging.
How it works
Sub-check 1, excess vocabulary density. All tokens in the document are checked against a 462-entry dictionary of LLM-characteristic vocabulary. About 66% of the entries are verbs (delve, underscore, showcase, elucidate, encompass, facilitate, leverage, necessitate, streamline, etc.) and 14% are adjectives (crucial, intricate, pivotal, multifaceted, groundbreaking, meticulous, robust, comprehensive, nuanced, seamless, unparalleled, etc.). The remainder consists of adverbs (notably, additionally, particularly, predominantly, seamlessly, ultimately, etc.), style nouns (realm, landscape, interplay, tapestry, insights, endeavors, etc.), and formulaic phrases (it is important to note, studies have shown, further research is needed, etc.). The density of matches per 1000 tokens is computed. A density above 80 contributes +2.0 to the score; a density above 40 but not above 80 contributes +1.0.
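The density computation and its two thresholds can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the names `EXCESS_WORDS`, `EXCESS_PHRASES`, `excess_density`, and `density_contribution` are hypothetical, the word list is a six-entry stand-in for the 462-entry dictionary, and matching phrases against the raw text (rather than against lemmas) is an assumption.

```python
# Illustrative subset only; the dictionary described above has 462 entries.
EXCESS_WORDS = {"delve", "underscore", "crucial", "pivotal", "notably", "realm"}
EXCESS_PHRASES = ("it is important to note", "studies have shown")


def excess_density(lemmas: list[str], raw_text: str = "") -> float:
    """Return dictionary matches per 1000 tokens.

    `lemmas` are lower-cased lemmas (lemmatisation is assumed to happen
    upstream). Phrases are matched against the raw text, which is an
    assumption about how multi-word entries are handled.
    """
    if not lemmas:
        return 0.0
    word_hits = sum(1 for lemma in lemmas if lemma in EXCESS_WORDS)
    lowered = raw_text.lower()
    phrase_hits = sum(lowered.count(phrase) for phrase in EXCESS_PHRASES)
    return 1000.0 * (word_hits + phrase_hits) / len(lemmas)


def density_contribution(density: float) -> float:
    """Map a density to its score contribution: >80 gives 2.0, >40 gives 1.0."""
    if density > 80:
        return 2.0
    if density > 40:
        return 1.0
    return 0.0
```

Under these assumptions, a 100-token text with three dictionary hits has a density of 30 per 1000 tokens and contributes nothing to the score.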
Sub-check 2, decorative jargon detection. For all nouns longer than five characters, per-lemma occurrence counts are tallied within the document. Technical terms appearing exactly once (hapax legomena) and longer than seven characters are flagged as decorative jargon: terms that signal technical competence but are never integrated into the document's conceptual framework. When more than 10 technical terms are present and over 60% of them are hapax, the sub-check fires, contributing +1.0.
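A minimal sketch of the hapax rule follows. The description leaves some room for interpretation, so this version assumes that "technical terms" means distinct noun lemmas longer than five characters, and that the 60% ratio is the share of those terms that occur exactly once and are longer than seven characters. The function name `decorative_jargon` is hypothetical.

```python
from collections import Counter


def decorative_jargon(noun_lemmas: list[str]) -> tuple[list[str], float]:
    """Return (flagged hapax terms, score contribution).

    `noun_lemmas` holds one lower-cased lemma per noun occurrence;
    part-of-speech filtering is assumed to happen upstream.
    """
    # Candidate technical terms: noun lemmas longer than five characters.
    counts = Counter(lemma for lemma in noun_lemmas if len(lemma) > 5)
    if not counts:
        return [], 0.0
    # Decorative jargon: used exactly once and longer than seven characters.
    hapax = sorted(t for t, n in counts.items() if n == 1 and len(t) > 7)
    ratio = len(hapax) / len(counts)
    fires = len(counts) > 10 and ratio > 0.6
    return hapax, (1.0 if fires else 0.0)
```

With 11 distinct long terms that each appear once, the ratio is 1.0 and the sub-check fires; with only 10 such terms it does not, because the count condition requires more than 10.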
Sub-check 3, missing definition detection. Nouns appearing more than three times are considered key terms. The text is scanned for definitional patterns in English ("represents", "is defined as", "refers to", "consists of", "means") and in Romanian ("reprezinta", "se defineste ca", "consta in", "se refera la"). If the document has more than two key terms and no definitional pattern is found anywhere, the sub-check fires, contributing +1.0.
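The rule can be sketched as below. The pattern list is taken from the description; everything else is an assumption made for illustration: the names `DEFINITION_PATTERNS`, `strip_diacritics`, and `missing_definition_contribution` are hypothetical, and stripping diacritics before matching is a guess based on the Romanian patterns being listed without them.

```python
import re
import unicodedata
from collections import Counter

# English and Romanian definitional cues listed in the description.
DEFINITION_PATTERNS = [
    re.compile(r"\b" + re.escape(cue) + r"\b")
    for cue in (
        "represents", "is defined as", "refers to", "consists of", "means",
        "reprezinta", "se defineste ca", "consta in", "se refera la",
    )
]


def strip_diacritics(text: str) -> str:
    """Remove combining marks (assumed normalisation step)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def missing_definition_contribution(noun_lemmas: list[str], text: str) -> float:
    """Return 1.0 if there are >2 key terms and no definitional cue at all."""
    counts = Counter(noun_lemmas)
    key_terms = [term for term, n in counts.items() if n > 3]
    normalised = strip_diacritics(text.lower())
    has_definition = any(p.search(normalised) for p in DEFINITION_PATTERNS)
    return 1.0 if len(key_terms) > 2 and not has_definition else 0.0
```

Note that a single cue anywhere in the text suppresses the sub-check, regardless of which term it defines, which mirrors the document-level rule stated above.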
Sub-check 4, technical-section terminology. The text is partitioned into IMRaD (Introduction, Methods, Results and Discussion) sections. The excess vocabulary density is computed per section. The Methods and Results sections are checked specifically: a density above 50 per 1000 tokens in either section fires the sub-check with +0.5 per section (max +1.0). These are the sections that should contain the most domain-specific vocabulary; high LLM-marker density in Methods or Results is a strong signal that the author is using template-generated prose rather than describing actual experimental or analytical work.
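As a rough sketch of the per-section rule, assuming the IMRaD classifier has already produced a mapping from section name to lemmas: the section keys `"methods"` and `"results"`, the six-word dictionary, and the function names are all hypothetical stand-ins.

```python
# Illustrative subset only; the real dictionary has 462 entries.
EXCESS_WORDS = {"delve", "underscore", "crucial", "pivotal", "notably", "realm"}


def density(lemmas: list[str]) -> float:
    """Dictionary matches per 1000 tokens for one section."""
    if not lemmas:
        return 0.0
    hits = sum(1 for lemma in lemmas if lemma in EXCESS_WORDS)
    return 1000.0 * hits / len(lemmas)


def section_contribution(sections: dict[str, list[str]]) -> float:
    """Add 0.5 for each of Methods and Results whose density exceeds 50."""
    score = 0.0
    for name in ("methods", "results"):
        lemmas = sections.get(name)
        # A missing or empty section is skipped rather than penalised.
        if lemmas and density(lemmas) > 50:
            score += 0.5
    return score
```

The comparison is strict, so a section sitting exactly at 50 matches per 1000 tokens does not fire.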
Sub-check 5, domain noun richness. All nouns that are not in the excess vocabulary dictionary and not stop words are extracted, and their type-token ratio (TTR) is computed. A TTR below 0.35 on domain nouns indicates that the text recycles a narrow set of domain-specific terms, a pattern observed in LLM-generated academic prose where the model defaults to a small repertoire of "safe" domain nouns. When the TTR falls below 0.35, the sub-check contributes +0.5 to the score.
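The type-token ratio is the number of distinct domain nouns divided by the total number of domain-noun occurrences. A minimal sketch, with hypothetical function names and with the excess dictionary and stop-word list passed in as plain sets:

```python
from typing import Optional


def domain_noun_ttr(
    noun_lemmas: list[str], excess_words: set[str], stop_words: set[str]
) -> Optional[float]:
    """Type-token ratio over nouns outside the excess and stop-word lists."""
    domain = [
        lemma for lemma in noun_lemmas
        if lemma not in excess_words and lemma not in stop_words
    ]
    if not domain:
        return None  # No domain nouns: the ratio is undefined.
    return len(set(domain)) / len(domain)


def richness_contribution(ttr: Optional[float]) -> float:
    """Return 0.5 when the domain-noun TTR is below 0.35, else 0.0."""
    return 0.5 if ttr is not None and ttr < 0.35 else 0.0
```

For example, ten noun occurrences drawn from only two distinct lemmas give a TTR of 0.2, which is below the threshold. Treating a text with no domain nouns as "no contribution" is an assumption of this sketch.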
The five contributions sum to a theoretical maximum of 5.5 (2.0 + 1.0 + 1.0 + 1.0 + 0.5), with a hard clamp at 5.0.
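The aggregation step is a sum followed by a clamp. A short sketch, with a hypothetical function name and the five contributions passed in as already-computed values:

```python
def c7_score(
    density: float,
    jargon: float,
    definitions: float,
    sections: float,
    richness: float,
) -> float:
    """Sum the five sub-check contributions and clamp the result to [0, 5]."""
    total = density + jargon + definitions + sections + richness
    return min(5.0, max(0.0, total))
```

When every sub-check fires at its maximum, the raw sum is 5.5 and the clamp brings the reported score down to 5.0.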
Why this matters
Kobak and colleagues' landmark analysis of 15 million PubMed abstracts (2010-2024) demonstrated that at least 13.5% of 2024 biomedical abstracts show excess LLM-characteristic vocabulary, with some subcorpora reaching 40% [1]. The excess words are disproportionately style words (66% verbs and 14% adjectives), which makes them qualitatively different from the content-noun spikes seen during the COVID-19 pandemic. The most explosive markers include delves (28x frequency increase), underscores (13.8x), and showcasing (10.7x). The study also documented that authors began actively avoiding flagged words like delve after April 2024, shifting to subtler overused common words like significant and additionally. The result is an arms race between LLM vocabulary and authorial self-censorship.
Munoz-Ortiz and colleagues compared human and LLM-generated news text across six models and found that human texts exhibit richer vocabulary, more varied sentence lengths, and shorter constituents, while LLM outputs use more numbers, symbols, auxiliaries, and pronouns, suggesting more "objective" but less lexically varied language [2]. Differences between LLMs and humans exceeded differences between LLMs themselves, meaning the human/LLM vocabulary boundary is a robust signal.
The vocabulary problem extends beyond English. Heinisch tested ChatGPT, Perplexity, and Microsoft CoPilot on terminology work for higher education systems and found that LLMs failed to differentiate between domain-specific terminology across regions, fabricated plausible-sounding technical terms, and produced inconsistent outputs across identical prompts [3].
Leppanen and colleagues analyzed 56,000+ Massive Open Online Course (MOOC) essays pre- and post-ChatGPT and found that while essay length increased by 53%, vocabulary type-token ratio dropped from 0.617 to 0.577: student writing became longer but more repetitive [4]. This aligns with the domain-noun richness check in sub-check 5: LLM-assisted text expands in length but contracts in lexical variety.
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Low density of LLM-characteristic vocabulary. Domain-specific nouns show healthy type-token diversity. Technical terms are used consistently and introduced with definitions. Methods and Results sections use appropriate domain terminology. |
| 2 to 3 | Moderate LLM vocabulary density (40-80 per 1000 tokens). Some decorative jargon or undefined key terms. Typical of AI-assisted text or formulaic academic writing. |
| 4 to 5 | Severe terminology deficit: LLM vocabulary exceeds 80 per 1000 tokens, most technical terms are one-off decorations, key terms lack definitions, and even Methods/Results sections are padded with generic style words. Highly consistent with unprompted LLM output. |
Limitations
The excess vocabulary dictionary is derived primarily from biomedical literature (PubMed abstracts) and may under-represent LLM vocabulary patterns in other disciplines (humanities, law, engineering). As LLM vocabulary preferences shift over time and authors adapt to avoid known markers, the dictionary will require periodic updates. The detection arms race documented by Kobak et al., where delves usage dropped after public identification, means that today's most diagnostic words may be tomorrow's avoided ones.
Sub-check 5 (domain noun TTR) uses a static 0.35 threshold. Domain noun richness varies naturally by document type: a focused methods paper will legitimately have lower domain-noun TTR than a broad review article. The threshold is deliberately conservative to avoid penalising legitimate discipline-specific repetition.
The definitional context check uses a small set of explicit definition patterns. A text that defines its terms only through apposition, example, or implicit context, with no explicit formulaic definition anywhere, will trigger sub-check 3 even though its key terms are in fact defined.
Technical-section terminology (sub-check 4) requires the IMRaD section classifier to recognise Methods and Results headings. Documents without standard headings will skip this sub-check.
References
1. Kobak D, Gonzalez-Marquez R, Horvat E-A, Lause J. Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances. 2025;11(27):eadt3813. DOI: 10.1126/sciadv.adt3813 https://www.science.org/doi/10.1126/sciadv.adt3813
2. Munoz-Ortiz A, Gomez-Rodriguez C, Vilares D. Contrasting linguistic patterns in human and LLM-generated news text. Artificial Intelligence Review. 2024;57:265. DOI: 10.1007/s10462-024-10903-2 https://arxiv.org/abs/2308.09067
3. Heinisch B. Large language models for terminology work: a question of the right prompt? Journal for Language Technology and Computational Linguistics (JLCL). 2025. https://jlcl.org/article/view/280
4. Leppanen L, et al. How large language models are changing MOOC essay answers: a comparison of pre- and post-LLM responses. 2025.
5. Schmalz V, Tack A. Can GPTZero's AI vocabulary distinguish between LLM-generated and student-written essays? Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA). 2025. https://aclanthology.org/2025.bea-1.71/
6. Shaib C, Li Y, Tetreault J, Jaimes A. Measuring AI "slop" in text. arXiv preprint arXiv:2509.19163. 2025. https://arxiv.org/abs/2509.19163
7. Jarvis S, et al. Lexical diversity analysis across ChatGPT versions vs. humans. arXiv preprint arXiv:2508.00086. 2025. https://arxiv.org/abs/2508.00086