ResAIKit
Research Integrity Toolkit
L4-Terminology · Text analysis · LLM · Layer 4

Terminology Analysis

Uses a language model to judge how a text uses technical terminology, scoring misuse, decorative jargon, domain fit, and consistency, and combining them into a weighted score. It reads for meaning, which is what separates it from frequency-based vocabulary checks.

Technical description

L4-Terminology is the semantic layer above the deterministic vocabulary checks. Those checks catch overused words by counting frequency, which suits a known list of style words. L4-Terminology asks the questions counting cannot answer: in this sentence, does the term carry meaning, is it the correct and precise term, does it belong to this field, and is the same concept named the same way throughout? It runs at Layer 4 when a model is configured and the text is at least 100 words long. It sends the first 8000 characters of the document together with a four-dimension rubric, then aggregates the dimension scores into a weighted score from 0 to 5, retaining the per-dimension breakdown.
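The run conditions above can be sketched in a few lines. This is a minimal illustration and not the toolkit's code: the function name `prepare_input`, the constant names, and counting words by splitting on whitespace are assumptions; only the 100-word minimum and the 8000-character cut come from the description.

```python
from typing import Optional

MIN_WORDS = 100   # from the description: text must be at least 100 words
MAX_CHARS = 8000  # from the description: only the first 8000 characters are sent


def prepare_input(text: str, model_configured: bool) -> Optional[str]:
    """Return the text to send to the model, or None when the check is skipped.

    Hypothetical helper: word counting by whitespace is an assumption.
    """
    if not model_configured:
        return None
    if len(text.split()) < MIN_WORDS:
        return None
    # Truncate rather than reject: a long document is judged on its opening.
    return text[:MAX_CHARS]
```

Returning None for a skipped run, rather than a score of 0, keeps "not assessed" distinct from "assessed and clean".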

How it works

The model receives the text together with a rubric of four dimensions, each scored independently from 0 to 5.

Misuse asks whether a technical term is used incorrectly or imprecisely, including overgeneralization, where a precise term is replaced by a vaguer one or stretched past its meaning (for example, using prevalence for incidence). Because domain correctness is the hardest of these to judge from memory, the model is instructed to flag here only when confident.

Decorative jargon asks whether terms are used to sound authoritative without adding meaning. The prompt seeds this with the documented set of style words language models overuse (delve, intricate, underscore, pivotal, showcase, meticulous, notably), but the judgement is whether the word does real work in its sentence, which a frequency count cannot make.

Domain fit asks whether the vocabulary is drawn from the wrong field or register.

Consistency asks whether the same concept is named inconsistently or one term is used for two concepts.
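One way to picture the four dimensions is as a small data structure that is rendered into the prompt. The sketch below is illustrative only: the dictionary keys, the rubric wording (paraphrased from the description above), the prompt layout, and the name `build_prompt` are all assumptions, not the toolkit's actual prompt.

```python
# Hypothetical rubric wording, paraphrased from the dimension descriptions.
RUBRIC = {
    "misuse": "Is a technical term used incorrectly or imprecisely, "
              "or stretched past its meaning? Flag only when confident.",
    "decorative": "Is a term used to sound authoritative without adding meaning?",
    "domain": "Is the vocabulary drawn from the wrong field or register?",
    "consistency": "Is one concept named inconsistently, "
                   "or one term used for two concepts?",
}

# Style words listed in the description as the seed for decorative jargon.
STYLE_WORDS = ("delve", "intricate", "underscore", "pivotal",
               "showcase", "meticulous", "notably")


def build_prompt(text: str, max_chars: int = 8000) -> str:
    """Assemble an illustrative prompt: instructions, rubric, seed words, text."""
    lines = [
        "Score each dimension independently from 0 to 5.",
        "Flag a term only when reasonably confident.",
        "",
    ]
    for name, question in RUBRIC.items():
        lines.append(f"- {name}: {question}")
    lines.append("")
    lines.append("Style words to examine for decorative use: " + ", ".join(STYLE_WORDS))
    lines.append("")
    lines.append("TEXT:")
    lines.append(text[:max_chars])  # only the opening of the document is sent
    return "\n".join(lines)
```

Keeping each dimension as its own rubric entry mirrors the criterion-separated design: the model is asked four narrow questions rather than one broad one.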

Abstention. The prompt instructs the model to flag a term only when reasonably confident, and any finding it marks low-confidence is dropped at the code level.
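The code-level half of the abstention rule amounts to a filter over the model's findings. The following is a minimal sketch under stated assumptions: the `Finding` class, its field names, and the string value "low" for the confidence marker are hypothetical, since the description does not give the response schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    term: str
    dimension: str   # e.g. "misuse", "decorative", "domain", "consistency"
    confidence: str  # assumed to be a label such as "low", "medium", "high"


def drop_low_confidence(findings: List[Finding]) -> List[Finding]:
    """Keep only findings the model did not mark as low-confidence."""
    return [f for f in findings if f.confidence.strip().lower() != "low"]
```

Filtering in code as well as in the prompt means the rule still holds when the model ignores the instruction and reports an uncertain finding anyway.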

Aggregation. The dimension scores sᵢ are combined by a weighted mean with weights wᵢ (misuse 0.30, decorative jargon 0.30, domain fit 0.20, consistency 0.20), taken only over the dimensions the model returned: score = Σ wᵢ sᵢ / Σ wᵢ, clamped to the range 0 to 5. When the model returns a single overall score instead of the rubric, that value is used directly. Each surviving finding is labelled with its dimension (Terminology misuse, Decorative jargon, Domain mismatch, Inconsistent terminology), and the per-dimension sub-scores and a summary are returned in the metadata.
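The weighted mean can be written out as a short function. This is a sketch rather than the toolkit's implementation: the names `WEIGHTS` and `aggregate` and the dictionary keys are assumptions, and because the description does not say whether the single-score fallback is also clamped, the sketch passes it through unchanged.

```python
from typing import Mapping, Optional

# Weights from the description; the key names are hypothetical.
WEIGHTS = {"misuse": 0.30, "decorative": 0.30, "domain": 0.20, "consistency": 0.20}


def aggregate(sub_scores: Mapping[str, float],
              overall: Optional[float] = None) -> Optional[float]:
    """Weighted mean over the dimensions actually returned, clamped to 0..5.

    Falls back to the model's single overall score when no rubric
    dimension was returned (assumption: the fallback is not clamped).
    """
    returned = {d: float(s) for d, s in sub_scores.items() if d in WEIGHTS}
    if returned:
        weight_sum = sum(WEIGHTS[d] for d in returned)
        weighted = sum(WEIGHTS[d] * s for d, s in returned.items()) / weight_sum
        return min(5.0, max(0.0, weighted))
    return overall
```

With all four dimensions returned as misuse 4, decorative 2, domain 1, consistency 3, the score is (0.30·4 + 0.30·2 + 0.20·1 + 0.20·3) / 1.00 = 2.6. If only misuse 4 and domain 1 come back, the weights renormalize: (0.30·4 + 0.20·1) / 0.50 = 2.8. Dividing by the sum of the weights present keeps a partial response on the same 0 to 5 scale.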

Score thresholds

Score     Meaning
0 to 1    Terminology is correct, precise, in-field, and used consistently.
2 to 3    Some decorative jargon, a loosely used term, or minor inconsistency.
4 to 5    Repeated misuse, hollow jargon, out-of-field vocabulary, or inconsistent naming.
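Because the weighted score can be fractional while the bands are stated on whole numbers, a reader mapping a score such as 2.6 to a band needs some rounding rule. The sketch below assumes rounding half up to the nearest integer before the lookup; that rule and the function name `band` are assumptions, since the table does not say how values between bands are treated.

```python
import math


def band(score: float) -> str:
    """Map a 0..5 score to the meaning given in the thresholds table.

    Assumption: fractional scores are rounded half up to the nearest
    integer first (so 1.5 falls in "2 to 3" and 3.5 in "4 to 5").
    """
    nearest = math.floor(score + 0.5)
    if nearest <= 1:
        return "Terminology is correct, precise, in-field, and used consistently."
    if nearest <= 3:
        return "Some decorative jargon, a loosely used term, or minor inconsistency."
    return "Repeated misuse, hollow jargon, out-of-field vocabulary, or inconsistent naming."
```

`math.floor(score + 0.5)` is used instead of the built-in `round` because `round` resolves exact halves to the nearest even integer, which would treat 2.5 and 3.5 differently.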

Why this matters

When language models took up scientific writing, the clearest trace they left was in vocabulary: a study of more than fifteen million biomedical abstracts found an abrupt rise after 2022 in a small set of style words, with the 2024 excess vocabulary dominated by style words such as intricate and underscore rather than content words. Catching those words by frequency is effective and belongs with the model-specific vocabulary checks. What frequency cannot tell us is whether a given term is doing real work, whether it is the precise term the science calls for, or whether it has been stretched past its meaning, a failure that shows up as models generalize and summarize. Those are judgements about meaning, and they are where machine-assisted writing tends to go wrong in a way a word list misses. A check that asks these questions also ages better than a fixed list: the overused words shift from year to year as writers and models adapt, so a check that asks whether a term means something stays useful as the particular words change.

Limitations

L4-Terminology depends on a language model, so it is slower and costlier than the deterministic checks, and its judgement carries the model's biases. Domain-specific correctness is the hardest dimension to settle from memory, which is why it is confidence-gated and the indicator is tuned to abstain rather than over-flag; even so, it can miss a misused term or question a correct one, especially in narrow specialities. Its findings are assessments rather than proofs. Results vary between runs and between models, and the text is truncated to 8000 characters, so a long document is judged on its opening. It judges the use of terms, not whether a claim built on them is true, which is the work of the fact and citation checks.

Theoretical background

L4-Terminology draws on the excess-vocabulary literature, which measures how the lexical surface of scientific writing shifted with the model era and identifies the style words that rose most; the indicator uses that finding to seed the decorative-jargon dimension while moving the judgement from frequency to meaning. The overgeneralization strand follows work showing that models systematically broaden claims and terms when summarizing research, which is the failure the misuse dimension targets. The decision to judge meaning rather than presence is motivated by the observation that model-associated vocabulary decays as it is recognized and adapted away, so a frequency list loses sensitivity over time while a meaning-level check does not. The use of a model as the judge sits within the LLM-as-judge literature, whose results on rubric design and judge reliability shape the criterion-separated rubric and the abstention rule.

References

  1. Kobak D, González-Márquez R, Horvát EÁ, Lause J. Delving into LLM-assisted writing in biomedical publications through excess vocabulary. Science Advances. 2025. https://arxiv.org/abs/2406.07016
  2. Juzek TS, Ward ZB. Why does ChatGPT delve so much? Exploring the sources of lexical overrepresentation in large language models. Proceedings of COLING 2025. 2024. https://arxiv.org/abs/2412.11385
  3. Peters U, Chin-Yee B. Generalization bias in large language model summarization of scientific research. arXiv preprint arXiv:2504.00025. 2025. https://arxiv.org/abs/2504.00025
  4. Thelwall M, Kousha K. Have LLM-associated terms increased in article full texts in all fields? arXiv preprint arXiv:2604.07565. 2026. https://arxiv.org/abs/2604.07565
  5. Schroeder K, Wood-Doughty Z. Can you trust LLM judgments? Reliability of LLM-as-a-judge. arXiv preprint arXiv:2412.12509. 2024. https://arxiv.org/abs/2412.12509