LLM Tokenization Bias
Looks for the fingerprints a large language model (LLM) leaves when it invents numbers. Because of how these models split numbers into tokens, their output leans toward round and simple values, repeats the same digit in patterns such as 1.111 or 2.222, favours endings in 0 or 5, and uses only a few distinct levels of decimal precision. The indicator measures four such tendencies across all numeric values of the individual-patient data (IPD) and scores how strongly the data carries them. It works on the IPD.
Technical description
D25 is a contextual screen for the numeric signature of machine-generated data, in particular data produced by a large language model (LLM). LLMs encode text, including numbers, with sub-word tokenizers such as byte-pair encoding, where common short numeric strings map to single tokens while irregular high-precision values must be assembled from several tokens. The resulting inductive bias makes generated numbers lean toward simple, round, and repetitive forms that genuine instrument readings rarely take. The indicator pools every value from the numeric columns of the individual-patient data (IPD), requires at least fifty values, and computes four metrics: the share of values that are integers, one-decimal values, or simple fractions; the share whose significant decimals contain three or more consecutive identical digits; among genuinely continuous columns, the share whose last meaningful digit is 0 or 5; and the number of distinct decimal-precision levels present. Each metric contributes to a score capped at 5.0.
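The pooling step and the fifty-value gate can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names, the `{column: values}` table layout, and the rule that a column is numeric only when every non-missing entry is a number are all assumptions made for the example.

```python
import math

MIN_VALUES = 50  # minimum pool size stated in the description


def is_number(x):
    """True for finite ints and floats; bools are deliberately excluded."""
    return (isinstance(x, (int, float)) and not isinstance(x, bool)
            and math.isfinite(x))


def pool_numeric_values(table):
    """Flatten all numeric columns of a {column: values} table into one list.

    Assumption of this sketch: a column is numeric if every non-missing
    entry is a number. Returns None when the pool holds fewer than
    MIN_VALUES values, mirroring the indicator being skipped.
    """
    pool = []
    for values in table.values():
        # Drop missing entries (None or NaN) before judging the column.
        present = [v for v in values if v is not None
                   and not (isinstance(v, float) and math.isnan(v))]
        if present and all(is_number(v) for v in present):
            pool.extend(float(v) for v in present)
    return pool if len(pool) >= MIN_VALUES else None
```

Returning `None` rather than an empty list keeps "too little data to judge" distinct from "data judged and found clean".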
How it works
All numeric columns are flattened into one pool of values, and four metrics are computed from it.
The nice-number rate is the fraction of values that are effectively integers, equal to themselves when rounded to one decimal place, or within tolerance of a simple fraction drawn from {0.1, 0.2, 0.25, 0.33, 0.4, 0.5, 0.6, 0.67, 0.75, 0.8, 0.9}; it adds 2.0 above seventy percent, or 1.0 above fifty-five percent. The nice-number count is also tested for significance against a binomial distribution with a conservative null rate of 0.10, the assumed chance that a genuine continuous value reads as nice. When the rate exceeds fifty-five percent and the one-sided binomial tail is below one in ten thousand, a further 0.5 is added, so the score rewards a count that is both high and statistically beyond chance rather than a high point estimate that a small sample could produce by luck.
The digit-repetition rate is the fraction of values whose significant decimals contain three or more consecutive identical digits. It is found by matching a repeated-decimal-digit pattern against the value's representation after trailing-zero padding is stripped, so that fixed-width formatting does not manufacture a false run of zeros; it adds 1.5 above ten percent.
The round-five rate is computed only over columns whose standard deviation exceeds 1.0, which are treated as genuinely continuous. It is the fraction of those values whose last meaningful digit is 0 or 5, and it adds 1.0 above forty-five percent.
The unique-precision count is the number of distinct decimal-place counts across the pool; a count of two or fewer adds 0.5.
The total is capped at 5.0, and each triggered rate emits a finding reporting its value and threshold. The metadata records the value count, all four metrics, a nice-number binomial p-value that tests the nice-number count against the conservative 0.10 chance level, and a terminal-digit binomial p-value that tests the round-five rate against the 0.2 chance level.
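The scoring rules above can be sketched in Python using only the standard library. The thresholds and increments come from the description; everything else is an assumption of this sketch: the function names, the ten-decimal fixed-width formatting, the 0.005 tolerance, matching simple fractions against the fractional part of the value, reading the last meaningful digit as the last character of the trimmed representation, and the use of the sample standard deviation. Findings are omitted.

```python
import math
import re
import statistics

FRACTIONS = (0.1, 0.2, 0.25, 0.33, 0.4, 0.5, 0.6, 0.67, 0.75, 0.8, 0.9)
TOL = 0.005                          # hypothetical simple-fraction tolerance
REPEAT = re.compile(r"(\d)\1{2,}")   # three or more identical digits in a row


def trimmed(v):
    """Fixed-width text of v with trailing-zero padding stripped."""
    return f"{v:.10f}".rstrip("0").rstrip(".")


def decimals(v):
    """Significant decimal digits of v ('' for an integer)."""
    return trimmed(v).partition(".")[2]


def is_nice(v):
    """Integer, one-decimal value, or fractional part near a simple fraction."""
    if abs(v - round(v)) < 1e-9 or abs(v - round(v, 1)) < 1e-9:
        return True
    frac = abs(v) % 1
    return any(abs(frac - f) <= TOL for f in FRACTIONS)


def binom_tail(k, n, p):
    """One-sided P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 1.0
    logs = [math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log(1 - p)
            for j in range(k, n + 1)]
    top = max(logs)
    return min(1.0, math.exp(top) * sum(math.exp(x - top) for x in logs))


def d25_score(columns):
    """Score a {column: [float, ...]} mapping; None if under fifty values."""
    pool = [v for col in columns.values() for v in col]
    n = len(pool)
    if n < 50:
        return None
    nice = sum(is_nice(v) for v in pool)
    nice_rate = nice / n
    p_nice = binom_tail(nice, n, 0.10)
    rep_rate = sum(bool(REPEAT.search(decimals(v))) for v in pool) / n
    # Only columns with standard deviation above 1.0 count as continuous.
    cont = [v for col in columns.values()
            if len(col) > 1 and statistics.stdev(col) > 1.0 for v in col]
    ends = sum(trimmed(v)[-1] in "05" for v in cont)
    round5_rate = ends / len(cont) if cont else 0.0
    p_terminal = binom_tail(ends, len(cont), 0.2) if cont else 1.0
    precisions = len({len(decimals(v)) for v in pool})

    score = 0.0
    if nice_rate > 0.70:
        score += 2.0
    elif nice_rate > 0.55:
        score += 1.0
    if nice_rate > 0.55 and p_nice < 1e-4:
        score += 0.5
    if rep_rate > 0.10:
        score += 1.5
    if round5_rate > 0.45:
        score += 1.0
    if precisions <= 2:
        score += 0.5
    return {"score": min(score, 5.0), "n": n, "nice_rate": nice_rate,
            "repetition_rate": rep_rate, "round5_rate": round5_rate,
            "unique_precision": precisions, "p_nice": p_nice,
            "p_terminal": p_terminal}
```

The binomial tail is summed in log space so that large pools do not underflow to zero, and stripping the padding before matching keeps an integer such as 10 from showing a run of zeros in its decimals.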
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Digit patterns consistent with genuine measurement. |
| 2 to 3 | An elevated rate of nice numbers, digit repetition, or terminal 0/5 preference. |
| 4 to 5 | Several tokenizer-friendly tendencies together, consistent with LLM-generated or heavily coarsened data. |
Why this matters
The way a model tokenizes numbers is now understood to shape the numbers it produces. Singh and Strouse show that frontier LLMs assign single tokens to common one-, two-, and three-digit strings and that this tokenization carries inductive biases into the numeric output, so that simple and round values are systematically easier for the model to emit than irregular ones [1]. That bias becomes a fabrication risk when models are turned to generating data: Taloni and colleagues demonstrated that an LLM with a data-analysis tool can fabricate a clinical dataset that looks superficially plausible, which makes machine-specific numeric fingerprints valuable for detection [2]. The tendency to favour particular digits is not new to machines. Mosimann and colleagues showed decades ago that people asked to invent numbers cannot produce uniform digits and lean toward preferred values, establishing that questioned data should be examined for non-random digit behaviour [3]. D25 brings these together: nice-number, repeated-digit, terminal-0/5, and low-precision-diversity tendencies are each weak alone but jointly characteristic of values that were generated rather than measured. Recent forensic re-analyses, scoping reviews, and trustworthiness instruments place digit-pattern and terminal-digit checks among the standard screens for fabricated and machine-generated data [4, 5, 6, 7, 8].
Limitations
The four tendencies are heuristic fingerprints, not a model of any specific tokenizer, so a high score indicates that the data resembles machine-generated or heavily rounded numbers rather than proving which tool produced it. Genuinely coarse instruments, rating scales, and rounded reporting can raise the nice-number and terminal-0/5 rates without fabrication. The round-five rate is therefore restricted to columns with standard deviation above 1.0 to exclude inherently discrete variables, and a flag remains a prompt to inspect provenance rather than a verdict. The digit-repetition test reads the value's decimal representation after stripping trailing zeros, so it depends on the precision at which values were stored and can miss repetition beyond ten decimal places. The pool mixes all numeric columns, so a single dominant column can drive the metrics. The minimum of fifty values means small datasets are skipped. Terminal-digit preference within reported tables is covered by indicator S08 and natural heaping in the IPD by indicator D18, so D25 focuses on the combined machine-generation fingerprint across the IPD.
Theoretical background
D25 rests on the gap between how measurement produces numbers and how a language model produces them. A real continuous measurement is the sum of a true quantity and noise, so its trailing recorded digits, those near the limit of the instrument's resolution, are effectively uniform, and the chance of a long run of one digit or of an exact simple fraction is small. A language model instead emits numbers token by token, and because its tokenizer represents round and short numeric strings as single high-frequency tokens, the path of least resistance favours those forms; the model also tends to repeat structure, yielding patterns such as 1.111 or 2.222 and a narrow set of precision levels. The four metrics each estimate one face of this contrast: the nice-number rate captures the pull toward single-token values, the digit-repetition rate captures the structural copying of a digit, the terminal-0/5 rate captures heaping at the coarsest reporting grid, and the count of distinct precision levels captures the impoverished variety of decimal lengths in generated data. None is decisive alone, because genuine data can show any one of them for benign reasons, but their conjunction is improbable under real measurement, which is why the score accumulates across distinct tendencies and is read as a combined fingerprint rather than a single test. Correcting the digit-repetition test for trailing-zero padding is essential to this logic, because otherwise fixed-width formatting would imprint a run of zeros on every low-precision value and the metric would measure the formatter rather than the data. The two binomial tests on the nice-number and terminal-digit counts add a significance layer to the otherwise threshold-based fingerprint, distinguishing a rate that is genuinely improbable under a continuous-measurement null from one that a modest sample size inflates by chance.
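The one-sided tail behind both binomial tests can be written out explicitly. The symbols are notation introduced here for illustration: n is the number of values tested, k the observed count of nice or terminal-0/5 values, and p_0 the null rate.

```latex
p_{\text{tail}} \;=\; \Pr(X \ge k)
  \;=\; \sum_{j=k}^{n} \binom{n}{j}\, p_0^{\,j}\, (1 - p_0)^{\,n-j},
\qquad X \sim \mathrm{Binomial}(n,\, p_0)
```

For the nice-number test p_0 = 0.10, and for the terminal-digit test p_0 = 0.2, which is the share that 0 and 5 would hold if all ten terminal digits were equally likely. Because the tail shrinks as n grows for a fixed observed rate, the same rate can be unremarkable in a small pool and highly significant in a large one.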
References
1. Singh A, Strouse DJ. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. arXiv preprint. 2024. arXiv:2402.14903. https://arxiv.org/abs/2402.14903
2. Taloni A, Scorcia V, Giannaccare G. Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
3. Mosimann JE, Wiseman CV, Edelman RE. Data fabrication: can people generate random digits? Accountability in Research. 1995;4(1):31-55. DOI: 10.1080/08989629508573866
4. Mosimann JE, Dahlberg JE, Davidian NM, Krueger JW. Terminal digits and the examination of questioned data. Accountability in Research. 2002;9(2):75-92. DOI: 10.1080/08989620212969
5. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
6. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
7. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
8. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938