LLM Tokenization Bias
Detects digit patterns in IPD that reflect LLM tokenization preferences: certain digit sequences are more frequent because they correspond to common LLM tokens.
Technical description
Large language models (LLMs) encode numbers with sub-word tokenizers such as byte-pair encoding, where common short numeric strings map to single tokens while irregular high-precision values must be assembled from several tokens. This inductive bias makes generated numbers lean toward simple, round, and repetitive forms that genuine instrument readings rarely take. D25 flattens every numeric value from the numeric columns of the individual-patient data (IPD) and measures four tendencies: the share that are integers, one-decimal, or simple fractions; the share whose significant decimals contain three or more consecutive identical digits; among genuinely continuous columns, the share whose last meaningful digit is 0 or 5; and the number of distinct decimal-precision levels present.
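The four tendencies can be sketched in a few lines of Python. This is a minimal illustration, not the D25 implementation: the function names, the two tolerances, the use of the sample standard deviation, the comparison of the fractional part against the simple-fraction list, and the reading of "last meaningful digit" as the final digit once trailing decimal zeros are dropped are all assumptions made for the example. Values are assumed to be finite.

```python
import re
from decimal import Decimal
from statistics import stdev

SIMPLE_FRACTIONS = (0.1, 0.2, 0.25, 0.33, 0.4, 0.5, 0.6, 0.67, 0.75, 0.8, 0.9)
FLOAT_EPS = 1e-9      # assumed: what counts as "effectively" integer or one-decimal
FRACTION_TOL = 0.005  # assumed: the text says "within tolerance" without a number
RUN_OF_THREE = re.compile(r"(\d)\1\1")  # three or more identical digits in a row


def split_digits(value):
    """Return (integer digits, decimal digits with trailing zeros stripped)."""
    text = format(Decimal(repr(abs(float(value)))), "f")
    whole, _, frac = text.partition(".")
    return whole, frac.rstrip("0")


def is_nice(value):
    if abs(value - round(value)) < FLOAT_EPS:     # effectively an integer
        return True
    if abs(value - round(value, 1)) < FLOAT_EPS:  # unchanged at one decimal place
        return True
    frac = abs(value) % 1.0                       # assumed: test the fractional part
    return any(abs(frac - f) < FRACTION_TOL for f in SIMPLE_FRACTIONS)


def has_digit_run(value):
    return RUN_OF_THREE.search(split_digits(value)[1]) is not None


def ends_in_0_or_5(value):
    whole, frac = split_digits(value)
    last = frac[-1] if frac else whole[-1]        # assumed "last meaningful digit"
    return last in "05"


def d25_metrics(columns, min_values=50):
    """columns maps a column name to its list of numeric values."""
    pool = [v for col in columns.values() for v in col]
    if len(pool) < min_values:
        return None                               # too few values: skipped
    # Round-five rate uses only columns with standard deviation above 1.0.
    continuous = [v for col in columns.values()
                  if len(col) > 1 and stdev(col) > 1.0 for v in col]
    return {
        "n": len(pool),
        "nice_rate": sum(map(is_nice, pool)) / len(pool),
        "repeat_rate": sum(map(has_digit_run, pool)) / len(pool),
        "round_five_rate": (sum(map(ends_in_0_or_5, continuous)) / len(continuous)
                            if continuous else None),
        "precision_levels": len({len(split_digits(v)[1]) for v in pool}),
    }
```

Calling d25_metrics with a dictionary of columns returns None for pools under fifty values, otherwise the value count and the four metrics. The min_values parameter is exposed only so the sketch can be tried on small inputs.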
How it works
Layer 2 (contextual): flattens all numeric IPD columns into one pool and requires at least fifty values. Four metrics contribute to the score, which is capped at 5.0:
- Nice-number rate: the fraction of values that are effectively integers, equal to themselves when rounded to one decimal place, or within tolerance of a simple fraction from {0.1, 0.2, 0.25, 0.33, 0.4, 0.5, 0.6, 0.67, 0.75, 0.8, 0.9}. It adds 2.0 above seventy percent, or 1.0 above fifty-five percent. A further 0.5 is added when the rate exceeds fifty-five percent and the nice-number count is also improbable under a conservative 0.10 null rate, that is, when the one-sided binomial tail probability is below one in ten thousand.
- Digit-repetition rate: the fraction of values whose significant decimals carry three or more consecutive identical digits, matched after trailing-zero padding is stripped so that fixed-width formatting cannot manufacture a false run of zeros. It adds 1.5 above ten percent.
- Round-five rate: computed only over columns with standard deviation above 1.0, the fraction of values whose last meaningful digit is 0 or 5. It adds 1.0 above forty-five percent.
- Distinct-precision count: two or fewer distinct decimal-precision levels adds 0.5.
Metadata records the value count, the four metrics, a nice-number binomial p-value against the conservative 0.10 chance level, and a terminal-digit binomial p-value testing the round-five rate against the 0.2 chance level.
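The additive scoring described above can be sketched as follows. The function names and argument layout are hypothetical, and the binomial tail is computed with the standard library in log space rather than with whatever routine D25 actually uses; the thresholds and increments are the ones stated in the text.

```python
import math


def binom_upper_tail(k, n, p):
    """One-sided tail P(X >= k) for X ~ Binomial(n, p), summed via log-pmf terms."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def d25_score(n, nice_count, repeat_rate, round_five_rate, precision_levels):
    """Combine the four metrics into a 0-5 score (hypothetical helper)."""
    nice_rate = nice_count / n
    score = 0.0
    if nice_rate > 0.70:
        score += 2.0
    elif nice_rate > 0.55:
        score += 1.0
    # Extra 0.5 when the count is also improbable under the 0.10 null rate.
    if nice_rate > 0.55 and binom_upper_tail(nice_count, n, 0.10) < 1e-4:
        score += 0.5
    if repeat_rate > 0.10:
        score += 1.5
    # round_five_rate is None when no column has standard deviation above 1.0.
    if round_five_rate is not None and round_five_rate > 0.45:
        score += 1.0
    if precision_levels <= 2:
        score += 0.5
    return min(score, 5.0)
```

The increments sum to 5.5 when every condition fires, so the 5.0 cap only takes effect in that one case.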
Why this matters
When models are used to generate data, they leave numeric fingerprints that differ from those of real measurement. Tokenization assigns single tokens to round and short numeric strings, so simple values are systematically easier for a model to emit, and a model with a data-analysis tool can fabricate a clinical dataset that looks superficially plausible. A tendency to favour particular digits is also a long-known signature of invented numbers. This combined fingerprint is useful where Benford and heaping tests are less informative.
Score thresholds
- 0-1: Digit patterns consistent with genuine measurement
- 2-3: An elevated rate of nice numbers, digit repetition, or terminal 0/5 preference
- 4-5: Several tokenizer-friendly tendencies together, consistent with LLM-generated or heavily coarsened data
Limitations
The four tendencies are heuristic fingerprints, not a model of any specific tokenizer, so a high score indicates resemblance to machine-generated or heavily rounded numbers rather than proving which tool produced it. Coarse instruments, rating scales, and rounded reporting can raise the nice-number and terminal-0/5 rates without fabrication, so the round-five rate is restricted to columns with standard deviation above 1.0. The digit-repetition test reads the value after stripping trailing zeros, so it depends on storage precision. The pool mixes all numeric columns, so one dominant column can drive the metrics, and datasets under fifty values are skipped. Terminal-digit preference in reported tables is indicator S08 and natural heaping in the IPD is indicator D18; D25 focuses on the combined machine-generation fingerprint across the IPD.
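The storage-precision dependence of the digit-repetition test can be illustrated with a small sketch. It assumes digits are read from the shortest float64 representation of each value, which the text does not confirm; the helper names are hypothetical.

```python
import re
import struct
from decimal import Decimal

RUN_OF_THREE = re.compile(r"(\d)\1\1")  # three or more identical digits in a row


def decimal_digits(value):
    """Decimal digits of a value with trailing zeros stripped."""
    text = format(Decimal(repr(float(value))), "f")
    return text.partition(".")[2].rstrip("0")


def has_digit_run(value):
    return RUN_OF_THREE.search(decimal_digits(value)) is not None


def through_float32(value):
    """Round-trip a value through 32-bit storage, as a single-precision file would."""
    return struct.unpack("f", struct.pack("f", value))[0]


as_recorded = 0.1                  # digits "1": no run
as_stored = through_float32(0.1)   # 0.10000000149011612: run of zeros
as_computed = 0.1 + 0.2            # 0.30000000000000004: run of zeros

for v in (as_recorded, as_stored, as_computed):
    print(v, has_digit_run(v))
```

The same recorded reading is flagged or not depending on how it was stored or derived, so a dataset saved in single precision, or one containing computed columns, may show an inflated repetition rate that reflects floating-point noise rather than generation.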
References
- Singh A, Strouse DJ. (2024). Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs. arXiv preprint arXiv:2402.14903
- Taloni A, Scorcia V, Giannaccare G. (2023). Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology
- Mosimann JE, Wiseman CV, Edelman RE. (1995). Data fabrication: can people generate random digits? Accountability in Research
- Mosimann JE, Dahlberg JE, Davidian NM, Krueger JW. (2002). Terminal digits and the examination of questioned data. Accountability in Research 9(2):75-92
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952