D25 | Statistical analysis | Fabrication Extended | Layer 2 (Contextual)

LLM Tokenization Bias

Detects digit patterns in individual-patient data (IPD) that reflect LLM tokenization preferences: certain digit sequences appear more often because they correspond to common LLM tokens.

Technical description

Large language models (LLMs) encode numbers with sub-word tokenizers such as byte-pair encoding, where common short numeric strings map to single tokens while irregular high-precision values must be assembled from several tokens. This inductive bias makes generated numbers lean toward simple, round, and repetitive forms that genuine instrument readings rarely take. D25 flattens every numeric value from the numeric columns of the individual-patient data (IPD) and measures four tendencies: the share that are integers, one-decimal, or simple fractions; the share whose significant decimals contain three or more consecutive identical digits; among genuinely continuous columns, the share whose last meaningful digit is 0 or 5; and the number of distinct decimal-precision levels present.
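As a rough illustration, the three tendencies that depend on how a value is written (digit repetition, terminal digit, and precision levels) can be sketched on decimal strings. The helper names are hypothetical. The sketch also assumes that "last meaningful digit" means the last digit left after trailing decimal zeros are stripped, and that precision is counted on the stripped decimals; the description above does not pin either down.

```python
import re


def stripped_decimals(text: str) -> str:
    """Digits after the decimal point, with trailing-zero padding removed."""
    _, _, frac = text.strip().partition(".")
    return frac.rstrip("0")


def has_digit_run(text: str) -> bool:
    """True if the stripped decimals hold 3+ consecutive identical digits."""
    return re.search(r"(\d)\1{2}", stripped_decimals(text)) is not None


def last_meaningful_digit(text: str) -> str:
    """Assumed reading: last stripped decimal digit, else the units digit."""
    frac = stripped_decimals(text)
    if frac:
        return frac[-1]
    return text.strip().partition(".")[0][-1]


def precision_levels(values: list[str]) -> int:
    """Number of distinct decimal-place counts (assumed: after stripping)."""
    return len({len(stripped_decimals(v)) for v in values})


# Hypothetical pool, far smaller than the fifty values D25 requires.
pool = ["12.5000", "7.3331", "98.60", "140", "0.25", "3.14159"]

repetition_rate = sum(has_digit_run(v) for v in pool) / len(pool)
round_five_rate = sum(last_meaningful_digit(v) in ("0", "5") for v in pool) / len(pool)
levels = precision_levels(pool)
```

Note that "12.5000" does not count as a digit run, because the padded zeros are removed before matching.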

How it works

Layer 2 (contextual): flattens all numeric IPD columns into one pool and requires at least fifty values.

The nice-number rate is the fraction of values that are effectively integers, equal to themselves when rounded to one decimal place, or within tolerance of a simple fraction from {0.1, 0.2, 0.25, 0.33, 0.4, 0.5, 0.6, 0.67, 0.75, 0.8, 0.9}. It adds 2.0 above seventy percent or 1.0 above fifty-five percent. A further 0.5 is added when the rate exceeds fifty-five percent and the nice-number count is also improbable under a conservative 0.10 null rate, meaning a one-sided binomial tail probability below one in ten thousand.

The digit-repetition rate is the fraction of values whose significant decimals carry three or more consecutive identical digits, matched after trailing-zero padding is stripped so fixed-width formatting cannot manufacture a false run of zeros; it adds 1.5 above ten percent.

The round-five rate, computed only over columns with standard deviation above 1.0, is the fraction of values whose last meaningful digit is 0 or 5; it adds 1.0 above forty-five percent.

A distinct-precision count of two or fewer adds 0.5. The total is capped at 5.0.

Metadata records the value count, the four metrics, a nice-number binomial p-value against a conservative 0.10 chance level, and a terminal-digit binomial p-value testing the round-five rate against the 0.2 chance level.
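The scoring rules above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names, the tolerance of 1e-6, and the choice to test the fractional part against the simple-fraction set are all assumptions, since the text does not specify them. The binomial tail is summed in log space with the standard library so that large pools do not overflow.

```python
from math import exp, lgamma, log

SIMPLE_FRACTIONS = (0.1, 0.2, 0.25, 0.33, 0.4, 0.5, 0.6, 0.67, 0.75, 0.8, 0.9)
TOLERANCE = 1e-6   # assumption: the tolerance is not stated in the text
MIN_VALUES = 50    # pools smaller than this are skipped


def is_nice(x: float) -> bool:
    """Integer, one-decimal value, or fractional part near a simple fraction."""
    if abs(x - round(x)) < TOLERANCE or abs(x - round(x, 1)) < TOLERANCE:
        return True
    frac = abs(x) % 1.0  # assumption: the fraction test uses the fractional part
    return any(abs(frac - f) < TOLERANCE for f in SIMPLE_FRACTIONS)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """One-sided tail P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def d25_score(nice_rate, nice_p, repetition_rate, round_five_rate, precision_levels):
    """Combine the four metrics using the thresholds stated in the text."""
    score = 0.0
    if nice_rate > 0.70:
        score += 2.0
    elif nice_rate > 0.55:
        score += 1.0
    if nice_rate > 0.55 and nice_p < 1e-4:
        score += 0.5
    if repetition_rate > 0.10:
        score += 1.5
    # round_five_rate may be None when no column has standard deviation > 1.0
    if round_five_rate is not None and round_five_rate > 0.45:
        score += 1.0
    if precision_levels <= 2:
        score += 0.5
    return min(score, 5.0)


# Hypothetical pool: 40 integers plus 20 irregular four-decimal values.
pool = [float(i) for i in range(40)] + [1.2347 + 0.0173 * i for i in range(20)]
if len(pool) >= MIN_VALUES:
    nice_count = sum(is_nice(v) for v in pool)
    nice_rate = nice_count / len(pool)
    nice_p = binom_upper_tail(nice_count, len(pool), 0.10)
    # The other three metrics are placeholder inputs for this illustration.
    score = d25_score(nice_rate, nice_p, repetition_rate=0.0,
                      round_five_rate=0.30, precision_levels=3)
```

The same `binom_upper_tail` helper would serve for the terminal-digit p-value by passing the count of terminal 0/5 values and the 0.2 chance level.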

Why this matters

When models are used to generate data, they leave numeric fingerprints distinct from those of measurement. Tokenization assigns single tokens to round and short numeric strings, so simple values are systematically easier for a model to emit, and a model with a data-analysis tool can fabricate a clinical dataset that looks superficially plausible. The tendency to favour particular digits is also a long-known signature of invented numbers. This combined fingerprint is useful where Benford and heaping tests are less informative.

Score thresholds

0-1: Digit patterns consistent with genuine measurement
2-3: An elevated rate of nice numbers, digit repetition, or terminal 0/5 preference
4-5: Several tokenizer-friendly tendencies together, consistent with LLM-generated or heavily coarsened data
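
A small sketch of how a total might be mapped to these bands. Because the score moves in half-point steps, totals such as 1.5 or 3.5 fall between the listed ranges; this sketch assumes they belong to the lower band, which the thresholds above do not specify. The function name is hypothetical.

```python
def d25_band(score: float) -> str:
    """Map a D25 total to its interpretation band (lower-bound cutoffs assumed)."""
    if not 0.0 <= score <= 5.0:
        raise ValueError("D25 totals lie in the range 0.0 to 5.0")
    if score >= 4.0:
        return "Several tokenizer-friendly tendencies together"
    if score >= 2.0:
        return "Elevated nice numbers, digit repetition, or terminal 0/5 preference"
    return "Digit patterns consistent with genuine measurement"
```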

Limitations

The four tendencies are heuristic fingerprints, not a model of any specific tokenizer, so a high score indicates resemblance to machine-generated or heavily rounded numbers rather than proving which tool produced the data. Coarse instruments, rating scales, and rounded reporting can raise the nice-number and terminal-0/5 rates without fabrication, which is why the round-five rate is restricted to columns with standard deviation above 1.0. The digit-repetition test reads each value after stripping trailing zeros, so its result depends on storage precision. The pool mixes all numeric columns, so one dominant column can drive the metrics, and datasets with fewer than fifty values are skipped. Terminal-digit preference in reported tables is covered by indicator S08 and natural heaping in the IPD by indicator D18; D25 focuses on the combined machine-generation fingerprint across the IPD.
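
The pooling limitation can be seen in a toy example. The column names and values below are hypothetical, and the integer share stands in for the full nice-number rate to keep the sketch short: a long rating-scale column dominates the pool even though the continuous column contains no integers at all.

```python
# Hypothetical dataset: a 0-10 rating scale with 90 rows and a
# continuous lab value with 10 rows.
columns = {
    "pain_score": [float(v % 11) for v in range(90)],
    "creatinine": [0.71 + 0.013 * i for i in range(10)],
}


def integer_rate(values):
    """Share of values that are whole numbers (stand-in for the nice-number rate)."""
    return sum(float(v).is_integer() for v in values) / len(values)


# Pooled rate, as D25 computes it over the flattened columns.
pooled = [v for col in columns.values() for v in col]
pooled_rate = integer_rate(pooled)

# Per-column rates, which show where the pooled figure comes from.
per_column = {name: integer_rate(vals) for name, vals in columns.items()}
```

Here the pooled rate is 0.9, above the seventy percent threshold, although it is driven entirely by the rating-scale column. Checking per-column rates alongside the pooled figure is one way to spot this when reviewing a flagged dataset.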