Digit Sequence Duplicates
Looks at the digits that make up a dataset's numbers and checks whether the same short digit patterns repeat too often. Genuinely measured values draw their digits from the full range of measurement, so their two- and three-digit combinations are varied. Machine-generated numbers tend to reuse familiar combinations, such as 14, 71, or 41 from constants like 3.14, 2.71, and 1.41, and to repeat whole values, lowering the variety of digit patterns. The indicator measures the entropy of the digit pairs and triples and the rate of exact-duplicate values, flagging data that is too repetitive. It works on the individual-patient data.
Technical description
D21 is a contextual screen on the digit-level structure of a dataset's numeric values. It pools the numeric values across columns and requires at least fifty pooled values. Integer-valued low-cardinality columns, such as binary flags and coded categories, are excluded from the pool because their values legitimately repeat and would otherwise inflate the duplicate rate. For each value it forms a clean digit string and extracts all overlapping two-digit and three-digit subsequences, accumulating their frequencies. It then computes the Shannon entropy of the bigram and trigram distributions, in bits, where higher entropy means more varied digit combinations, and the rate of exact-duplicate values, the fraction of values that coincide with another value when rounded to six decimal places. Low bigram entropy, low trigram entropy, and a high duplicate rate each contribute to the score, since each reflects digit patterns less varied than genuine measurement produces.
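The pooling and duplicate-rate steps described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names, the dict-of-lists input, and in particular the `MAX_CODED_LEVELS` cutoff for "low cardinality" are assumptions, since the description gives no number for it.

```python
from collections import Counter

MIN_POOLED_VALUES = 50  # "at least fifty pooled values", from the description
MAX_CODED_LEVELS = 10   # hypothetical cutoff; the description states no number


def is_coded_column(values, max_levels=MAX_CODED_LEVELS):
    """True for integer-valued columns with few distinct values (flags, codes)."""
    all_integer = all(float(v).is_integer() for v in values)
    return all_integer and len(set(values)) <= max_levels


def pool_values(columns):
    """Pool numeric values across columns, skipping coded columns.

    `columns` maps a column name to a list of finite numbers.
    Returns None when fewer than MIN_POOLED_VALUES values remain.
    """
    pooled = []
    for values in columns.values():
        if values and not is_coded_column(values):
            pooled.extend(values)
    return pooled if len(pooled) >= MIN_POOLED_VALUES else None


def duplicate_rate(pooled):
    """Fraction of values whose six-decimal rounding occurs more than once."""
    counts = Counter(round(v, 6) for v in pooled)
    return sum(c for c in counts.values() if c > 1) / len(pooled)
```

In this sketch a binary `sex` column would be skipped while continuous columns such as weight or height are pooled, and `duplicate_rate([1.0, 1.0, 2.0, 3.0])` counts both copies of the repeated value.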
How it works
The pooled values are converted to digit strings via a six-significant-figure representation with non-digits and leading zeros removed, and their bigrams and trigrams are counted. The Shannon entropy of each distribution is the negative sum of the probabilities times their base-two logarithms [4]; it is also reported as a Shannon efficiency, the entropy divided by the maximum entropy that a uniform distribution over the possible n-grams would reach. The duplicate rate is the fraction of all pooled values whose rounded value occurs more than once. The score adds 2.0 when the bigram entropy is below 2.5 bits or 1.0 when below 3.0, adds 1.5 when the trigram entropy is below 4.5 bits, and adds 1.0 when the duplicate rate exceeds twenty-five percent. The indicator also computes the Rényi order-two collision entropy of the bigram distribution, the negative base-two logarithm of the summed squared bigram probabilities, which equals the negative base-two logarithm of Friedman's index of coincidence. Because the collision entropy never exceeds the Shannon entropy and falls faster when a few pairs dominate, a further 0.5 is added when it drops below two bits while the Shannon bigram entropy is already below three bits, corroborating concentrated digit reuse [9]. The score is capped at 5.0. A single finding summarises the triggered conditions. The metadata records the total values analysed, the bigram and trigram entropies, their normalised efficiency forms, the bigram and trigram collision entropies, the duplicate rate, and the counts of unique bigrams and trigrams.
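A minimal Python sketch of the steps above may make the arithmetic concrete. It is an illustration under stated assumptions, not the indicator's implementation: all function names are hypothetical, the `.6g` format is one plausible reading of "six-significant-figure representation", dropping the exponent part is an assumption, the count of possible n-grams is taken to be 10^n, and the two bigram thresholds are read as mutually exclusive.

```python
import math
from collections import Counter


def digit_string(value):
    """Digits of a six-significant-figure rendering, leading zeros dropped."""
    # Assumption: digits of a scientific-notation exponent are ignored.
    mantissa = f"{abs(value):.6g}".split("e")[0]
    digits = "".join(ch for ch in mantissa if ch.isdigit())
    return digits.lstrip("0")


def ngram_counts(values, n):
    """Frequencies of all overlapping n-digit subsequences."""
    counts = Counter()
    for v in values:
        s = digit_string(v)
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


def shannon_entropy(counts):
    """Shannon entropy in bits of a non-empty frequency table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def efficiency(counts, n):
    """Entropy as a fraction of log2(10**n), assuming 10**n possible n-grams."""
    return shannon_entropy(counts) / math.log2(10 ** n)


def collision_entropy(counts):
    """Renyi order-two entropy in bits of a non-empty frequency table."""
    total = sum(counts.values())
    return -math.log2(sum((c / total) ** 2 for c in counts.values()))


def d21_score(h_bigram, h_trigram, dup_rate, h2_bigram):
    """Combine the signals with the thresholds quoted in the text."""
    score = 0.0
    if h_bigram < 2.5:
        score += 2.0
    elif h_bigram < 3.0:  # assumption: not added on top of the 2.0 above
        score += 1.0
    if h_trigram < 4.5:
        score += 1.5
    if dup_rate > 0.25:
        score += 1.0
    if h2_bigram < 2.0 and h_bigram < 3.0:
        score += 0.5
    return min(score, 5.0)
```

Under the exclusive reading of the bigram thresholds the contributions sum to at most 5.0, so the cap only guards against future additions; if the two bigram contributions were meant to stack, the cap would become active.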
Score thresholds
| Score | Meaning |
|---|---|
| 0 | Digit patterns are as varied as genuine measurement produces. |
| 1 to 2 | Mildly reduced digit variety or a single repetitive signal. |
| 3 to 5 | Low digit entropy together with frequent duplicate values, consistent with generated or copied numbers. |
Why this matters
The digits of a measured quantity are, at the fine end, close to random, because they reflect the full precision of the measurement, so across many genuine values the short digit combinations are diverse and exact repeats are rare. Generated numbers behave differently. Mosimann and colleagues showed that neither people nor naive generators produce truly random digits, gravitating instead to familiar patterns [1], and that the terminal-digit structure of data reveals its origin [2]. A language model asked to produce numbers reuses the digit combinations of well-known constants and templates and repeats whole values more than measurement would, which Taloni and colleagues observed in a model-fabricated clinical dataset whose numbers did not behave like real measurements [3]. Entropy quantifies this loss of variety directly: a low entropy of digit bigrams or trigrams means the same combinations recur, and a high duplicate rate means whole values recur, both signatures of numbers drawn from a narrow internal repertoire rather than measured from the world. Entropy and its normalisation are the foundational measures of information content [4, 5], and recent forensic re-analyses, scoping reviews, and trustworthiness instruments place digit-pattern and repetition checks among the standard screens for fabricated and machine-generated data [6, 7, 8].
Limitations
The check pools values across columns, so it characterises the dataset as a whole rather than any single variable, and a few highly repetitive columns can drive the signal. The entropy thresholds are calibrated for pooled biomedical values and can misjudge datasets dominated by one narrow-range variable, where genuine values legitimately share digit patterns and lower the entropy. Excluding integer-valued low-cardinality columns removes the most common false-positive source, binary and coded fields, but a genuine integer measurement with many repeats, such as a heart rate, still contributes its natural duplicates. The six-significant-figure digit extraction discards precision beyond six figures and is insensitive to scale, so two values that differ only in order of magnitude can share the same digit string. The thresholds on entropy and duplicate rate are heuristic. Terminal-digit uniformity is covered by indicators S8 and D34, and value duplication in tables by indicator S14, so D21 focuses on the entropy of digit subsequences and whole-value duplication in the pooled data.
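The scale and precision limitations can be seen in a few lines. The sketch below assumes the digit string is built from Python's `.6g` format with the exponent dropped; the actual extraction may differ in detail, and `digit_string` is a hypothetical name.

```python
def digit_string(value):
    """Digits of a six-significant-figure rendering, leading zeros dropped."""
    # Assumption: digits of a scientific-notation exponent are ignored.
    mantissa = f"{abs(value):.6g}".split("e")[0]
    return "".join(ch for ch in mantissa if ch.isdigit()).lstrip("0")


# Values that differ only in magnitude collapse to one digit string.
same_digits = [digit_string(v) for v in (0.00125, 1.25, 12.5, 125.0)]

# Precision beyond six significant figures is discarded.
truncated = (digit_string(3.14159265), digit_string(3.14159))
```

Here every entry of `same_digits` is the same string, and both entries of `truncated` coincide, so the n-gram counts cannot distinguish these values even though the whole-value duplicate rate, which uses the rounded values themselves, still can.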
Theoretical background
D21 rests on information theory applied to the digit composition of numbers. If a set of measured values spans a reasonable range at full precision, the two- and three-digit subsequences within their digit strings approach a broad distribution, and the Shannon entropy of that distribution, the average number of bits needed to identify a subsequence, is high, near the maximum permitted by the number of possible combinations. A generating process that reuses a small repertoire of digit patterns concentrates the distribution on a few combinations, lowering its entropy, and one that repeats whole values raises the duplicate rate; both reduce the effective information content of the data below what genuine measurement carries. The bigram and trigram views are complementary: bigrams capture pairwise digit reuse and trigrams capture longer repeated motifs such as the digits of a memorised constant, so a model that pads its output with familiar numbers depresses the trigram entropy in particular. Excluding integer low-cardinality columns is necessary because their values are codes, not measurements, so their inevitable repetition would lower the entropy and raise the duplicate rate for reasons unrelated to fabrication, masking or mimicking the very signal the indicator seeks among the continuous values where it is meaningful. The Rényi collision entropy refines the Shannon view by weighting the heaviest repeats more: as the Rényi order rises above one, the entropy is governed increasingly by the most frequent n-grams. The order-two collision entropy, the negative base-two logarithm of the probability that two independently drawn bigrams coincide, equals the Shannon entropy only for a uniform distribution and falls furthest below it when a small set of digit pairs carries most of the mass, the concentrated reuse a narrow generator produces [9].
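The quantities in this section can be written compactly in standard notation. The symbols are notational choices for this sketch rather than the indicator's own: $p_i$ is the relative frequency of the $i$-th observed n-gram, $n$ is the n-gram length, $k$ is the number of distinct n-grams observed, and $10^n$ is assumed to be the number of possible n-grams.

```latex
\begin{align}
H_1 &= -\sum_{i=1}^{k} p_i \log_2 p_i,
  & \eta &= \frac{H_1}{\log_2 10^{n}} \\
H_\alpha &= \frac{1}{1-\alpha} \log_2 \sum_{i=1}^{k} p_i^{\alpha},
  & \alpha &> 0,\ \alpha \neq 1,
  \qquad \lim_{\alpha \to 1} H_\alpha = H_1 \\
H_2 &= -\log_2 \sum_{i=1}^{k} p_i^{2}
  & & \text{(collision entropy)} \\
H_2 &\le H_1 \le \log_2 k \le \log_2 10^{n}
\end{align}
```

As a worked case, three bigrams with frequencies 0.8, 0.1, and 0.1 give $H_1 \approx 0.92$ bits but $H_2 = -\log_2 0.66 \approx 0.60$ bits, whereas four equally frequent bigrams give $H_1 = H_2 = 2$ bits, which illustrates how the gap between the two entropies opens as mass concentrates on a few pairs.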
References
- Mosimann JE, Wiseman CV, Edelman RE. Data fabrication: Can people generate random digits? Accountability in Research. 1995;4(1):31-55. DOI: 10.1080/08989629508573866
- Mosimann JE, Dahlberg JE, Davidian NM, Krueger JW. Terminal digits and the examination of questioned data. Accountability in Research. 2002;9(2):75-92. DOI: 10.1080/08989620212969
- Taloni A, Scorcia V, Giannaccare G. Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
- Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27(3):379-423. DOI: 10.1002/j.1538-7305.1948.tb01338.x
- Cover TM, Thomas JA. Elements of Information Theory. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2006. ISBN 978-0471241959. DOI: 10.1002/047174882X
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
- Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
- Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
- Rényi A. On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Berkeley, CA: University of California Press; 1961:547-561. https://projecteuclid.org/proceedings/berkeley-symposium-on-mathematical-statistics-and-probability/Proceedings-of-the-Fourth-Berkeley-Symposium-on-Mathematical-Statistics-and/Chapter/On-Measures-of-Entropy-and-Information/bsmsp/1200512181