Digit Sequence Duplicates
Looks at the digits that make up a dataset's numbers and checks whether the same short digit patterns repeat too often. Genuinely measured values draw their digits from the full range of measurement, so their two- and three-digit combinations are varied. Machine-generated numbers tend to reuse familiar combinations (such as 14, 71, 41 from constants like 3.14 and 2.71) and to repeat whole values. The indicator measures the entropy of digit pairs and triples and the rate of exact-duplicate values.
Technical description
A contextual screen on the digit-level structure of a dataset's numeric values. It pools values across columns, excluding integer-valued low-cardinality columns (binary flags and coded categories, whose values legitimately repeat and would inflate the duplicate rate), and requires at least fifty pooled values. For each value it forms a digit string, extracts the overlapping bigrams and trigrams, and accumulates their frequencies. It then computes the Shannon entropy in bits of each distribution (higher means more varied) and the rate of exact-duplicate values (values that coincide when rounded to six decimals). Low bigram entropy, low trigram entropy, and a high duplicate rate each contribute to the score.
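The pooling step can be sketched as follows. This is an illustration under stated assumptions, not the indicator's actual code: the names `pool_values`, `is_coded_column`, and `MAX_CODED_CARDINALITY` are hypothetical, the cardinality cutoff of 10 is invented for the example (the description does not state one), missing values are assumed to arrive as NaN, and returning `None` for an undersized pool is likewise an assumption.

```python
import math

MIN_POOLED_VALUES = 50       # minimum pool size given in the description
MAX_CODED_CARDINALITY = 10   # hypothetical cutoff; the source does not state one


def is_coded_column(values, max_cardinality=MAX_CODED_CARDINALITY):
    """True for an integer-valued column with few distinct values (flags, codes)."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return False
    integer_valued = all(float(v).is_integer() for v in finite)
    return integer_valued and len(set(finite)) <= max_cardinality


def pool_values(columns):
    """Pool finite values from all columns that are not coded.

    `columns` maps a column name to a list of numbers. Returns the pooled
    list, or None when fewer than MIN_POOLED_VALUES values remain
    (assumed behaviour for an undersized pool).
    """
    pooled = []
    for values in columns.values():
        if is_coded_column(values):
            continue  # skip binary flags and coded categories
        pooled.extend(v for v in values if math.isfinite(v))
    return pooled if len(pooled) >= MIN_POOLED_VALUES else None
```

With this sketch, a 0/1 flag column is dropped while a continuous column and a wide-ranging integer column (more distinct values than the cutoff) both stay in the pool.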
How it works
Layer 2 (contextual): pooled values are converted to digit strings (six significant figures, with non-digits and leading zeros removed), and their overlapping bigrams and trigrams are counted. The Shannon entropy of each distribution is the negative sum, over all n-grams, of each probability times its base-two logarithm. The duplicate rate is the share of pooled values whose rounded value occurs more than once. The score adds 2.0 when bigram entropy is below 2.5 bits, or 1.0 when it is at least 2.5 but below 3.0 bits; adds 1.5 when trigram entropy is below 4.5 bits; and adds 1.0 when the duplicate rate exceeds twenty-five percent. The Rényi order-two collision entropy of the bigram distribution (minus log2 of the summed squared bigram probabilities, that is, the negative log of Friedman's index of coincidence) adds a further 0.5 when it falls below two bits while the Shannon bigram entropy is already below three bits. The total is capped at 5.0. Metadata records total_values, bigram_entropy, trigram_entropy, their normalised Shannon-efficiency forms, bigram_collision_entropy, trigram_collision_entropy, duplicate_value_rate, and the unique bigram and trigram counts.
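The computation described above can be sketched in a few functions. This is a minimal illustration, not the indicator's implementation: all function names are hypothetical, the digit string is built with Python's `.6g` format as one plausible reading of "six significant figures", and an empty n-gram distribution is assumed to have zero entropy. The thresholds and increments are the ones stated in the text.

```python
import math
from collections import Counter


def digit_string(value, sig=6):
    """Digits of `value` at `sig` significant figures, leading zeros dropped."""
    mantissa = f"{value:.{sig}g}".split("e")[0]  # discard any exponent part
    return "".join(ch for ch in mantissa if ch.isdigit()).lstrip("0")


def ngram_counts(strings, n):
    """Frequencies of overlapping n-grams across all digit strings."""
    counts = Counter()
    for s in strings:
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


def shannon_entropy(counts):
    """Shannon entropy in bits; 0.0 for an empty distribution (assumption)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def collision_entropy(counts):
    """Renyi order-two entropy in bits: -log2 of the summed squared probabilities."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -math.log2(sum((c / total) ** 2 for c in counts.values()))


def duplicate_rate(values, decimals=6):
    """Share of values whose rounded value occurs more than once."""
    rounded = [round(v, decimals) for v in values]
    seen = Counter(rounded)
    return sum(1 for r in rounded if seen[r] > 1) / len(rounded)


def score_from_metrics(bigram_h, trigram_h, dup_rate, bigram_h2):
    """Apply the stated thresholds and cap the total at 5.0."""
    score = 0.0
    if bigram_h < 2.5:
        score += 2.0
    elif bigram_h < 3.0:
        score += 1.0
    if trigram_h < 4.5:
        score += 1.5
    if dup_rate > 0.25:
        score += 1.0
    if bigram_h2 < 2.0 and bigram_h < 3.0:
        score += 0.5
    return min(score, 5.0)


def indicator_score(values):
    """Score a pooled, non-empty list of numeric values."""
    strings = [digit_string(v) for v in values]
    bigrams = ngram_counts(strings, 2)
    trigrams = ngram_counts(strings, 3)
    return score_from_metrics(
        shannon_entropy(bigrams),
        shannon_entropy(trigrams),
        duplicate_rate(values),
        collision_entropy(bigrams),
    )
```

As a check on the arithmetic, thirty copies each of 3.14 and 2.71 yield four equally likely bigrams (2.0 bits), two equally likely trigrams (1.0 bit), and a duplicate rate of 1.0, so `indicator_score([3.14] * 30 + [2.71] * 30)` returns 4.5. The collision-entropy bonus is not added because that entropy equals, rather than falls below, two bits.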
Why this matters
The trailing digits of a measured quantity are close to random, reflecting the full precision of measurement, so across many genuine values the short digit combinations are diverse and exact repeats are rare. Generated numbers behave differently. Mosimann and colleagues showed that neither people nor naive generators produce truly random digits, gravitating instead to familiar patterns, and that terminal-digit structure can reveal the origin of questioned data. A language model reuses the digit combinations of well-known constants and templates and repeats whole values more often than measurement would, a pattern Taloni and colleagues observed in a model-fabricated clinical dataset. Entropy quantifies this loss of variety: low entropy of digit bigrams or trigrams means that combinations recur, and a high duplicate rate means that whole values recur. Both are signatures of numbers drawn from a narrow internal repertoire.
Score thresholds
- 0: Digit patterns are as varied as genuine measurement produces.
- 1-2: Mildly reduced digit variety or a single repetitive signal.
- 3-5: Low digit entropy together with frequent duplicate values, consistent with generated or copied numbers.
Limitations
Pools values across columns, so it characterises the dataset as a whole rather than any single variable, and a few highly repetitive columns can drive the signal. The entropy thresholds are calibrated for pooled biomedical values and can misjudge datasets dominated by one narrow-range variable, where genuine values share digit patterns and therefore show lower entropy. Excluding integer-valued low-cardinality columns removes the most common source of false positives (binary and coded fields), but a genuine integer measurement with many repeats (for example, heart rate) still contributes natural duplicates. The six-significant-figure extraction discards finer precision and is insensitive to scale, so two values that differ only in magnitude can share a digit string. The thresholds are heuristic. Terminal-digit uniformity is covered by S8 and D34, and value duplication in tables by S14.
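The scale insensitivity noted above is easy to see in a small sketch. The `digit_string` helper here is hypothetical and assumes the digit string is built from Python's `.6g` format with non-digits and leading zeros stripped, which is one plausible reading of the description rather than the confirmed method.

```python
def digit_string(value, sig=6):
    """Digits of `value` at `sig` significant figures, leading zeros dropped."""
    mantissa = f"{value:.{sig}g}".split("e")[0]  # discard any exponent part
    return "".join(ch for ch in mantissa if ch.isdigit()).lstrip("0")


# Three values that differ only in magnitude map to the same digit string.
for v in (1.2345, 123.45, 0.0012345):
    print(v, "->", digit_string(v))
```

Under this reading, all three values yield the digit string 12345, so they contribute identical bigrams and trigrams even though none of them is a duplicate value.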
References
- Mosimann JE, Wiseman CV, Edelman RE. (1995). Data fabrication: Can people generate random digits? Accountability in Research 4(1):31-55
- Mosimann JE, Dahlberg JE, Davidian NM, Krueger JW. (2002). Terminal digits and the examination of questioned data. Accountability in Research 9(2):75-92
- Taloni A, Scorcia V, Giannaccare G. (2023). Large Language Model Advanced Data Analysis Abuse to Create a Fake Data Set in Medical Research. JAMA Ophthalmology 141(12):1174-1175
- Shannon CE. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27(3):379-423
- Cover TM, Thomas JA. (2006). Elements of Information Theory. 2nd ed. Hoboken, NJ: John Wiley & Sons. ISBN 978-0471241959
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Rényi A. (1961). On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1:547-561