D21 | Statistical analysis | Fabrication Detection | Layer 2 (Contextual)

Digit Sequence Duplicates

Looks at the digits that make up a dataset's numbers and checks whether the same short digit patterns repeat too often. Genuinely measured values draw their digits from the full range of measurement, so their two- and three-digit combinations are varied. Machine-generated numbers tend to reuse familiar combinations (such as 14, 71, 41 from constants like 3.14 and 2.71) and to repeat whole values. The indicator measures the entropy of digit pairs and triples and the rate of exact-duplicate values.

Technical description

A contextual screen on the digit-level structure of a dataset's numeric values. It pools values across columns and requires at least fifty pooled values. Integer-valued low-cardinality columns are excluded from the pool: binary flags and coded categories repeat their values legitimately and would inflate the duplicate rate. For each pooled value it forms a digit string, extracts the overlapping bigrams and trigrams, and accumulates their frequencies. It then computes the Shannon entropy in bits of each n-gram distribution (higher means more varied) and the rate of exact-duplicate values (values that coincide when rounded to six decimal places). Low bigram entropy, low trigram entropy, and a high duplicate rate each contribute to the score.
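The pooling step described above could look roughly like the following. This is a minimal sketch, not the indicator's actual code: the function names are hypothetical, and the cut-off for "low cardinality" (`MAX_CODED_LEVELS = 10`) is an assumption, since the text does not state one. Only the fifty-value minimum comes from the description.

```python
import math
from typing import List, Mapping, Sequence

MIN_POOLED_VALUES = 50   # stated in the description
MAX_CODED_LEVELS = 10    # ASSUMPTION: the text gives no cut-off for "low cardinality"


def is_coded_column(values: Sequence[float],
                    max_levels: int = MAX_CODED_LEVELS) -> bool:
    """True for integer-valued columns with few distinct values (flags, codes)."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return False
    all_integer = all(float(v).is_integer() for v in finite)
    return all_integer and len(set(finite)) <= max_levels


def pool_values(columns: Mapping[str, Sequence[float]]) -> List[float]:
    """Pool the finite values of every column that is not a flag or coded field."""
    pooled: List[float] = []
    for values in columns.values():
        if is_coded_column(values):
            continue  # legitimate repeats would inflate the duplicate rate
        pooled.extend(float(v) for v in values if math.isfinite(v))
    return pooled


def has_enough_values(pooled: Sequence[float]) -> bool:
    """The screen is only run on at least fifty pooled values."""
    return len(pooled) >= MIN_POOLED_VALUES
```

Under these assumptions a 0/1 flag column is dropped, while an integer measurement with many distinct values (heart rate, say) stays in the pool and contributes its natural duplicates.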

How it works

Layer 2 (contextual): pooled values are converted to digit strings (six significant figures, with non-digits and leading zeros removed) and their overlapping bigrams and trigrams are counted. The Shannon entropy of each distribution is the negative sum, over all observed n-grams, of each probability times its base-two logarithm. The duplicate rate is the share of pooled values whose rounded value occurs more than once.

The score is the sum of the following contributions, capped at 5.0:

+2.0: bigram entropy below 2.5 bits (+1.0 instead if it is at least 2.5 but below 3.0 bits).
+1.5: trigram entropy below 4.5 bits.
+1.0: duplicate rate above twenty-five percent.
+0.5: Rényi order-two collision entropy of the bigram distribution below two bits while the Shannon bigram entropy is already below three bits. The collision entropy is minus log2 of the summed squared bigram probabilities, that is, the negative log of Friedman's index of coincidence.

Metadata records total_values, bigram_entropy, trigram_entropy, their normalised Shannon-efficiency forms, bigram_collision_entropy, trigram_collision_entropy, duplicate_value_rate, and the unique bigram and trigram counts.
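The scoring rules above can be sketched in a few functions. This is an illustrative reading of the description, not the indicator's implementation: all function names are hypothetical, the digit string is built with Python's `.6g` format (an assumption about how "six significant figures" is realised, with any exponent part dropped), and only a subset of the metadata fields is returned. The thresholds and weights are the ones stated in the text.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Sequence


def digit_string(value: float) -> str:
    """Digits of a value at six significant figures, leading zeros removed."""
    text = format(abs(value), ".6g")
    mantissa = text.split("e")[0]  # ASSUMPTION: exponent digits are ignored
    digits = "".join(ch for ch in mantissa if ch.isdigit())
    return digits.lstrip("0")


def ngram_counts(strings: Iterable[str], n: int) -> Counter:
    """Frequencies of overlapping n-grams across all digit strings."""
    counts: Counter = Counter()
    for s in strings:
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    return counts


def shannon_entropy(counts: Counter) -> float:
    """H = -sum(p * log2(p)) in bits; 0.0 when there are no n-grams."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def collision_entropy(counts: Counter) -> float:
    """Renyi order-two entropy, -log2(sum(p^2)); 0.0 when there are no n-grams."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -math.log2(sum((c / total) ** 2 for c in counts.values()))


def duplicate_rate(values: Sequence[float]) -> float:
    """Share of values whose six-decimal rounding occurs more than once."""
    rounded = [round(v, 6) for v in values]
    seen = Counter(rounded)
    return sum(1 for r in rounded if seen[r] > 1) / len(rounded) if rounded else 0.0


def digit_sequence_score(values: Sequence[float]) -> Dict[str, float]:
    """Score pooled values; the caller is assumed to have checked the fifty-value minimum."""
    strings = [digit_string(v) for v in values]
    bigrams, trigrams = ngram_counts(strings, 2), ngram_counts(strings, 3)
    h2, h3 = shannon_entropy(bigrams), shannon_entropy(trigrams)
    c2 = collision_entropy(bigrams)
    dup = duplicate_rate(values)

    score = 0.0
    if h2 < 2.5:
        score += 2.0
    elif h2 < 3.0:
        score += 1.0
    if h3 < 4.5:
        score += 1.5
    if dup > 0.25:
        score += 1.0
    if c2 < 2.0 and h2 < 3.0:
        score += 0.5

    return {
        "score": min(score, 5.0),
        "total_values": len(values),
        "bigram_entropy": h2,
        "trigram_entropy": h3,
        "bigram_collision_entropy": c2,
        "duplicate_value_rate": dup,
    }
```

As a worked case under these assumptions, sixty values alternating between 3.14 and 2.71 yield four equally frequent bigrams (31, 14, 27, 71), so the bigram entropy is 2.0 bits, the trigram entropy is 1.0 bit, and the duplicate rate is 1.0. The collision entropy is exactly 2.0 bits, which does not fall below the two-bit threshold, so `digit_sequence_score([3.14, 2.71] * 30)["score"]` comes to 2.0 + 1.5 + 1.0 = 4.5.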

Why this matters

The digits of a measured quantity are, at the fine end, close to random, reflecting the full precision of measurement, so across many genuine values the short digit combinations are diverse and exact repeats are rare. Generated numbers behave differently. Mosimann and colleagues showed that neither people nor naive generators produce truly random digits, that both gravitate to familiar patterns, and that terminal-digit structure reveals a number's origin. A language model reuses the digit combinations of well-known constants and templates and repeats whole values more often than measurement would, which Taloni and colleagues observed in a model-fabricated clinical dataset. Entropy quantifies this loss of variety: low entropy of digit bigrams or trigrams means that combinations recur, and a high duplicate rate means that whole values recur. Both are signatures of numbers drawn from a narrow internal repertoire.

Score thresholds

0: Digit patterns are as varied as genuine measurement produces.
1-2: Mildly reduced digit variety or a single repetitive signal.
3-5: Low digit entropy together with frequent duplicate values, consistent with generated or copied numbers.

Limitations

Pools values across columns, so it characterises the dataset as a whole rather than any single variable, and a few highly repetitive columns can drive the signal. The entropy thresholds are calibrated for pooled biomedical values and can misjudge datasets dominated by one narrow-range variable, where genuine values share digit patterns and therefore show lower entropy. Excluding integer-valued low-cardinality columns removes the most common false-positive source (binary and coded fields), but a genuine integer measurement with many repeats (heart rate, for example) still contributes natural duplicates. The six-significant-figure extraction discards finer precision and is insensitive to scale, so two values differing only in magnitude can share a digit string. The thresholds are heuristic. Terminal-digit uniformity is tested by S8 and D34, and value duplication in tables by S14.
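The scale insensitivity noted above is easy to see in a small example. The helper below is a hypothetical reading of the digit-string step (Python's `.6g` format, exponent dropped, non-digits and leading zeros removed), so the exact strings are an assumption about the extraction rather than a record of it.

```python
def digit_string(value: float) -> str:
    """Digits of a value at six significant figures, leading zeros removed."""
    text = format(abs(value), ".6g")
    mantissa = text.split("e")[0]  # ASSUMPTION: exponent digits are ignored
    digits = "".join(ch for ch in mantissa if ch.isdigit())
    return digits.lstrip("0")


# Three values that differ only in magnitude collapse to one digit string.
for v in (3.14, 31.4, 0.0314):
    print(v, "->", digit_string(v))
```

All three values map to the string 314 and so feed identical bigrams and trigrams into the entropy counts, although they are not exact duplicates of one another.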