ResAIKit
Research Integrity Toolkit
D27 · Statistical analysis · Fabrication Extended · Layer 1 (Deterministic)

Uniform Intra-Column Precision

Detects when all values in a column have exactly the same number of decimal places, which is a signature of computer-generated rather than instrument-measured data.

Technical description

The number of significant decimal places a stored value carries reflects how it was recorded: a value measured to two decimals but ending in a round digit, such as 4.50, is stored as the float 4.5 and shows one significant decimal. Genuine fixed-precision recording therefore produces a spread of significant-decimal counts, because a fraction of values land on round endings. A process that places every value on a fixed decimal grid and never on a round ending instead yields a column whose count is identical for all values. D27 examines each numeric column of the individual-patient data (IPD) that has at least ten values and five distinct values, and counts the significant decimals of the distinct values. It flags a column when its decimal-place distribution has near-zero normalized Shannon entropy, meaning the values are overwhelmingly concentrated on a single count of at least one decimal, unless that precision matches a known instrument.
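The contrast between the two kinds of column can be sketched in a few lines of Python. This is an illustrative sketch, not the toolkit's own code: the helper name `significant_decimals` and both sample lists are hypothetical, and the counting follows the format-and-strip approach described for this indicator.

```python
from collections import Counter

def significant_decimals(value: float, max_places: int = 10) -> int:
    """Digits after the point once the value is formatted and trailing zeros are stripped."""
    text = f"{value:.{max_places}f}".rstrip("0")
    return len(text.split(".")[1])

# Hypothetical readings recorded to two decimals; 4.50 and 6.00 land on round endings.
recorded = [4.50, 4.37, 5.12, 6.00, 3.28, 7.91]
# Hypothetical grid-generated values that never end in zero.
generated = [4.51, 4.37, 5.12, 6.03, 3.28, 7.91]

spread = Counter(significant_decimals(v) for v in recorded)    # counts of 0, 1 and 2 decimals
uniform = Counter(significant_decimals(v) for v in generated)  # every value has 2 decimals
```

In the recorded list the counts spread over zero, one, and two decimals, while the generated list collapses onto a single count, which is the pattern the indicator looks for.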

How it works

Layer 1 (deterministic): for each numeric column, the indicator drops missing entries and requires at least ten values and at least five distinct values, since a constant or near-constant column is trivially uniform. The significant decimal count of each distinct value is obtained by formatting it to ten decimal places, stripping trailing zeros, and counting the digits after the point. The most common count is the modal precision; columns whose mode is zero are integer columns and are not assessed. A qualifying column has a modal precision of at least one, and it is flagged when the normalized Shannon entropy of its decimal-place distribution is at or below 0.2. This threshold relaxes the earlier all-identical rule into a graded one, so near-perfect uniformity is caught as well. A flagged column is then checked against the instrument-precision dictionary (as in D17): if its name matches an entry whose source is an instrument and whose expected decimals equal the observed precision, the uniformity is expected and is not counted as suspicious. The proportion of qualifying columns that are suspicious sets the score: 4.0 above eighty percent, 3.0 above sixty, 2.0 above forty, and 1.0 above twenty. The indicator is skipped when fewer than five qualifying columns exist. Metadata records the qualifying and suspicious column counts, the count of suspicious columns that matched an instrument name, the proportion, the mean precision entropy, and per-column details.
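The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the toolkit's implementation: the function names are hypothetical, the entropy is assumed to be normalized by the logarithm of the number of distinct decimal counts observed, the instrument check is reduced to an exact-name lookup in a hypothetical `instrument_decimals` dictionary, and a score of 0.0 at or below twenty percent is assumed because the description does not state it.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 0.2                      # flag at or below this value
MIN_VALUES, MIN_DISTINCT, MIN_COLUMNS = 10, 5, 5

def significant_decimals(value, max_places=10):
    """Format to ten places, strip trailing zeros, count digits after the point."""
    text = f"{value:.{max_places}f}".rstrip("0")
    return len(text.split(".")[1])

def normalized_entropy(counts):
    """Shannon entropy divided by log(k); k = number of observed categories (assumed)."""
    total = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    return h / math.log(k)

def assess_column(values):
    """Return None for a non-qualifying column, else (modal precision, entropy, flagged)."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    distinct = set(present)
    if len(present) < MIN_VALUES or len(distinct) < MIN_DISTINCT:
        return None
    counts = Counter(significant_decimals(v) for v in distinct)
    mode = counts.most_common(1)[0][0]
    if mode == 0:                            # integer column, not assessed
        return None
    entropy = normalized_entropy(counts)
    return mode, entropy, entropy <= ENTROPY_THRESHOLD

def score_dataset(columns, instrument_decimals=None):
    """columns maps name -> list of values; returns a score, or None when skipped."""
    instrument_decimals = instrument_decimals or {}
    qualifying = suspicious = 0
    for name, values in columns.items():
        result = assess_column(values)
        if result is None:
            continue
        qualifying += 1
        mode, _, flagged = result
        # Uniformity that matches the expected instrument precision is not suspicious.
        if flagged and instrument_decimals.get(name) != mode:
            suspicious += 1
    if qualifying < MIN_COLUMNS:
        return None
    proportion = suspicious / qualifying
    for cutoff, score in ((0.8, 4.0), (0.6, 3.0), (0.4, 2.0), (0.2, 1.0)):
        if proportion > cutoff:
            return score
    return 0.0                               # assumed: the source gives no score here
```

Computing the distribution over distinct values rather than all rows follows the description; a production version would also need the name-matching rules of the instrument-precision dictionary, which are not specified here.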

Why this matters

Decimal precision is a recording artefact that fabrication tends to get wrong. Digit and precision anomalies are recognised signatures distinguishing invented from measured data, because a fabricator rarely reproduces the incidental structure that genuine measurement leaves behind, and people cannot match the natural distribution of trailing digits that determines recorded precision. The concern is sharper for machine generation: a model fabricating a dataset naturally emits values at a single fixed precision across every row, producing exactly the uniform-precision signature this indicator targets.

Score thresholds

0-1: Decimal precision varies naturally across most columns
2-3: A substantial share of non-integer columns show perfectly uniform precision
4-5: Almost all non-integer columns show perfectly uniform precision, consistent with grid-generated data

Limitations

The indicator reads significant decimals from stored floats, so it cannot see trailing zeros present in the original record: 4.50 is indistinguishable from 4.5. This is what makes the absence of any round-ending value informative, but it also ties the test to how the data were stored. The criterion is strict: a fabricated column with only a few round-ending values can exceed the entropy threshold and go unflagged. Conversely, a small real column with many distinct values could be uniform by chance. Columns are matched to instruments by name, so an unrecognised name removes the exemption and a misleading name could grant it wrongly. Datasets with fewer than five qualifying columns are skipped. Precision relative to a named instrument's resolution is covered by indicator D17 and natural digit heaping by indicator D18; D27 focuses on within-column precision uniformity in the IPD.
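The storage limitation is easy to reproduce. The sketch below contrasts float parsing with Python's standard `decimal` module, which keeps the digits exactly as written. It assumes the original text record is still available, which the indicator as described does not rely on.

```python
from decimal import Decimal

# Parsed as floats, the two records are the same number.
same_value = float("4.50") == float("4.5")
as_float = repr(float("4.50"))          # the trailing zero is gone

# Parsed as Decimal from the original text, the recorded precision survives.
recorded_places = -Decimal("4.50").as_tuple().exponent
stored_places = -Decimal("4.5").as_tuple().exponent
```

Here `recorded_places` is 2 and `stored_places` is 1, so the distinction is lost only once the values pass through a float.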

References

  1. Buyse M, George SL, Evans S, et al. (1999). The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine
  2. Mosimann JE, Wiseman CV, Edelman RE. (1995). Data fabrication: can people generate random digits? Accountability in Research
  3. Taloni A, Scorcia V, Giannaccare G. (2023). Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology
  4. Brown NJL, Heathers JAJ. (2017). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science 8(4):363-369
  5. Mosimann JE, Dahlberg JE, Davidian NM, Krueger JW. (2002). Terminal digits and the examination of questioned data. Accountability in Research 9(2):75-92
  6. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512