Uniform Intra-Column Precision
This indicator looks at how many decimal places the values in each column carry. When a column is measured to a fixed precision, some readings naturally land on a round value and so show fewer decimal places, for example 4.50 recorded as 4.5. A column in which essentially every distinct value has the same number of decimal places, with almost none landing on a round ending, is unnatural and points to values generated on a fixed grid rather than measured. The indicator counts the decimal places of the distinct values in each non-integer column and flags columns whose precision is perfectly or almost perfectly uniform and not explained by a known instrument. It works on the individual-patient data (IPD).
Technical description
D27 is a deterministic screen for artificial decimal-precision uniformity in individual-patient data (IPD). The number of significant decimal places a stored value carries reflects how it was recorded: a value measured to two decimals but ending in a round digit, such as 4.50, is stored as the float 4.5 and so shows one significant decimal. Genuine fixed-precision recording therefore produces a spread of significant-decimal counts, because a fraction of values land on round endings, whereas a process that places every value on a fixed decimal grid and never on a round ending yields a column where the count is identical for all values. The indicator examines each numeric column with at least ten values and at least five distinct values, counts the significant decimals of the distinct values, and identifies columns whose most common count is at least one and whose counts are concentrated almost entirely on that single value (normalized entropy at or below 0.2). Such a column is reported as suspicious unless its precision matches a known instrument from the precision dictionary, in which case fixed precision is expected.
How it works
For each numeric column the indicator drops missing entries and requires at least ten remaining values. It then takes the distinct values and requires at least five of them, because a constant or near-constant column is trivially uniform and carries no evidence of a fabricated precision pattern. The significant-decimal count of each distinct value is obtained by formatting the value to ten decimal places, stripping trailing zeros, and counting the digits left after the point. The most common count is the modal precision; a column whose mode is zero is treated as an integer column and is not assessed.

A qualifying column is one with a modal precision of at least one. It is flagged when the normalized Shannon entropy of its decimal-place distribution is at or below 0.2, meaning that almost all distinct values share a single count. This graded rule relaxes the earlier rigid rule, which required every distinct value to share the count, and so also catches near-perfect uniformity. A flagged column is then checked against the instrument-precision dictionary used by D17: if the column name matches an entry whose source is an instrument and whose expected decimals equal the observed precision, the uniformity is expected and the column is not counted as suspicious.

The score is set by the proportion of qualifying columns that are suspicious: 4.0 above eighty percent, 3.0 above sixty percent, 2.0 above forty percent, and 1.0 above twenty percent. The analysis is skipped when fewer than five qualifying columns exist. The metadata records the qualifying and suspicious column counts, the number of suspicious columns that matched an instrument name, the proportion, the mean precision entropy, and per-column details including each column's precision entropy.
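The per-column steps described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names, the dictionary keys, and the choice to normalize entropy by the logarithm of the number of observed decimal counts are all assumptions, since the text does not state how the entropy is normalized. The instrument-dictionary check is left out.

```python
import math
from collections import Counter


def significant_decimals(value, max_places=10):
    """Digits after the point once trailing zeros are stripped."""
    text = f"{value:.{max_places}f}".rstrip("0")
    # Whole numbers end in "." after stripping, giving a count of 0.
    return len(text.split(".")[1])


def normalized_entropy(counts):
    """Shannon entropy divided by log(k), k = number of observed counts.

    Assumption: the normalization used by the real indicator may differ.
    """
    k = len(counts)
    if k <= 1:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(k)


def assess_column(values, min_values=10, min_distinct=5, threshold=0.2):
    """Return None for non-qualifying columns, else a small result dict."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    if len(present) < min_values:
        return None
    distinct = set(present)
    if len(distinct) < min_distinct:
        return None
    counts = Counter(significant_decimals(v) for v in distinct)
    modal = counts.most_common(1)[0][0]
    if modal < 1:  # integer column, not assessed
        return None
    entropy = normalized_entropy(counts)
    return {"modal_precision": modal, "entropy": entropy,
            "flagged": entropy <= threshold}


# Hypothetical columns: one on a two-decimal grid, one with round endings.
grid = [1.23, 2.34, 3.45, 4.56, 5.67, 6.78, 7.89, 8.91, 9.12, 1.34]
measured = [4.5, 4.25, 3.75, 5.1, 6.33, 7.0, 8.42, 2.9, 1.15, 3.6]
print(assess_column(grid))
print(assess_column(measured))
```

In this sketch the grid column has a single decimal count and therefore zero entropy, so it is flagged, while the measured column spreads over zero, one, and two decimals and is not.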
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Decimal precision varies naturally across most columns. |
| 2 to 3 | A substantial share of non-integer columns show perfectly uniform precision. |
| 4 to 5 | Almost all non-integer columns show perfectly uniform precision, consistent with grid-generated data. |
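The mapping from the proportion of suspicious columns to a score can be sketched as below. The function names are hypothetical, the cut-offs are treated as strict inequalities because the text says "above", and returning 0.0 when no cut-off is exceeded is an assumption.

```python
from typing import Optional


def score_from_proportion(proportion: float) -> float:
    """Map the suspicious share of qualifying columns to a score."""
    for cutoff, score in ((0.8, 4.0), (0.6, 3.0), (0.4, 2.0), (0.2, 1.0)):
        if proportion > cutoff:
            return score
    return 0.0  # assumption: no cut-off exceeded means a score of zero


def dataset_score(n_qualifying: int, n_suspicious: int,
                  min_qualifying: int = 5) -> Optional[float]:
    """Return None when the analysis is skipped for too few columns."""
    if n_qualifying < min_qualifying:
        return None
    return score_from_proportion(n_suspicious / n_qualifying)


print(dataset_score(10, 9))  # nine of ten qualifying columns suspicious
print(dataset_score(4, 4))   # too few qualifying columns, skipped
```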
Why this matters
Decimal precision is a recording artefact that fabrication tends to get wrong. Buyse and colleagues set out the role of biostatistics in detecting fraud in clinical trials and identify digit and precision anomalies among the signatures that distinguish invented from measured data, because a fabricator rarely reproduces the incidental structure that genuine measurement leaves behind [1]. Mosimann and colleagues showed experimentally that people asked to generate numbers cannot match the natural distribution of digits, a finding that extends to the trailing digits that determine recorded precision: real fixed-precision data sheds a decimal place whenever a value lands on a round ending, whereas invented values placed deliberately on a grid do not [2]. The concern is sharper for machine generation. Taloni and colleagues demonstrated that a language model can fabricate a clinical dataset that looks plausible at a glance, and such a process naturally emits values at a single fixed precision across every row, producing exactly the uniform-precision signature this indicator targets [3]. Checking the distinct values, and excusing columns whose precision matches a known instrument, keeps the test focused on the unnatural absence of round-ending values rather than on the ordinary fixed precision of a real device. Treating the granularity of reported numbers as a fingerprint of how they were produced is the same principle that the GRIM family of tests and the forensic study of terminal digits exploit [4, 5], and recent forensic re-analyses, scoping reviews, and trustworthiness instruments place precision and digit checks among the standard screens for fabricated and machine-generated data [6, 7, 8].
Limitations
The indicator reads significant decimal places from stored floating-point values, so it cannot see trailing zeros that were present in the original record: a value recorded as 4.50 is indistinguishable from 4.5. This is the very property that makes the absence of any such value informative, but it also means the test depends on how the data were stored. The criterion is tight: the entropy threshold of 0.2 tolerates only a small share of distinct values at a different precision, so a fabricated column with more than a few round-ending values will not be flagged, and conversely a real column with only a handful of distinct values could be uniform by chance. Columns are matched to instruments by name, so an unrecognised instrument name removes the expected-precision exemption and a misleading name could grant it wrongly. The minimum of five qualifying columns means narrow datasets are skipped. Precision relative to a named instrument's resolution is covered by indicator D17 and natural digit heaping by indicator D18, so D27 focuses specifically on within-column precision uniformity in the IPD.
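The storage dependence can be seen in a short sketch. It contrasts a value parsed as a float, which is what the indicator sees, with the same text parsed by Python's `decimal` module, which keeps the recorded trailing zero. The helper names are hypothetical, and the `Decimal` route is shown only to illustrate what is lost, not as something the indicator does.

```python
from decimal import Decimal


def stored_decimals(text: str) -> int:
    """Decimals visible after the value has been stored as a float."""
    formatted = f"{float(text):.10f}".rstrip("0")
    return len(formatted.split(".")[1])


def recorded_decimals(text: str) -> int:
    """Decimals as written in the raw record, trailing zeros included."""
    exponent = Decimal(text).as_tuple().exponent
    return max(0, -exponent)


raw = "4.50"  # hypothetical raw entry from a case report form
print(stored_decimals(raw), recorded_decimals(raw))
```

For the raw entry "4.50" the float route reports one decimal and the text route reports two, which is why the test is only meaningful on data that passed through numeric storage.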
Theoretical background
D27 rests on the arithmetic of rounding. When a quantity is recorded to a fixed number of decimal places and there is no strong digit preference, the last retained digit is approximately uniform over the ten possibilities, so about one value in ten ends in zero and is stored with one fewer significant decimal, about one in a hundred ends in two zeros, and so on. The expected distribution of significant-decimal counts in a genuine fixed-precision column is therefore spread: it is concentrated at the nominal precision but has a predictable tail at lower counts. The probability that n distinct values show no reduction at all is roughly (9/10)^n, which falls geometrically as the number of distinct values grows. A generation process that draws values on an exact grid, or a model that emits each value at the same nominal precision, removes this tail and yields a degenerate distribution at a single count. The indicator detects that degeneracy. Assessing the distinct values rather than all rows is essential, because repetition of one value would otherwise manufacture apparent uniformity, and the minimum distinct-value requirement excludes columns with too few distinct values for uniformity to carry any weight. Exempting instrument-matched columns encodes the one legitimate source of genuine fixed precision, a device that reports to a set resolution, so that the residual flagged columns are those whose uniformity has no measurement explanation.
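The geometric fall-off can be made concrete with a small calculation. The sketch assumes that each distinct value independently ends in zero with probability one in ten, which is an idealization of the uniform-last-digit argument above; the function name is hypothetical.

```python
def prob_no_round_ending(n_distinct: int, p_zero: float = 0.1) -> float:
    """Chance that none of n distinct values ends in a zero digit.

    Assumes independent values and a uniform last retained digit.
    """
    return (1.0 - p_zero) ** n_distinct


for n in (5, 10, 20, 50):
    print(n, round(prob_no_round_ending(n), 3))
```

Under these assumptions the chance of seeing no round ending is about 0.59 with five distinct values, about 0.12 with twenty, and below 0.01 with fifty, so uniformity is much stronger evidence in columns with many distinct values than in columns near the minimum.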
References
1. Buyse M, George SL, Evans S, et al. The role of biostatistics in the prevention, detection and treatment of fraud in clinical trials. Statistics in Medicine. 1999;18(24):3435-3451. DOI: 10.1002/(SICI)1097-0258(19991230)18:24<3435::AID-SIM365>3.0.CO;2-O
2. Mosimann JE, Wiseman CV, Edelman RE. Data fabrication: can people generate random digits? Accountability in Research. 1995;4(1):31-55. DOI: 10.1080/08989629508573866
3. Taloni A, Scorcia V, Giannaccare G. Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
4. Brown NJL, Heathers JAJ. The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science. 2017;8(4):363-369. DOI: 10.1177/1948550616673876
5. Mosimann JE, Dahlberg JE, Davidian NM, Krueger JW. Terminal digits and the examination of questioned data. Accountability in Research. 2002;9(2):75-92. DOI: 10.1080/08989620212969
6. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512