ResAIKit
Research Integrity Toolkit
D29 | Statistical analysis | Fabrication Extended | Layer 1 (Deterministic)

Missing Data Implausible

Looks at how missing values are spread through a dataset. Real clinical and survey data almost always has some gaps, from instrument failures, dropouts, and skipped items, and those gaps fall unevenly across variables and participants. A dataset with no missing values at all, or with every variable missing at exactly the same rate, or with almost every row sharing one identical missingness pattern, is more consistent with a generated or mechanically edited table than with genuine collection. The indicator measures the overall completeness, the per-column rates, and the per-row patterns, and scores the combination of signals. It works on the individual-patient data (IPD).

Technical description

D29 is a deterministic screen for implausibly clean or over-structured missingness in individual-patient data (IPD). It requires at least three columns and twenty rows, then derives the per-column missing rates, the overall missing rate as total missing cells over the full grid, and the distribution of per-row missingness patterns. It raises five independent signals: a dataset with no missing values at all; a structure in which every column is either fully present or fully absent, with at least one absent; a set of columns with missing data that all share one rate above the low-rate floor; an overall rate that is positive but below that floor; and a single per-row missingness pattern that covers more than four fifths of the rows. Each signal adds to the score, which is capped, so the strongest evidence is the co-occurrence of several of these departures from the uneven, partial incompleteness that real studies show.
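The derived quantities described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the toolkit's implementation: the function name `missingness_summary`, the dictionary keys, and the use of `None` to mark a missing cell are all assumptions made for the example.

```python
from collections import Counter

MIN_ROWS, MIN_COLS = 20, 3  # minimum size stated for the indicator


def missingness_summary(rows):
    """Summarise missingness in a table given as a list of equal-length rows.

    Assumption: a missing cell is represented by None.
    Returns None when the table is below the minimum size.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    if n_rows < MIN_ROWS or n_cols < MIN_COLS:
        return None

    # One boolean tuple per row: True where the cell is missing.
    masks = [tuple(value is None for value in row) for row in rows]

    col_missing = [sum(mask[j] for mask in masks) for j in range(n_cols)]
    total_missing = sum(col_missing)
    patterns = Counter(masks)

    return {
        "col_rates": [count / n_rows for count in col_missing],
        "total_missing": total_missing,
        "overall_rate": total_missing / (n_rows * n_cols),
        "n_patterns": len(patterns),
        "dominant_fraction": patterns.most_common(1)[0][1] / n_rows,
    }


# Small illustrative table: 20 rows, 3 columns, three missing cells.
table = [[1.0, 2.0, 3.0] for _ in range(20)]
table[0][0] = None
table[1][0] = None
table[0][1] = None
summary = missingness_summary(table)
print(summary)
```

In this toy table the all-complete pattern is counted like any other pattern, so it is the dominant one even though the data contain gaps.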

How it works

The per-column missing rate is the fraction of rows that are null in that column, and the overall rate is the total number of nulls divided by rows times columns. Five checks contribute. An overall rate of exactly zero adds 2.5, because a perfectly complete clinical or survey dataset is rare. If every column is either fully present or fully absent and at least one is fully absent, 1.0 is added for the binary all-or-nothing structure. If at least two columns carry missing data, a chi-square test of equal missing proportions is run across them; when the columns fit one shared rate implausibly closely (a p-value above 0.99) and that shared rate exceeds the two percent floor, 1.5 is added. The floor ensures that a couple of stray missing values matching by chance do not count as an artificial fixed-percentage rule. When SciPy is unavailable, the original 0.001 rate tolerance is used in place of the chi-square test. An overall rate that is positive but below two percent adds 0.5. If the most common per-row missingness pattern, represented as the tuple of which cells are null, covers more than eighty percent of rows, 1.0 is added. The total is capped at 5.0. Each triggered check emits a finding with the relevant counts, and the metadata records the row and column counts, the overall rate, the total missing cells, the counts of fully present and fully absent columns, the dominant row-pattern fraction, the number of distinct row-missingness patterns, and the chi-square homogeneity statistic and p-value of the missing proportions.
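The five checks can be sketched as follows. This is a minimal illustration under stated assumptions rather than the toolkit's code: the names `d29_score` and `uniform_rate_signal` and the finding labels are hypothetical, missing cells are assumed to be `None`, the "shared rate" is taken to be the pooled rate across the tested columns, only partially missing columns enter the homogeneity test, and the minimum-size check is assumed to have been done by the caller. The chi-square test uses `scipy.stats.chi2_contingency` when SciPy is installed and otherwise falls back to the 0.001 rate tolerance.

```python
from collections import Counter

try:
    from scipy.stats import chi2_contingency
except ImportError:  # fall back to the rate-tolerance comparison
    chi2_contingency = None

LOW_RATE_FLOOR = 0.02
RATE_TOLERANCE = 0.001


def uniform_rate_signal(missing_counts, n_rows):
    """True when the given columns fit one shared rate above the floor."""
    if len(missing_counts) < 2:
        return False
    # Assumption: the shared rate is the pooled rate over these columns.
    shared_rate = sum(missing_counts) / (len(missing_counts) * n_rows)
    if shared_rate <= LOW_RATE_FLOOR:
        return False
    if chi2_contingency is not None:
        # One contingency-table row per dataset column: [missing, present].
        observed = [[m, n_rows - m] for m in missing_counts]
        _, p_value, _, _ = chi2_contingency(observed, correction=False)
        return bool(p_value > 0.99)
    rates = [m / n_rows for m in missing_counts]
    return max(rates) - min(rates) <= RATE_TOLERANCE


def d29_score(rows):
    """Return (score, findings) for a table whose missing cells are None."""
    n_rows, n_cols = len(rows), len(rows[0])
    masks = [tuple(v is None for v in row) for row in rows]
    col_missing = [sum(mask[j] for mask in masks) for j in range(n_cols)]
    overall_rate = sum(col_missing) / (n_rows * n_cols)
    fully_absent = sum(1 for m in col_missing if m == n_rows)
    fully_present = sum(1 for m in col_missing if m == 0)
    partial = [m for m in col_missing if 0 < m < n_rows]
    dominant_fraction = Counter(masks).most_common(1)[0][1] / n_rows

    findings = []
    if overall_rate == 0:
        findings.append(("zero_missing", 2.5))
    if fully_absent >= 1 and fully_present + fully_absent == n_cols:
        findings.append(("all_or_nothing_columns", 1.0))
    if uniform_rate_signal(partial, n_rows):
        findings.append(("uniform_column_rate", 1.5))
    if 0 < overall_rate < LOW_RATE_FLOOR:
        findings.append(("very_low_overall_rate", 0.5))
    if dominant_fraction > 0.8:
        findings.append(("dominant_row_pattern", 1.0))

    score = min(5.0, sum(points for _, points in findings))
    return score, findings


def make_table(n_rows, n_cols, missing_cells):
    """Build a numeric table and blank out the (row, column) pairs given."""
    table = [[float(r + c) for c in range(n_cols)] for r in range(n_rows)]
    for r, c in missing_cells:
        table[r][c] = None
    return table


complete = make_table(20, 3, [])
fixed_rule = make_table(40, 3, [(r, r // 4) for r in range(12)])
uneven = make_table(
    40, 4,
    [(r, 0) for r in range(10)] + [(r, 1) for r in range(5, 11)] + [(20, 2), (21, 2)],
)
dropped_column = make_table(20, 3, [(r, 2) for r in range(20)])

for name, table in [("complete", complete), ("fixed_rule", fixed_rule),
                    ("uneven", uneven), ("dropped_column", dropped_column)]:
    print(name, d29_score(table))
```

Under these assumptions the fully complete table scores 3.5 (zero missingness plus the all-complete dominant pattern), the table with three columns each missing exactly 10% scores 1.5, the table with one fully absent column scores 2.0, and the table with uneven gaps scores 0.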

Score thresholds

Score     Meaning
0 to 1    Missingness is partial and varies naturally across variables and rows.
2 to 3    One or more strong departures, such as zero missingness or a uniform per-column rate.
4 to 5    Several implausible missingness signals together, consistent with generated or mechanically edited data.

Why this matters

Missing data is governed by mechanisms that real studies cannot avoid and fabrication tends to mishandle. The standard framework of Little and Rubin distinguishes data missing completely at random, missing at random, and missing not at random, and shows that genuine incompleteness arises from heterogeneous causes that vary by variable and participant, producing the uneven rates and diverse per-row patterns that this indicator expects [1]. George and Buyse describe how central statistical monitoring of clinical trials uses anomalies in data structure, including implausibly complete or over-regular records, to detect fraud that on-site monitoring misses, since a fabricator who types a clean table rarely reproduces the organic gaps of real collection [2]. The concern sharpens with machine generation. Taloni and colleagues showed that a language model can fabricate a clinical dataset that looks plausible, and such a process naturally emits a fully complete grid or applies a single fixed missingness rule, leaving exactly the zero-missing, uniform-rate, or dominant-pattern signatures the indicator targets [3]. Requiring the uniform-rate signal to exceed a non-trivial floor keeps the test focused on a deliberate fixed-percentage rule rather than on the stray single values that real data scatters across columns. The missing-at-random taxonomy originates with Rubin [4], and Little's test of whether data are missing completely at random formalises the homogeneity of missingness that fabrication can over-impose [5]; recent forensic re-analyses, scoping reviews, and trustworthiness instruments treat implausible missingness patterns as a routine integrity check [6, 7, 8].

Limitations

The checks are heuristic and a flag is a prompt to inspect provenance rather than proof of fabrication. Legitimately complete datasets exist, for example registries that mandate complete entry or analyses already restricted to complete cases, and these will trigger the zero-missing signal. Structured missingness can be genuine: when participants miss a study visit, all variables measured at that visit are absent for them together, which can produce a shared per-column rate and a dominant per-row pattern without any fabrication, so the indicator can flag real longitudinal dropout. The dominant-row-pattern check counts the all-complete pattern, so a real dataset with low overall missingness, in which most rows are simply complete, can raise that signal. Per-column rates move in steps of one over the row count, so whenever the dataset has fewer than 1,000 rows the 0.001 rate tolerance behaves as an exact-count match. The minimum of three columns and twenty rows excludes small tables. Distributional cleanliness of the values themselves is indicator D28, so D29 focuses specifically on the structure of missingness in the IPD.
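The longitudinal-dropout limitation can be illustrated with a small constructed table. The example below is hypothetical: the participant count, the variable values, and the set of dropouts are invented for illustration, and `None` is assumed to mark a missing cell.

```python
from collections import Counter

# Hypothetical longitudinal table: 30 participants, two baseline
# variables and two variables measured at a follow-up visit.
n_participants = 30
dropouts = {4, 11, 23}  # participants who missed the follow-up visit

rows = []
for pid in range(n_participants):
    baseline = [60.0 + pid, 120.0 + pid]
    follow_up = [None, None] if pid in dropouts else [61.0 + pid, 118.0 + pid]
    rows.append(baseline + follow_up)

# Missingness mask per row, per-column rates, and dominant pattern share.
masks = [tuple(v is None for v in row) for row in rows]
col_rates = [sum(m[j] for m in masks) / len(rows) for j in range(4)]
dominant_fraction = Counter(masks).most_common(1)[0][1] / len(rows)

print(col_rates)
print(dominant_fraction)
```

Both follow-up columns end up with the same 10% missing rate, which is above the two percent floor, and the all-complete pattern covers 90% of rows, which is above the eighty percent threshold. A genuine dropout process of this kind would therefore be expected to raise both the uniform-rate and the dominant-pattern signals.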

Theoretical background

D29 rests on the contrast between the stochastic, heterogeneous origin of real missingness and the regularity of generated or edited tables. In a genuine study each variable is collected by its own instrument or item, with its own failure and non-response rate, and each participant follows an individual trajectory of attendance and completion, so the missingness indicator matrix is a mixture of many independent processes. That mixture yields per-column rates that differ, a long list of distinct per-row patterns, and an overall rate that, while it can be low, is essentially never exactly zero across a real dataset of any size. A fabricated table departs from this in characteristic ways: typing a complete dataset gives exactly zero missingness; applying a single deletion rule of a fixed percentage gives every affected column the same rate; deleting whole variables gives the all-or-nothing column structure; and any mechanical rule applied uniformly collapses the diversity of per-row patterns onto one dominant pattern. Each check isolates one of these collapses of the natural mixture, and because the underlying real processes are independent the simultaneous appearance of several collapses is far less probable than any one alone, which is why the score accumulates across checks and is read as strongest when multiple signals coincide. The floor on the uniform-rate check encodes the recognition that exact agreement on a tiny number of missing cells is an expected coincidence under the real mixture, whereas agreement on a substantial shared rate is the fingerprint of a single imposed rule.

References

  1. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 3rd ed. Hoboken, NJ: John Wiley and Sons; 2019. DOI: 10.1002/9781119482260
  2. George SL, Buyse M. Data fraud in clinical trials. Clinical Investigation. 2015;5(2):161-173. DOI: 10.4155/cli.14.116
  3. Taloni A, Scorcia V, Giannaccare G. Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology. 2023;141(12):1174-1175. DOI: 10.1001/jamaophthalmol.2023.5162
  4. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581-592. DOI: 10.1093/biomet/63.3.581
  5. Little RJA. A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association. 1988;83(404):1198-1202. DOI: 10.1080/01621459.1988.10478722
  6. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512