ResAIKit
Research Integrity Toolkit
D29 | Statistical analysis | Fabrication Extended | Layer 1 (Deterministic)

Missing Data Implausible

Checks whether the pattern of missing values in the IPD is plausible; fabricated datasets often have either no missing data at all, or suspiciously structured missing patterns.

Technical description

Real clinical and survey data almost always has gaps from instrument failures, dropouts, and skipped items, and those gaps fall unevenly across variables and participants. D29 requires at least three columns and twenty rows of individual-patient data (IPD), then derives the per-column missing rates, the overall missing rate as total missing cells over the full grid, and the distribution of per-row missingness patterns. It raises independent signals: no missing values at all; every column either fully present or fully absent with at least one absent; columns with missing data whose proportions are implausibly homogeneous by a chi-square test, fitting one shared rate above the low-rate floor almost exactly; an overall rate positive but below that floor; and a single per-row missingness pattern covering more than four fifths of rows.
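The quantities described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the toolkit's own code: the function name `missingness_summary`, the list-of-rows input with `None` marking a missing cell, and the returned dictionary keys are all assumptions made for the example.

```python
from collections import Counter

MIN_ROWS = 20  # size gate stated in the description
MIN_COLS = 3


def missingness_summary(rows):
    """Summarise missingness in a table given as a list of equal-length rows.

    A cell equal to None is treated as missing (an assumption of this sketch).
    Returns None when the table is below the 3-column / 20-row gate.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    if n_rows < MIN_ROWS or n_cols < MIN_COLS:
        return None

    # Per-column missing counts and rates (fraction of rows that are null).
    col_missing = [sum(1 for row in rows if row[j] is None) for j in range(n_cols)]
    col_rates = [m / n_rows for m in col_missing]

    # Overall rate: total missing cells over the full rows-by-columns grid.
    total_missing = sum(col_missing)
    overall_rate = total_missing / (n_rows * n_cols)

    # Per-row missingness pattern: a tuple of booleans, True where a cell is null.
    patterns = Counter(tuple(cell is None for cell in row) for row in rows)
    dominant_fraction = patterns.most_common(1)[0][1] / n_rows

    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "col_missing": col_missing,
        "col_rates": col_rates,
        "total_missing": total_missing,
        "overall_rate": overall_rate,
        "distinct_patterns": len(patterns),
        "dominant_fraction": dominant_fraction,
    }


# Toy table: 20 rows, 3 columns, three scattered gaps.
rows = [[1.0, 2.0, 3.0] for _ in range(20)]
rows[0][0] = None
rows[1][0] = None
rows[2][1] = None
summary = missingness_summary(rows)
```

In this toy table 17 of the 20 rows are fully complete, so the all-complete pattern is the dominant one.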

How it works

Layer 1 (deterministic): the per-column missing rate is the fraction of rows that are null, and the overall rate is total nulls divided by rows times columns. Five checks contribute. An overall rate of exactly zero adds 2.5. If every column is either fully present or fully absent and at least one is fully absent, 1.0 is added. If at least two columns carry missing data, a chi-square test of equal missing proportions across them fits a single shared rate implausibly well (p above 0.99), and that shared rate exceeds the two percent floor, 1.5 is added. The floor ensures that a few stray missing values whose rates match by chance are not counted as an artificial fixed-percentage rule. When SciPy is unavailable, a rate tolerance of 0.001 is used in place of the chi-square test. An overall rate that is positive but below two percent adds 0.5. If the most common per-row null pattern covers more than eighty percent of rows, 1.0 is added. The total is capped at 5.0. Metadata records the row and column counts, the overall rate, the total missing cells, the counts of fully present and fully absent columns, the dominant row-pattern fraction, the number of distinct row-missingness patterns, and the chi-square homogeneity statistic and p-value of the missing proportions.
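A minimal sketch of how the five contributions might combine is given below. The weights, the two percent floor, the 0.99 p-value cut-off, and the 0.001 tolerance come from the description; everything else is an assumption for illustration, including the names `homogeneous_rates` and `d29_score`, the use of the pooled rate as the "shared rate", the reading of the fallback as "largest minus smallest rate within 0.001", and the choice to skip the homogeneity check when every column with gaps is fully absent. The chi-square statistic uses the fact that all columns have the same number of rows, so the 2-by-k Pearson statistic reduces to the sum of squared deviations from the pooled count divided by n p (1 - p).

```python
try:
    from scipy.stats import chi2  # chi-square survival function for the p-value
except ImportError:  # fall back to the rate tolerance described in the text
    chi2 = None

LOW_RATE_FLOOR = 0.02    # two percent floor
P_HOMOGENEOUS = 0.99     # "implausibly homogeneous" cut-off
RATE_TOLERANCE = 0.001   # used only when SciPy is unavailable


def homogeneous_rates(missing_counts, n_rows):
    """True if columns with gaps share one rate above the floor almost exactly."""
    counts = [m for m in missing_counts if m > 0]
    if len(counts) < 2:
        return False
    pooled = sum(counts) / (len(counts) * n_rows)
    # Assumption: skip the degenerate case where every such column is fully absent.
    if pooled <= LOW_RATE_FLOOR or pooled >= 1.0:
        return False
    if chi2 is None:
        rates = [m / n_rows for m in counts]
        return max(rates) - min(rates) <= RATE_TOLERANCE
    expected = n_rows * pooled
    stat = sum((m - expected) ** 2 for m in counts) / (n_rows * pooled * (1 - pooled))
    p_value = float(chi2.sf(stat, len(counts) - 1))
    return p_value > P_HOMOGENEOUS


def d29_score(n_rows, missing_counts, dominant_fraction):
    """Combine the five checks into a score capped at 5.0 (hypothetical helper)."""
    n_cols = len(missing_counts)
    total = sum(missing_counts)
    overall = total / (n_rows * n_cols)
    score = 0.0
    if total == 0:                                   # no missing values at all
        score += 2.5
    if all(m in (0, n_rows) for m in missing_counts) and n_rows in missing_counts:
        score += 1.0                                 # all-or-nothing columns
    if homogeneous_rates(missing_counts, n_rows):    # one shared rate above the floor
        score += 1.5
    if 0 < overall < LOW_RATE_FLOOR:                 # positive but below the floor
        score += 0.5
    if dominant_fraction > 0.80:                     # one row pattern dominates
        score += 1.0
    return min(score, 5.0)


complete_grid = d29_score(50, [0, 0, 0, 0], 1.0)    # fully complete table
fixed_rule = d29_score(50, [5, 5, 5, 0], 0.7)       # exactly 10% missing in three columns
organic = d29_score(50, [1, 6, 14, 0], 0.6)         # uneven rates, diverse rows
```

Under these assumptions the complete grid picks up both the zero-missing and the dominant-pattern contributions, the fixed-rule table picks up only the homogeneity contribution, and the uneven table scores zero.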

Why this matters

Genuine incompleteness arises from heterogeneous mechanisms that vary by variable and participant, producing uneven rates and diverse per-row patterns, so implausibly complete or over-regular records are a recognised target of central statistical monitoring for fraud. A fabricator who types a clean table rarely reproduces organic gaps, and a language model generating a dataset naturally emits a fully complete grid or applies a single fixed missingness rule, leaving the zero-missing, uniform-rate, or dominant-pattern signatures this indicator targets.

Score thresholds

0-1: Missingness is partial and varies naturally across variables and rows
2-3: One or more strong departures, such as zero missingness or a uniform per-column rate
4-5: Several implausible missingness signals together, consistent with generated or mechanically edited data

Limitations

The checks are heuristic, and a flag prompts inspection rather than proving fabrication. Legitimately complete datasets exist, such as registries that mandate complete entry or complete-case analyses, and these trigger the zero-missing signal. Structured missingness can be genuine: when a participant misses a visit, all variables from that visit are absent together, producing a shared per-column rate and a dominant per-row pattern without fabrication. The dominant-row-pattern check counts the all-complete pattern, so a real dataset with low missingness, where most rows are simply complete, can raise it. Because a per-column rate is a count of null rows divided by the row count, rates can only match at a granularity of one over the row count, and small tables below three columns or twenty rows are excluded. Distributional cleanliness of the values is covered by indicator D28; D29 focuses on the structure of missingness in the IPD.
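The dominant-row-pattern limitation is easy to reproduce. The sketch below builds a hypothetical 100-row table with five scattered gaps and shows that the all-complete pattern alone exceeds the eighty percent threshold; the helper name `dominant_pattern` and the `None`-as-missing convention are assumptions of the example.

```python
from collections import Counter


def dominant_pattern(rows):
    """Return the most common per-row null pattern and the fraction of rows it covers."""
    patterns = Counter(tuple(cell is None for cell in row) for row in rows)
    pattern, count = patterns.most_common(1)[0]
    return pattern, count / len(rows)


# Hypothetical genuine dataset: 100 rows, 3 columns, only five scattered gaps.
rows = [[1, 2, 3] for _ in range(100)]
for i in range(5):
    rows[i][i % 3] = None  # one gap per affected row, spread across columns

pattern, fraction = dominant_pattern(rows)
overall_rate = sum(cell is None for row in rows for cell in row) / (100 * 3)
```

Here the dominant pattern is the all-complete row, covering 95 of 100 rows, and the overall rate of 5 in 300 cells is also below the two percent floor, so a table like this would collect both the dominant-pattern and the low-rate contributions despite being unremarkable.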

References

  1. Little RJA, Rubin DB. (2019). Statistical Analysis with Missing Data (3rd ed.). John Wiley and Sons
  2. George SL, Buyse M. (2015). Data fraud in clinical trials. Clinical Investigation
  3. Taloni A, Scorcia V, Giannaccare G. (2023). Large language model advanced data analysis abuse to create a fabricated dataset in medical research. JAMA Ophthalmology
  4. Rubin DB. (1976). Inference and missing data. Biometrika 63(3):581-592
  5. Little RJA. (1988). A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association 83(404):1198-1202
  6. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512