ResAIKit
Research Integrity Toolkit
T8 | Image forensics | Table Analysis | Layer 1 (Deterministic)

Missing Data (Table)

Flags tables that are suspiciously perfect. Real data almost always has some missing entries, some outliers, and uneven variability between columns, so a large table with zero missing values, zero outliers, and unnaturally uniform column spreads is a warning sign of fabrication. Deliberate scientific markers (dashes, "n.s." for not significant) are distinguished from genuine missingness.

Technical description

Extracts the table grid by OCR, proceeds only if the table holds statistical data, and reads three signals. Missing rate: empty and not-available cells divided by total cells, with deliberate markers (dashes, ellipses, n.s., ref, significance symbols) not counted as missing; a table of more than 50 cells with zero missing cells is flagged. Outliers: values more than three SD from their column mean, counted across columns with at least 10 values; a table of more than 50 numeric values with zero outliers is flagged. SD uniformity: the coefficient of variation of the column SDs; a value below 0.05 is flagged. The score steps up from 0 to 5 as flags accumulate.
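The three signals described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's own code: the function names, the exact marker and missing-token strings, the use of the sample SD, and the choice to keep markers in the denominator of the missing rate are all hypothetical, since the description does not specify them.

```python
import statistics

# Assumed token lists: the description names the categories, not the exact strings.
DELIBERATE_MARKERS = {"-", "--", "\u2013", "\u2014", "...", "\u2026",
                      "n.s.", "ns", "ref", "*", "**", "***"}
MISSING_TOKENS = {"", "na", "n/a", "nan"}


def is_missing(cell):
    """True for empty or not-available cells; deliberate markers never count."""
    token = cell.strip().lower()
    if token in DELIBERATE_MARKERS:
        return False
    return token in MISSING_TOKENS


def missing_rate(cells):
    """Missing cells over total cells (assumption: markers stay in the total)."""
    if not cells:
        return 0.0
    return sum(is_missing(c) for c in cells) / len(cells)


def count_outliers(column, min_values=10, k=3.0):
    """Count values more than k sample SDs from the column mean.

    Columns with fewer than min_values entries are skipped (count 0).
    """
    if len(column) < min_values:
        return 0
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # assumption: sample SD, not population SD
    if sd == 0:
        return 0
    return sum(1 for v in column if abs(v - mean) > k * sd)


def column_sd_cv(columns):
    """Coefficient of variation of the column SDs, or None if undefined."""
    sds = [statistics.stdev(col) for col in columns if len(col) >= 2]
    if len(sds) < 2:
        return None
    mean_sd = statistics.fmean(sds)
    if mean_sd == 0:
        return None
    return statistics.stdev(sds) / mean_sd


if __name__ == "__main__":
    cells = ["1.2", "", "NA", "-", "n.s.", "3.4"]
    print("missing rate:", missing_rate(cells))  # "-" and "n.s." are not missing
    print("outliers:", count_outliers([10.0] * 19 + [100.0]))
    print("SD CV:", column_sd_cv([[1, 2, 3], [11, 12, 13], [21, 22, 23]]))
```

In this sketch a table would be flagged when `missing_rate` is 0 on more than 50 cells, when the summed `count_outliers` is 0 on more than 50 numeric values, or when `column_sd_cv` falls below 0.05.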

How it works

Layer 1 (deterministic): computes the missing rate (excluding scientific markers), counts outliers beyond three SD per numeric column, and measures the coefficient of variation of column SDs. Zero missing on a large table scores 2.5; zero missing together with zero outliers scores 4.0; all three flags together, including uniform SDs, score 5.0. A table with real missingness scores 0 to 1.0. Each flag becomes a finding.
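The score ladder can be illustrated as a small function. This is a hedged sketch only: the `Signals` container and `score_table` name are hypothetical, and the text does not say how a score between 0 and 1.0 is chosen for tables with real missingness, nor what happens when uniform SDs appear without zero outliers. Both cases are filled in below with clearly marked assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Signals:
    """Hypothetical container for the three measured signals."""
    total_cells: int
    missing_cells: int
    numeric_values: int
    outliers: int
    sd_cv: Optional[float]  # CV of column SDs, None if it could not be computed


def score_table(s: Signals) -> Tuple[float, List[str]]:
    """Return (score, findings) following the 2.5 / 4.0 / 5.0 ladder."""
    zero_missing = s.total_cells > 50 and s.missing_cells == 0
    zero_outliers = s.numeric_values > 50 and s.outliers == 0
    uniform_sds = s.sd_cv is not None and s.sd_cv < 0.05

    # Each flag that fires becomes one finding.
    findings = []
    if zero_missing:
        findings.append("zero missing values on a table of more than 50 cells")
    if zero_outliers:
        findings.append("zero outliers among more than 50 numeric values")
    if uniform_sds:
        findings.append("column SDs are unnaturally uniform (CV below 0.05)")

    if not zero_missing:
        # Assumption: 1.0 if some other flag fired, else 0.0. The source only
        # says such tables score "0 to 1.0".
        score = 1.0 if (zero_outliers or uniform_sds) else 0.0
    elif not zero_outliers:
        # Assumption: uniform SDs alone do not lift the score past 2.5.
        score = 2.5
    elif not uniform_sds:
        score = 4.0
    else:
        score = 5.0
    return score, findings


if __name__ == "__main__":
    print(score_table(Signals(60, 0, 60, 0, 0.01)))
    print(score_table(Signals(60, 3, 60, 2, 0.30)))
```

The ladder is cumulative by design: the higher scores are reachable only when the zero-missing flag is already present, which matches the statement that the higher bands require more than one signal.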

Why this matters

Absence of the normal imperfections of data is a recognized fabrication signal: genuine measurement is messy, with dropouts, instrument failures, and occasional extreme values, so authentic tables carry missingness, outliers, and uneven column spreads. Statistical detective work has exposed fabricated datasets partly because they were too clean and regular to have arisen from real sampling. Excluding deliberate markers and skipping non-statistical tables keeps the screen honest, so the joint absence of imperfections is read only where it is meaningful.

Score thresholds

0-1: The table has the missing values and variability expected of real data
2-3: A large table with no missing values at all, which is unusual
4-5: No missing values and no outliers, optionally with unnaturally uniform column spreads, consistent with fabricated data

Limitations

A complete table is not proof of fabrication: small, curated, or fully observed datasets legitimately have no missing values, and controlled measurements can have no outliers, so the flags are suggestive and the higher bands require more than one signal. The outlier test uses the mean and SD, which an outlier inflates, so one extreme can mask others. The missing count depends on OCR recognising blanks and NA cells and on the marker list matching the table's conventions. Uniform SDs can arise from genuinely comparable measures. The statistical-data gate skips mostly-text tables. Granularity and copy-paste are separate indicators; T8 stays on completeness, outliers, and spread uniformity.
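The masking effect named above is easy to reproduce. The sketch below is illustrative only, with a hypothetical `count_outliers` helper that assumes the sample SD: one extreme value in a column of 12 is caught by the three-SD rule, but a second extreme inflates the mean and SD enough that neither is caught.

```python
import statistics


def count_outliers(column, k=3.0):
    """Count values more than k sample SDs from the column mean."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # assumption: sample SD
    if sd == 0:
        return 0
    return sum(1 for v in column if abs(v - mean) > k * sd)


# Eleven ordinary values and one extreme: the extreme is flagged.
one_extreme = [10.0] * 11 + [100.0]

# Ten ordinary values and two extremes: the pair inflates the SD
# and both fall inside the three-SD band.
two_extremes = [10.0] * 10 + [100.0, 100.0]

if __name__ == "__main__":
    print("one extreme  ->", count_outliers(one_extreme), "outlier(s)")
    print("two extremes ->", count_outliers(two_extremes), "outlier(s)")
```

A related arithmetic point, again assuming the sample SD: a single value among n can lie at most (n - 1) / sqrt(n) SDs from the mean, which is about 2.85 for n = 10, so a column needs at least 11 values before a lone extreme can cross the three-SD line.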

References

  1. Simonsohn U. (2013). Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science 24(10):1875-1888
  2. Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
  3. Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods 48(4):1205-1226