Missing Data (Table)
Flags tables that are suspiciously perfect. Real experimental data almost always has some missing entries, some statistical outliers, and uneven variability between columns; a large table with zero missing values, zero outliers, and unnaturally uniform column spreads is a hallmark of fabrication. The indicator measures the missing-data rate, counts outliers, and checks whether column standard deviations are implausibly similar, while distinguishing genuine missingness from the dashes and not-significant markers that scientific tables use deliberately. It runs only on tables that hold statistical data.
Technical description
T8 is a deterministic, generator-agnostic screen for data that is too clean to be real. Genuine measurement is messy: participants drop out, instruments fail, and natural variation produces occasional extreme values, so authentic tables carry a nonzero rate of missing entries and a sprinkling of outliers, and their columns vary in spread. Fabrication, by contrast, tends to produce complete, tidy, homogeneous tables because inventing realistic gaps and extremes is extra work that is easy to skip. T8 extracts the table grid by OCR, confirms it holds statistical data so that a text or layout table is not judged, and reads three signals: the fraction of cells that are genuinely missing, the number of statistical outliers across numeric columns, and the coefficient of variation of the column standard deviations. Zero missing on a large table, zero outliers across many values, and near-identical column spreads each raise the score, and together they reach the maximum.
How it works
Once the grid has been extracted by OCR and has passed the statistical-data gate, T8 counts the missing cells. The count includes empty cells and explicit not-available placeholders, but deliberately excludes scientific markers such as dashes, ellipses, not-significant labels, reference markers, and significance symbols, which are meaningful entries rather than gaps. The missing rate is that count divided by the total number of cells, and a rate of exactly zero on a table of more than fifty cells is flagged.
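The counting rule above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names and the two marker sets are assumptions, since the text names the categories of marker but does not give the exact lists.

```python
from typing import List, Optional, Tuple

# Hypothetical marker sets; the real lists are not given in the description.
NOT_AVAILABLE = {"na", "n/a", "nan", "null", "missing"}
DELIBERATE_MARKERS = {
    "-", "\u2013", "\u2014",        # hyphen, en dash, em dash
    "...", "\u2026",                # ellipses
    "ns", "n.s.",                   # not significant
    "ref", "ref.", "reference",     # reference markers
    "*", "**", "***",               # significance symbols
}


def is_missing(cell: Optional[str]) -> bool:
    """True for empty cells and not-available placeholders only."""
    text = (cell or "").strip().lower()
    if text in DELIBERATE_MARKERS:
        return False  # a meaningful entry, not a gap
    return text == "" or text in NOT_AVAILABLE


def missing_rate(grid: List[List[Optional[str]]]) -> Tuple[int, int, float]:
    """Return (missing count, total cells, missing rate) for a table grid."""
    cells = [cell for row in grid for cell in row]
    total = len(cells)
    missing = sum(is_missing(cell) for cell in cells)
    rate = missing / total if total else 0.0
    return missing, total, rate


def zero_missing_flag(grid: List[List[Optional[str]]], min_cells: int = 50) -> bool:
    """Flag a table of more than `min_cells` cells with no missing entries."""
    missing, total, _ = missing_rate(grid)
    return total > min_cells and missing == 0
```

For example, in the grid `[["1.2", "-", "NA"], ["", "ns", "3.4"]]` only the `NA` and the empty cell count as missing; the dash and `ns` are treated as deliberate entries.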
For each numeric column with at least ten values, outliers are counted as values more than three standard deviations from the column mean, and zero outliers across more than fifty numeric values is flagged. For each column with at least five values the standard deviation is computed, and when at least three such columns are present the coefficient of variation of those standard deviations (their standard deviation divided by their mean) is taken; a value below 0.05, meaning the column spreads vary by less than five percent of their average, is flagged as suspiciously uniform.
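A minimal sketch of the two column-level checks, assuming the sample standard deviation (the description does not say whether the sample or population form is used). The function names and default parameters are illustrative, not taken from the indicator.

```python
import statistics
from typing import List, Optional, Sequence


def count_outliers(values: Sequence[float], min_values: int = 10, z: float = 3.0) -> int:
    """Count values more than `z` sample standard deviations from the mean.

    Columns with fewer than `min_values` entries are not assessed.
    """
    if len(values) < min_values:
        return 0
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    if sd == 0:
        return 0  # a constant column has no outliers by this rule
    return sum(abs(v - mean) > z * sd for v in values)


def sd_coefficient_of_variation(
    columns: List[Sequence[float]], min_values: int = 5, min_columns: int = 3
) -> Optional[float]:
    """Coefficient of variation of the column standard deviations.

    Returns None when fewer than `min_columns` columns qualify or when
    the mean of the standard deviations is zero.
    """
    sds = [statistics.stdev(col) for col in columns if len(col) >= min_values]
    if len(sds) < min_columns:
        return None
    mean_sd = statistics.fmean(sds)
    if mean_sd == 0:
        return None
    return statistics.stdev(sds) / mean_sd
```

One consequence of this rule is worth knowing. With the sample standard deviation, no value in a column of n entries can lie more than (n - 1)/sqrt(n) standard deviations from the mean, which is about 2.85 for n = 10, so under that assumption a column needs at least eleven values before the three-sigma test can fire at all.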
The score follows a ladder: a table with more than five percent missing scores 0, a table with some missing data but no more than five percent scores 1.0, zero missing on a large table scores 2.5, zero missing combined with zero outliers scores 4.0, and all three flags together score 5.0. Each flag becomes a finding. The metadata records the cell and missing counts, the missing rate, the numeric-value and outlier counts, the three flag states, the standard-deviation coefficient of variation, and the number of columns assessed.
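The ladder can be written as a short function. This is a sketch under stated assumptions: the name `t8_score` and its signature are hypothetical, "a little missing" is read as a rate above zero and at most five percent, and combinations the ladder does not mention are handled as noted in the comments.

```python
from typing import Optional


def t8_score(
    missing_rate: float,
    zero_missing_flag: bool,
    zero_outliers_flag: bool,
    uniform_sd_flag: bool,
) -> Optional[float]:
    """Map the missing rate and the three flags onto the described ladder.

    Returns None for a case the description does not cover
    (zero missing on a table too small to raise the flag).
    """
    if missing_rate > 0.05:
        return 0.0
    if missing_rate > 0:
        return 1.0  # assumption: any rate in (0, 0.05]
    if not zero_missing_flag:
        return None  # not specified in the description
    if zero_outliers_flag and uniform_sd_flag:
        return 5.0
    if zero_outliers_flag:
        return 4.0
    # Assumption: uniform spreads without the zero-outlier flag
    # do not raise the score above the zero-missing rung.
    return 2.5
```

Checking the missing rate first mirrors the order of the ladder: any missing data at all keeps the table in the lowest band regardless of the other two signals.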
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | The table has the missing values and variability expected of real data. |
| 2 to 3 | A large table with no missing values at all, which is unusual. |
| 4 to 5 | No missing values and no outliers, optionally with unnaturally uniform column spreads. Consistent with fabricated data. |
Why this matters
Absence of the normal imperfections of data is a recognized fabrication signal, and statistical detective work has repeatedly exploited it: Simonsohn exposed fabricated datasets partly because they were too clean and too regular to have arisen from real sampling, lacking the noise and the occasional extreme that genuine data carry [1]. Forensic re-analysis of trials applies the same intuition, treating distributions that are too tidy as a marker of non-random or invented data [2], and large-scale consistency screening of the literature shows how often reported statistics fail basic internal checks [3]. The discipline of the screen lies in not crying wolf. Scientific tables are full of deliberate dashes and not-significant markers that are not missing data, and excluding them keeps the missing rate honest. The statistical-data gate, in turn, stops a text table, which is trivially complete and outlier-free, from being read as a suspiciously perfect dataset. By combining missingness, outliers, and spread uniformity, T8 distinguishes the believable mess of real data from the implausible neatness of invention.
Limitations
A complete table is not proof of fabrication: small, carefully curated, or fully observed datasets legitimately have no missing values, and well-controlled measurements can have no extreme outliers, so the flags are suggestive rather than conclusive and the score is built to require more than one signal for the higher bands. The outlier test uses the mean and standard deviation, both of which an outlier itself distorts, so a single extreme value can mask others. The missing-data count depends on OCR reading blank and not-available cells correctly, and on the marker list matching the table's conventions. Uniform column standard deviations can arise from genuinely comparable measures on the same scale. The statistical-data gate skips mostly-text tables. The granularity and copy-paste screens are separate indicators, so T8 stays on completeness, outliers, and spread uniformity.
Theoretical background
T8 rests on the statistics of real data collection. Missingness is a near-universal feature of empirical datasets because the processes that generate data (recruitment, measurement, and recording) all fail intermittently, so the probability that a sizeable table has exactly zero missing entries is low. Outliers are similarly expected: any distribution with tails produces occasional values several standard deviations from the centre, and across many measurements at least a few should appear, so their total absence is itself informative. Variability between columns reflects the fact that different variables have different natural scales and dispersions, so column standard deviations should differ; their near-equality suggests the numbers were generated by one process to a target rather than measured. Fabrication optimises for surface plausibility and overlooks these second-order properties, producing tables that are cleaner, tamer, and more uniform than sampling allows. Reading the joint absence of missingness, outliers, and spread variation therefore tests whether a table bears the statistical fingerprints of having been collected.
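The outlier argument can be put in rough numbers. The sketch below assumes independent, normally distributed values with a known mean and standard deviation, which real tables only approximate; the function names are illustrative. Under that assumption a value falls more than three standard deviations from the centre with probability about 0.0027.

```python
import math


def two_sided_tail(z: float = 3.0) -> float:
    """P(|Z| > z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2.0))


def expected_outliers(n: int, z: float = 3.0) -> float:
    """Expected number of values beyond z standard deviations among n."""
    return n * two_sided_tail(z)


def prob_no_outliers(n: int, z: float = 3.0) -> float:
    """Probability that none of n independent normal values exceeds z SDs."""
    return (1.0 - two_sided_tail(z)) ** n
```

Under these assumptions the chance of seeing no three-sigma outlier is about 0.87 for fifty values and about 0.07 for one thousand, so the zero-outlier signal carries little weight on its own near the fifty-value threshold and gains force as the table grows. Heavier-tailed data would make the absence of outliers more surprising than this normal baseline suggests.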
References
1. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
2. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
3. Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods. 2016;48(4):1205-1226. DOI: 10.3758/s13428-015-0664-2