S15 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

No Imperfections

Flags datasets that look too clean to be real. Genuine data collection produces missing values, occasional extreme observations, and dropout, so a study whose tables have no gaps, whose numeric columns contain no outliers, and whose text claims perfect follow-up or zero attrition is unusual enough to warrant a second look. The indicator runs three checks for these signs of suspicious perfection and combines them.

Technical description

A deterministic screen for the absence of real-world imperfections, combining three sub-checks whose activation pattern sets the score. (1) Zero missing: across all tables, if there are more than fifty cells and none is empty or a missing marker, the data are flagged as suspiciously complete; the recognised markers include blanks, N/A variants, the placeholder symbols journals use (em dash, en dash, hyphen, middle dot, bullet, ellipsis), and textual markers like not-detected and not-reported, so a table that does mark its gaps is not mistaken for a complete one. (2) Zero outliers: in any numeric column with at least ten parsed values, it checks whether every value lies within the robust band median +/- 3*MAD (the median absolute deviation scaled by 1.4826, per Leys et al. 2013, 2019), a rule that resists the masking by which an outlier inflates the mean and SD and so hides itself. (3) Text signals: it scans for perfection claims such as no dropout, 100% follow-up, no missing data, zero attrition, complete data, no adverse events.
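The robust band in sub-check (2) can be sketched as follows. This is a minimal illustration built only from the description above, not the indicator's actual code; the function names `column_has_outlier` and `zero_outliers_sign` and the constant names are hypothetical.

```python
from statistics import median

MAD_SCALE = 1.4826  # scaling that makes the MAD comparable to an SD under normality
BAND_WIDTH = 3.0    # half-width of the band, in scaled MADs
MIN_VALUES = 10     # columns with fewer parsed values are not checked


def column_has_outlier(values):
    """Return None if the column is skipped, otherwise True when at least
    one value lies outside median +/- 3 * scaled MAD."""
    if len(values) < MIN_VALUES:
        return None
    centre = median(values)
    scaled_mad = MAD_SCALE * median(abs(v - centre) for v in values)
    if scaled_mad == 0:
        # Degenerate band (a majority of values identical): skip the column.
        return None
    return any(abs(v - centre) > BAND_WIDTH * scaled_mad for v in values)


def zero_outliers_sign(columns):
    """The sign is set only if at least one column was checked
    and no checked column contained an outlier."""
    statuses = [column_has_outlier(col) for col in columns]
    checked = [s for s in statuses if s is not None]
    return bool(checked) and not any(checked)
```

Because the centre and spread are a median and a MAD, a single extreme value barely moves the band, so it cannot widen the band enough to hide itself, which is the masking problem a mean and SD band suffers from.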

How it works

Layer 1 (deterministic): the missing-data check tallies all table cells, classifies a cell as missing when its trimmed lower-cased text matches a known marker, and flags zero-missing only when the cell count exceeds fifty and the missing count is zero (warning). The outlier check parses each column's numerics and, for columns with at least ten values, tests whether all lie within median +/- 3*MAD with the MAD scaled by 1.4826 (a column whose scaled MAD is zero, meaning half or more values are identical, is ignored); if at least one column was checked and none had an outlier, the sign is set (info). The text check matches a fixed phrase list (info). Score: all three signs give 5.0; zero-missing plus zero-outliers give 4.0; zero-missing alone 2.5; a text claim alone or zero-outliers alone 1.5; none 0. Metadata records zero_missing, zero_outliers, and text_signals.
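The missing-data check, the text check, and the score combination described above might look like the sketch below. The marker set and the function names are assumptions made for illustration, and the real marker list is likely longer. The description does not say how zero-missing plus a text claim, or zero-outliers plus a text claim, are scored, so the sketch assumes they fall through to the single-sign values.

```python
import re

# Assumed marker set paraphrasing the description; the real list may differ.
MISSING_MARKERS = {
    "", "n/a", "na", "n.a.",
    "\u2014", "\u2013", "-", "\u00b7", "\u2022", "\u2026", "...",
    "not detected", "not-detected", "not reported", "not-reported",
}
MIN_CELLS = 50  # the check fires only when the cell count exceeds this

PERFECTION_PHRASES = (
    "no dropout", "100% follow-up", "no missing data",
    "zero attrition", "complete data", "no adverse events",
)
# Word boundaries keep "complete data" from matching inside "incomplete data".
PHRASE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in PERFECTION_PHRASES) + r")\b"
)


def check_zero_missing(tables):
    """tables: list of tables, each a list of rows, each a list of cell strings."""
    cells = [cell for table in tables for row in table for cell in row]
    missing = sum(1 for cell in cells if cell.strip().lower() in MISSING_MARKERS)
    return len(cells) > MIN_CELLS and missing == 0


def find_text_signals(text):
    """Return the perfection phrases found in the text, in order of appearance."""
    return PHRASE_RE.findall(text.lower())


def combine_score(zero_missing, zero_outliers, text_signal):
    if zero_missing and zero_outliers and text_signal:
        return 5.0
    if zero_missing and zero_outliers:
        return 4.0
    if zero_missing:
        return 2.5  # assumption: also used for zero-missing plus a text claim
    if zero_outliers or text_signal:
        return 1.5  # assumption: also used for zero-outliers plus a text claim
    return 0.0
```

For example, `combine_score(True, True, False)` returns 4.0 and `combine_score(False, False, True)` returns 1.5, matching the values listed above.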

Why this matters

Real data are messy, and the absence of that mess is informative. Simonsohn showed fabrication can be exposed by statistics alone when reported results are too clean or too similar to have come from genuine sampling, identifying two cases from the implausibly low variability of their summary statistics. Carlisle's forensic re-analyses treat improbably consistent and complete data as an integrity signal across many trials, and the classic biostatistical account of fraud lists the lack of normal imperfections (complete follow-up, absent outliers) among the patterns distinguishing invented from genuine data. A fabricator focused on a clean publishable result rarely adds the missing values, extremes, and dropout real studies accumulate, so their absence, especially in combination, is a coherent fingerprint. Because each sign has innocent explanations, S15 reserves its highest scores for their co-occurrence.

Score thresholds

0: The data show the normal imperfections of real collection.
1-2: A single weak sign: a perfection claim in the text, or numeric columns with no outliers.
2-3: Tables with more than fifty cells and no missing values at all.
4-5: Several signs together: complete data with no outliers (4.0), rising to the maximum (5.0) when the text also makes explicit perfection claims.

Limitations

Each sub-check is a heuristic with benign explanations, so a high score prompts inspection rather than proving misconduct. The zero-outlier check is the weakest: in a small sample, having no value beyond the robust three-MAD band is the normal expectation, not a surprise, so the sign is meaningful only in larger columns, and S15 never relies on it alone for a high score. The band itself is sound: unlike a mean and SD band, a median and MAD band is robust to the very outliers it seeks. Genuinely complete data exist (small, well-managed, or registry studies), so zero missing is not proof. The text-phrase list is fixed and literal, so it misses paraphrases and can match a legitimate fact, such as a small safety study that truly saw no adverse events. The checks depend on correct table parsing and a complete marker list. The thresholds (fifty cells, ten values, three scaled MADs) are directional rather than calibrated cut-offs. Outlier and distributional checks on individual-patient data are in the D series.
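The small-sample point can be made concrete with a back-of-envelope calculation. The sketch below assumes independent, normally distributed values and treats the sample median and scaled MAD as if they equalled the population mean and SD; both are simplifications, and `prob_no_outliers` is a hypothetical helper, not part of the indicator.

```python
from math import erf, sqrt


def prob_no_outliers(n, band=3.0):
    """Approximate probability that all n normal values fall inside
    centre +/- band * SD, ignoring estimation error in the centre and spread."""
    p_inside = erf(band / sqrt(2.0))  # P(|Z| <= band) for a standard normal
    return p_inside ** n


for n in (10, 100, 1000):
    print(n, round(prob_no_outliers(n), 3))
```

Under these assumptions a clean column is expected roughly 97% of the time at ten values and about 76% of the time at a hundred, and becomes unusual (around 7%) only near a thousand, which is why the sign carries little weight on its own.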