Textbook Data (Table)
Flags data that matches statistical theory too closely to be real. Genuine measurements scatter: they are never perfectly normal, never exactly equal in spread across variables, and almost never land on the round effect sizes that textbooks use as examples. Data that is too good (flawlessly Gaussian distributions, identical column spreads, effect sizes that fall exactly on conventional benchmarks) is a classic fabrication signal, the same one that first drew suspicion to Mendel's remarkably clean genetics data. The indicator tests numeric columns for excessive normality, uniform spread, and textbook effect sizes. It runs only on tables that hold statistical data.
Technical description
T9 is a deterministic-input, model-based screen for data that is suspiciously close to statistical ideals. Real data carries sampling noise, so empirical distributions deviate from perfect normality, the spreads of different variables differ, and observed effect sizes take arbitrary values rather than the round numbers that serve as conventions. Fabrication that aims to look textbook-correct overshoots, producing data that fits theory more closely than sampling allows. T9 extracts the table grid by OCR, confirms it holds statistical data, and applies three checks to the numeric columns: whether every column is excessively normal by the Shapiro-Wilk test, whether the column standard deviations are implausibly uniform, and whether any pair of columns has a Cohen's d within a hundredth of a conventional benchmark. Each check that fires is a flag, and the flag count sets the score. As a Layer 2 indicator it applies statistical models rather than closed-form arithmetic.
How it works
After OCR extraction and a statistical-data gate, the numeric columns with at least ten values are collected. The normality check runs the Shapiro-Wilk test on each column and flags the table only when every column returns a p-value above 0.99, meaning the data is not just consistent with normality but almost indistinguishable from a perfect Gaussian, which real samples rarely achieve. The spread-uniformity check computes the standard deviation of each column and flags the table when the coefficient of variation of those standard deviations falls below 0.05, that is, when the standard deviation of the column spreads is less than five percent of their mean. The effect-size check computes Cohen's d for each pair of columns using the pooled standard deviation and flags any pair whose d falls within 0.01 of a conventional benchmark of 0.2, 0.5, or 0.8.
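The three checks can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the indicator's actual code: the function and constant names are hypothetical, sample (n-1) standard deviations are assumed throughout, the benchmark comparison is assumed to use the absolute value of d, and the normality check assumes SciPy is available for `scipy.stats.shapiro`.

```python
import statistics
from itertools import combinations

MIN_VALUES = 10               # columns need at least ten values
NORMALITY_P = 0.99            # Shapiro-Wilk p-value threshold
SPREAD_CV = 0.05              # threshold on the CV of column standard deviations
BENCHMARKS = (0.2, 0.5, 0.8)  # conventional small, medium, large d
BENCHMARK_TOL = 0.01          # distance from a benchmark that counts as a hit


def eligible(columns):
    """Keep only numeric columns with at least MIN_VALUES values."""
    return [c for c in columns if len(c) >= MIN_VALUES]


def excessively_normal(columns):
    """True only when every column has a Shapiro-Wilk p-value above 0.99."""
    from scipy.stats import shapiro  # lazy import: only this check needs SciPy
    return bool(columns) and all(shapiro(c).pvalue > NORMALITY_P for c in columns)


def uniform_spread(columns):
    """True when the column standard deviations have a CV below 0.05."""
    sds = [statistics.stdev(c) for c in columns]
    if len(sds) < 2 or statistics.mean(sds) == 0:
        return False  # assumption: too few columns or zero spread is not flagged
    return statistics.stdev(sds) / statistics.mean(sds) < SPREAD_CV


def cohens_d(a, b):
    """Cohen's d for two columns, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    if pooled_var == 0:
        return float("nan")  # assumption: constant columns never match a benchmark
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def benchmark_pairs(columns):
    """Index pairs whose |d| lies within 0.01 of 0.2, 0.5, or 0.8."""
    hits = []
    for (i, a), (j, b) in combinations(enumerate(columns), 2):
        d = abs(cohens_d(a, b))
        if any(abs(d - bench) <= BENCHMARK_TOL for bench in BENCHMARKS):
            hits.append((i, j))
    return hits
```

Each function returning a truthy result would correspond to one flag. Whether the real implementation uses sample or population standard deviations, or signed rather than absolute d, is not stated in this description and would shift borderline cases.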
The flag count maps to the score: zero flags scores 0, one scores 2.0, two scores 3.5, and three scores 4.5. Each flag becomes a finding, escalating to error severity when all three fire together. The metadata records the number of numeric columns, the flag count, and the flag descriptions.
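The mapping from flags to score, findings, and metadata can be sketched as below. The score values and the escalation to error severity come from the description above; the function name, the dictionary keys, and the `"warning"` label for the non-escalated severity are hypothetical choices made for illustration.

```python
# Score for each possible flag count, as stated in the description.
FLAG_SCORES = {0: 0.0, 1: 2.0, 2: 3.5, 3: 4.5}


def summarise(flag_descriptions, n_numeric_columns):
    """Turn the list of fired flags into a score, findings, and metadata."""
    n_flags = len(flag_descriptions)
    # Findings escalate to "error" only when all three checks fire together.
    # "warning" is an assumed name for the lower severity level.
    severity = "error" if n_flags == 3 else "warning"
    findings = [
        {"severity": severity, "message": text} for text in flag_descriptions
    ]
    return {
        "score": FLAG_SCORES[n_flags],
        "findings": findings,
        "metadata": {
            "numeric_columns": n_numeric_columns,
            "flag_count": n_flags,
            "flags": list(flag_descriptions),
        },
    }
```

For example, `summarise(["uniform spread", "benchmark effect size"], 4)` would give a score of 3.5 with two non-escalated findings.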
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | The data shows the natural deviation from theory expected of real measurements. |
| 2 to 3.5 | One or two textbook-perfect properties, which may be coincidence or a clean dataset. |
| 4 to 5 | Excessive normality, uniform spreads, and benchmark effect sizes together. Consistent with data fabricated to look textbook-correct. |
Why this matters
The oldest documented case of suspected data fabrication turned on data that fit theory too well: Fisher's reanalysis of Mendel's pea-breeding results found that the observed ratios matched the predicted three-to-one so closely that the chi-square was far smaller than chance should allow, the signature of data adjusted toward expectation [1]. The same logic underlies modern fabrication detection, where datasets are exposed not because they are wrong but because they are implausibly clean and regular, lacking the noise that sampling guarantees [2]. Forensic re-analysis of trials treats distributions that are too tidy as a marker of invention rather than measurement [3]. Effect sizes are a particularly telling case: a fabricator reaching for a plausible result naturally writes down the conventional small, medium, or large value, whereas a real study almost never lands exactly on one. By testing for normality that is too perfect, spreads that are too equal, and effect sizes that are too round, T9 reads the fingerprints of data built to match a textbook rather than collected from the world.
Limitations
Fitting theory closely is not always fabrication: large, clean, well-controlled datasets can be genuinely close to normal, and standardized instruments can produce comparable spreads, so a single flag is suggestive rather than conclusive, and any score above 2.0 requires more than one flag. The Shapiro-Wilk threshold of 0.99 is strict, which keeps false positives low but can miss subtler over-fitting. The effect-size check compares all column pairs, including pairs that do not represent a treatment and control of the same measure, so a coincidental match to a round benchmark across unrelated columns is possible and grows more likely as the number of columns increases; a single effect-size flag should therefore be read cautiously. The test depends on accurate optical character recognition and on each column having at least ten values. The statistical-data gate skips mostly-text tables. The granularity tests T2 to T4 address a different kind of impossibility; T9 is confined to excessive goodness of fit.
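To see how quickly the all-pairs comparison grows, the short sketch below counts the column pairs and estimates the chance of at least one coincidental benchmark match. The per-pair probability is a purely hypothetical input, and treating pairs as independent is a simplifying assumption (pairs that share a column are not independent), so the output is an order-of-magnitude guide only.

```python
from math import comb


def pair_count(n_columns):
    """Number of distinct column pairs the effect-size check compares."""
    return comb(n_columns, 2)


def chance_of_any_match(n_columns, per_pair_probability):
    """Chance that at least one pair matches a benchmark by coincidence.

    per_pair_probability is an assumed, illustrative value; pairs are
    treated as independent, which real tables only approximate.
    """
    return 1.0 - (1.0 - per_pair_probability) ** pair_count(n_columns)
```

A table with 4 numeric columns yields 6 pairs, while one with 10 columns yields 45, so the same per-pair chance of coincidence compounds many more times in a wide table.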
Theoretical background
T9 rests on the inevitability of sampling noise. Any finite sample from a population is a random draw, so its histogram departs from the population shape, its variance differs from the true variance, and any statistic computed from it scatters around its expected value. The probability that a real sample is almost exactly normal, or that several variables have almost exactly equal spread, or that a difference lands within a hundredth of a textbook benchmark, is small precisely because noise pushes real data off these ideal points. Fabrication inverts the relationship: a person constructing data to look correct anchors on the ideal (the normal curve, a target spread, a conventional effect) and produces values closer to it than chance permits. Fisher formalised this for Mendel with a goodness-of-fit statistic that was suspiciously good, and the same reasoning generalises: when the deviation from theory is smaller than sampling would produce, the data was more likely shaped by an expectation than generated by a process. Reading several such over-fits together compounds the implausibility, which is why T9 weighs the joint presence of textbook-perfect properties.
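A rough worked figure for the normality check shows how small these probabilities are. Assume, as an idealisation not stated in the description, that the k numeric columns are independent samples from exactly normal populations. Under that null hypothesis a test's p-value is approximately uniform on [0, 1], so with p_j denoting the Shapiro-Wilk p-value of column j:

```latex
P(p_j > 0.99) \approx 0.01,
\qquad
P\!\left(\bigcap_{j=1}^{k} \{\, p_j > 0.99 \,\}\right) \approx 0.01^{k},
\qquad
\text{e.g. } k = 3 \;\Rightarrow\; 0.01^{3} = 10^{-6}.
```

Even genuinely normal data would therefore clear the 0.99 threshold in every column only very rarely. Columns in a real table are often correlated, so the joint figure should be read as an approximation rather than an exact rate.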
References
1. Fisher RA. Has Mendel's work been rediscovered? Annals of Science. 1936;1(2):115-137. DOI: 10.1080/00033793600200111
2. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
3. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938