ResAIKit
Research Integrity Toolkit
G7-img · Image forensics · Chart Analysis · Layer 1 (Deterministic)

Too-Perfect Data

Reads the plotted numeric values off a chart and tests whether their distribution is unnaturally clean: too-perfect normality, near-zero skewness and kurtosis, a complete absence of outliers, or an implausibly low spread. Axis tick labels are removed first, because they are equidistant by construction and would otherwise make any chart look uniform. It works from optical character recognition (OCR) of the plotted numbers and applies fixed statistical tests rather than a learned detection model.

Technical description

G7 is a deterministic, generator-agnostic screen for fabricated data, applying the forensic-statistics principle that invented numbers are too regular: real measurements carry sampling noise, occasional outliers, and asymmetry, while fabricated data tends toward suspicious cleanliness. The indicator extracts every number from the figure by OCR, separates the plotted data values from the axis tick labels (which are uniform by design and must not enter the test), and runs five distributional checks on the data values: a Shapiro-Wilk normality test, the skewness, the excess kurtosis, the presence of outliers by the interquartile-range (IQR) rule, and the coefficient of variation (CV). Each suspiciously clean result adds to a 0 to 5 score (capped). It requires the image to be at least 32 by 32 pixels and at least ten plotted data values after axis labels are removed.

How it works

The indicator runs deterministically at Layer 1 using two helpers: extract_numbers, which OCRs every number in the figure, and detect_axes, which locates the axis lines used to separate tick labels from plotted data.

The first step removes the axis tick labels, because they form an evenly spaced coordinate grid rather than measured data. Writing the bounding box of an OCR number as (x, y, w, h), its center is (center_x, center_y) = (x + w/2, y + h/2). A number is classified as an axis label, and excluded, when its center lies to the left of the detected y-axis line or below the detected x-axis line; when an axis line is not found, the fixed fallbacks center_x <= 0.12 · W and center_y >= 0.88 · H are used, where W and H are the image width and height and y is measured downward from the top of the image. The surviving positive values x_1, ..., x_n are the plotted data, and the five tests below run only when n >= 10.
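The positional split described above can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: the OcrNumber type, the split_axis_labels name, and the y_axis_x and x_axis_y parameters are hypothetical, and applying the inclusive comparison (<=, >=) to a detected axis line as well as to the fallback is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OcrNumber:
    """One number read by OCR with its pixel bounding box (hypothetical type)."""
    value: float
    x: float  # left edge
    y: float  # top edge (image coordinates, y grows downward)
    w: float  # box width
    h: float  # box height


def split_axis_labels(
    numbers: List[OcrNumber],
    img_w: float,
    img_h: float,
    y_axis_x: Optional[float] = None,  # x of the detected y-axis line, if found
    x_axis_y: Optional[float] = None,  # y of the detected x-axis line, if found
) -> Tuple[List[float], int]:
    """Return (plotted data values, count of numbers removed as axis labels)."""
    # Fall back to fixed fractions of the image size when a line was not found.
    left_limit = y_axis_x if y_axis_x is not None else 0.12 * img_w
    bottom_limit = x_axis_y if x_axis_y is not None else 0.88 * img_h

    data: List[float] = []
    removed = 0
    for num in numbers:
        center_x = num.x + num.w / 2
        center_y = num.y + num.h / 2
        if center_x <= left_limit or center_y >= bottom_limit:
            removed += 1  # axis tick label: excluded from the tests
        elif num.value > 0:
            data.append(num.value)  # surviving positive value
    return data, removed
```

The caller would then skip the five tests whenever fewer than ten values survive.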

The normality test is the Shapiro-Wilk test, defined for 3 <= n <= 5000. Its statistic, written W here and unrelated to the image width W used above, is W = ( Σ_{i=1}^{n} a_i · x_(i) )² / Σ_{i=1}^{n} (x_i − x̄)², where x_(1) <= ... <= x_(n) are the sorted values, x̄ = (1/n) Σ x_i is the sample mean, and the weights a_i are the tabulated constants derived from the expected values and the covariance matrix of the order statistics of a standard normal sample. W lies in (0, 1] and approaches 1 as the sample approaches perfect normality; the test converts W into a p-value, and a p-value above 0.99, meaning the data matches a normal distribution implausibly well, contributes 1.5 at warning severity.
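A minimal sketch of this check, assuming SciPy is available and that its scipy.stats.shapiro routine is an acceptable stand-in for whatever implementation the toolkit uses. The function name shapiro_contribution and its return shape are hypothetical.

```python
from scipy import stats


def shapiro_contribution(values, p_threshold=0.99):
    """Return (W statistic, p-value, score contribution) for the normality check."""
    n = len(values)
    if not 3 <= n <= 5000:
        # Outside the range for which the test is defined: contribute nothing.
        return None, None, 0.0
    w_stat, p_value = stats.shapiro(values)
    # A p-value above the threshold means the fit to normality is implausibly good.
    contribution = 1.5 if p_value > p_threshold else 0.0
    return float(w_stat), float(p_value), contribution
```

A strongly skewed sample, such as the cubes of 1 to 30, yields a small p-value and therefore no contribution.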

Skewness and kurtosis are computed from the central moments m_k = (1/n) Σ_{i=1}^{n} (x_i − x̄)^k. The sample skewness is g_1 = m_3 / m_2^(3/2), which is zero for a symmetric distribution, and an absolute value |g_1| < 0.01 contributes 1.0 at warning severity. The sample excess kurtosis is g_2 = m_4 / m_2² − 3, which is zero for a normal distribution, and an absolute value |g_2| < 0.1 contributes 0.5 at info severity.
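The two moment-based checks can be written directly from these definitions. The sketch below uses only the standard library; the function names are hypothetical, and raising an error when all values are identical (m_2 = 0) is an assumption, since the source does not say how that case is handled.

```python
def central_moment(values, k):
    """k-th central moment m_k = (1/n) * sum((x_i - mean)^k)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def shape_contributions(values):
    """Return (g1, g2, skewness contribution, kurtosis contribution)."""
    m2 = central_moment(values, 2)
    if m2 == 0:
        raise ValueError("all values identical: skewness and kurtosis undefined")
    g1 = central_moment(values, 3) / m2 ** 1.5   # sample skewness
    g2 = central_moment(values, 4) / m2 ** 2 - 3.0  # sample excess kurtosis
    skew_score = 1.0 if abs(g1) < 0.01 else 0.0
    kurt_score = 0.5 if abs(g2) < 0.1 else 0.0
    return g1, g2, skew_score, kurt_score
```

For the perfectly symmetric values 1, 2, ..., 11 the skewness is zero, so the skewness check fires, while the excess kurtosis of −1.22 is far from zero and the kurtosis check does not.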

The outlier test uses the interquartile range. With Q1 and Q3 the 25th and 75th percentiles and IQR = Q3 − Q1, a value is an outlier when it falls outside the fence [Q1 − 1.5 · IQR, Q3 + 1.5 · IQR]. When n > 30 and the number of outliers is exactly zero, the data is unnaturally clean and contributes 1.0 at warning severity.
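A short sketch of the IQR rule follows. The source does not state which percentile convention is used; this example assumes linear interpolation over the sample (the "inclusive" method of Python's statistics.quantiles), and the function name is hypothetical.

```python
import statistics


def outlier_contribution(values, min_n=30):
    """Return (number of IQR outliers, score contribution)."""
    # Quartiles by linear interpolation (assumed convention).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = sum(1 for x in values if x < low or x > high)
    # Only a large sample with no outliers at all counts as unnaturally clean.
    score = 1.0 if len(values) > min_n and n_outliers == 0 else 0.0
    return n_outliers, score
```

With this convention, the 40 values 1 to 40 have no outliers and contribute 1.0, while adding a single value of 1000 produces one outlier and removes the contribution.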

The dispersion test is the coefficient of variation CV = s / x̄, where s = sqrt[ (1/n) Σ (x_i − x̄)² ] is the standard deviation. A value CV < 0.02 together with a mean x̄ > 1, an implausibly tight spread, contributes 1.0 at warning severity.
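As a sketch of the dispersion check, using the population standard deviation exactly as defined above. The function name is hypothetical, and returning no CV when the mean is zero is an assumption.

```python
import math


def cv_contribution(values):
    """Return (coefficient of variation, score contribution)."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return None, 0.0  # CV is undefined for a zero mean (assumed handling)
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    cv = s / mean
    # Tight spread only counts when the mean is above 1.
    score = 1.0 if cv < 0.02 and mean > 1 else 0.0
    return cv, score
```

Ten values hovering within one unit of 100 give a CV of about 0.006 and contribute 1.0; the same relative spread around a mean below 1 contributes nothing because of the mean condition.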

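The five contributions are summed and the score is reported as min(5.0, total). Because the maximum contributions (1.5, 1.0, 0.5, 1.0, and 1.0) add up to exactly 5.0, the cap acts only as a safeguard. The metadata records n, the total count of numbers found, the count removed as axis labels, the Shapiro-Wilk p-value, the skewness, the excess kurtosis, the outlier count, and the CV.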
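The aggregation itself is a capped sum. In the sketch below the dictionary keys and the function name are hypothetical; only the weights and the cap come from the description above.

```python
def g7_score(contributions):
    """Sum the per-check contributions and cap the result at 5.0."""
    total = sum(contributions.values())
    return min(5.0, total)


# Example: every check fires at its maximum weight.
all_fired = {
    "shapiro": 1.5,
    "skewness": 1.0,
    "kurtosis": 0.5,
    "outliers": 1.0,
    "cv": 1.0,
}
```

Here g7_score(all_fired) evaluates to 5.0, and a chart that trips only the Shapiro-Wilk and kurtosis checks scores 2.0.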

Score thresholds

Score Meaning
0 to 1 The plotted values carry normal sampling messiness: some skew, some spread, the occasional outlier.
2 to 3 Two or three cleanliness signals together, such as near-perfect symmetry, suspicious normality, or an unusually tight spread (no single check contributes more than 1.5).
4 to 5 Several signals together: perfectly normal, symmetric, outlier-free, and tightly clustered. Consistent with fabricated data.

Why this matters

Detecting fabrication from the shape of the data alone has a strong track record. Simonsohn showed that two fabrication cases could be identified from reported means and standard deviations because the values were inconsistent with random sampling and excessively similar, ruling out benign explanations once the raw data was examined [1]. Carlisle applied the same logic at scale, screening 5087 randomized trials and finding baseline data that was too similar, or too dissimilar, to be consistent with the sampling variability genuine randomization produces [2]. A recent review of the field catalogues these methods and states the underlying signature plainly: fabricated data exhibits unexpectedly uniform distributions and an absence of the natural variation that real measurements carry [3]. G7 turns that signature into deterministic tests on the values a chart actually plots. The one prerequisite is reading the right numbers, which is why the indicator first separates plotted data from the axis grid: recovering the data values from a chart is the task of chart data extraction, and the axis ticks it also recovers are a coordinate scale, not a sample [4]. Testing the data, and only the data, for too-perfect normality, symmetry, and tightness is a direct, model-free fabrication screen.

Limitations

G7 needs at least ten plotted data values read by OCR after axis labels are removed, so charts that print only a few values, or whose numbers OCR cannot read, are not scored, and bar charts that show only bars and an axis yield no data values to test. The axis-label split is positional and assumes a conventional left y-axis and bottom x-axis; an axis on another side, or numbers in a title or legend, can be misclassified. The cleanliness thresholds are deliberately strict so that ordinary data does not trip them, which means subtle fabrication that keeps some noise will pass. The pooled values are treated as one sample, so a chart that plots several genuinely different series together can look messier than any one series. Digit-level fabrication signatures, mean-and-standard-deviation plausibility, and p-value distributions live in sibling chart indicators, so G7 stays on the shape of the plotted distribution to avoid duplicating them.
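The pooling limitation is easy to see with made-up numbers. In the sketch below, both series are invented for illustration: each is implausibly tight on its own, yet the pooled sample has a large coefficient of variation and would not trip the dispersion check.

```python
import statistics


def cv(values):
    """Coefficient of variation using the population standard deviation."""
    return statistics.pstdev(values) / statistics.fmean(values)


# Two hypothetical series plotted on the same chart.
series_a = [10.0, 10.1, 9.9, 10.0, 10.1, 9.9]
series_b = [100.0, 101.0, 99.0, 100.0, 101.0, 99.0]
pooled = series_a + series_b

cv_a = cv(series_a)        # below the 0.02 threshold
cv_b = cv(series_b)        # below the 0.02 threshold
cv_pooled = cv(pooled)     # far above it, because the two series differ in level
```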

Theoretical background

G7 operationalises the forensic-statistics consensus that fabricated data is too clean. Genuine measurements are draws from a noisy process, so they are rarely perfectly normal, never perfectly symmetric, almost always contain an outlier or two in a large sample, and carry a non-trivial spread. Each test targets one of those properties: Shapiro-Wilk for too-good normality, skewness for forced symmetry, kurtosis for a manufactured Gaussian peak, the IQR rule for a suspicious lack of extremes, and the CV for an implausibly tight cluster. The decisive design choice is what enters the tests. Because a chart's axis is a uniform coordinate grid rather than a sample, including its tick labels would manufacture exactly the cleanliness the indicator hunts for, so they are removed first and only the plotted data is tested. Each signal is a property of the data distribution rather than a learned fingerprint, which keeps the screen independent of how the chart was produced.
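A small illustration of why the tick labels must be excluded, using invented values: a set of evenly spaced axis ticks is perfectly symmetric, so its skewness is zero and it would trip the symmetry check on its own, whereas a plausibly noisy data sample does not.

```python
def skewness(values):
    """Sample skewness g1 = m3 / m2^(3/2) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical axis ticks: an evenly spaced grid from 0 to 100.
ticks = [float(t) for t in range(0, 101, 10)]

# Hypothetical plotted values with one high reading (made up for illustration).
data = [12.0, 15.5, 14.2, 31.0, 13.8, 16.1, 12.9, 18.4, 14.7, 22.3]

ticks_look_fabricated = abs(skewness(ticks)) < 0.01
data_looks_fabricated = abs(skewness(data)) < 0.01
```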

References

  1. Simonsohn U. Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science. 2013;24(10):1875-1888. DOI: 10.1177/0956797613480366
  2. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
  3. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12121900/
  4. Luo J, Li Z, Wang J, Lin CY. ChartOCR: Data Extraction from Charts Images via a Deep Hybrid Framework. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2021. https://openaccess.thecvf.com/content/WACV2021/html/Luo_ChartOCR_Data_Extraction_From_Charts_Images_via_a_Deep_Hybrid_WACV_2021_paper.html