ResAIKit
Research Integrity Toolkit
G4-img · Image forensics · Chart Analysis · Layer 1 (Deterministic)

Error Bars

Screens the error bars on a bar chart for the signatures of decorative or fabricated whiskers: error bars not centered on their bar tops, near-identical or pixel-identical lengths across conditions, a uniform symmetric stamped template, and error bars present on only some of the bars. It works from the detected bar and error-bar geometry alone, with no model.

Technical description

G4 is a deterministic, generator-agnostic screen for error bars that were drawn as decoration rather than computed from data. Genuine error bars, whether they show a standard deviation (SD), a standard error of the mean (SEM), or a confidence interval (CI), vary in length from one condition to the next, sit centered on the bar top (the mean), and are present on every bar that reports a measurement. Error bars that were copied as a graphic, or invented to make a figure look rigorous, break one or more of those expectations. The indicator detects the bars and the error-bar segments, then sums four signals (centering, length uniformity, a symmetric stamped template, and coverage consistency) into a score capped at 5. It requires an image of at least 32 by 32 pixels containing at least two detected error bars.

How it works

The indicator runs deterministically at Layer 1 using detect_bars and detect_error_bars, the latter linking each error-bar segment to its nearest parent bar. The analysis runs when at least two error bars are detected.
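The text does not say how "nearest" is measured when an error bar is linked to its parent bar. As a minimal sketch, assuming the link is made by horizontal distance between centers, the step might look like the following. The `Bar` record and `link_to_nearest_bar` helper are hypothetical names for illustration and are not the signatures of detect_bars or detect_error_bars.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Bar:
    """Hypothetical bounding box for a detected bar, in pixel coordinates."""
    x_left: float
    x_right: float
    y_top: float
    y_bottom: float

    @property
    def x_center(self) -> float:
        return (self.x_left + self.x_right) / 2.0


def link_to_nearest_bar(error_bar_x: float, bars: Sequence[Bar]) -> int:
    """Return the index of the bar whose horizontal center is closest.

    Assumption: "nearest" means the smallest horizontal distance between
    the error bar's x position and the bar's center.
    """
    if not bars:
        raise ValueError("at least one bar is required")
    return min(range(len(bars)), key=lambda i: abs(bars[i].x_center - error_bar_x))
```

Returning an index rather than the bar itself makes it straightforward to count how many distinct bars carry an error bar later on.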

The centering test compares each error bar to the top of its bar. For an error bar with vertical center y_c linked to a bar whose top edge is at pixel row y_bar, the offset is o = |y_c − y_bar|. An offset o > 3 pixels counts as a centering issue, since a genuine whisker is anchored on the mean it represents. With c such issues the signal contributes min(2.0, 0.5 · c), and each off-center error bar is reported with its region.
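The centering rule can be sketched in a few lines. This is an illustration under stated assumptions, not the toolkit's code: the function name `centering_signal` and the input shape (a list of error-bar center and bar-top pairs) are hypothetical, while the 3-pixel tolerance and the min(2.0, 0.5 · c) contribution come from the description above.

```python
from typing import Iterable, List, Tuple


def centering_signal(
    pairs: Iterable[Tuple[float, float]], tolerance_px: float = 3.0
) -> Tuple[float, List[int]]:
    """Score the centering test.

    pairs holds (error_bar_center_y, bar_top_y) for each linked error bar.
    Returns the capped contribution and the indices of off-center error bars.
    """
    off_center = [
        i for i, (y_c, y_bar_top) in enumerate(pairs)
        if abs(y_c - y_bar_top) > tolerance_px
    ]
    # Each issue adds 0.5, capped at 2.0 (reached at four issues).
    return min(2.0, 0.5 * len(off_center)), off_center
```

An offset of exactly 3 pixels is not flagged, because the comparison is strict.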

The uniformity test examines the error-bar lengths. Each of the k error bars has pixel length ℓ_i = y_bottom,i − y_top,i. With mean ℓ̄ = (1/k) Σ ℓ_i and population standard deviation s = sqrt[ (1/k) Σ (ℓ_i − ℓ̄)² ], the coefficient of variation is CV = s / ℓ̄. A value CV < 0.05 means the lengths are effectively identical and contributes 2.0 at error severity. When every integer length is the same value, the finding escalates to a pixel-identical exact-duplicate pattern; otherwise it reports near-identical lengths together with the CV.
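A minimal sketch of the uniformity calculation follows. The function name, the returned dictionary keys, and the "varied" label are hypothetical; the sketch also assumes that "integer length" means the length rounded to whole pixels and that the pixel-identical escalation applies only once the CV threshold is met.

```python
import math
from typing import Dict, Sequence, Union


def uniformity_signal(
    lengths: Sequence[float], cv_threshold: float = 0.05
) -> Dict[str, Union[float, str]]:
    """Compute the length CV and the uniformity contribution."""
    k = len(lengths)
    if k < 2:
        raise ValueError("at least two error bars are required")
    mean = sum(lengths) / k
    if mean <= 0:
        raise ValueError("error-bar lengths must be positive")
    # Population standard deviation: divide by k, not k - 1.
    s = math.sqrt(sum((length - mean) ** 2 for length in lengths) / k)
    cv = s / mean
    if cv >= cv_threshold:
        return {"cv": cv, "score": 0.0, "pattern": "varied"}
    # Assumption: integer length = length rounded to whole pixels.
    identical = len({round(length) for length in lengths}) == 1
    pattern = "pixel-identical" if identical else "near-identical"
    return {"cv": cv, "score": 2.0, "pattern": pattern}
```

For lengths of 20, 21, 20, and 21 pixels the mean is 20.5 and s is 0.5, so CV is about 0.024 and the lengths are reported as near-identical.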

The template test treats symmetry as a confirming factor rather than a standalone signal. An error bar is symmetric when its upper half (y_c − y_top) and lower half (y_bottom − y_c) differ by less than 1 pixel. Symmetry on its own is not a tell, because a mean plus or minus a standard deviation is symmetric by construction on a linear axis. It therefore contributes 1.0 at warning severity only in combination with length uniformity (CV < 0.05): whiskers that are both identical in length and symmetric are consistent with a single stamped graphic reused across the chart.
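The gating of symmetry on uniformity can be sketched as below. The name `template_signal` and the (y_top, y_c, y_bottom) tuple layout are hypothetical, and the sketch assumes the template fires only when every error bar is symmetric, which is an inference from the all-symmetric flag rather than something the description states outright.

```python
from typing import Iterable, Tuple


def template_signal(
    whiskers: Iterable[Tuple[float, float, float]],
    cv: float,
    symmetry_tol_px: float = 1.0,
    cv_threshold: float = 0.05,
) -> float:
    """Return 1.0 when all whiskers are symmetric AND lengths are uniform.

    whiskers holds (y_top, y_c, y_bottom) per error bar, with y increasing
    downward as in image coordinates.
    """
    all_symmetric = all(
        abs((y_c - y_top) - (y_bottom - y_c)) < symmetry_tol_px
        for y_top, y_c, y_bottom in whiskers
    )
    # Symmetry alone never scores; it only confirms a uniform-length finding.
    return 1.0 if (all_symmetric and cv < cv_threshold) else 0.0
```

Passing the CV in as an argument keeps the dependency explicit: the same symmetric whiskers score 0.0 when their lengths vary.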

The coverage test checks how many bars carry a whisker. Let the chart have b bars and let the error bars cover q distinct bars through their parent links. The test applies only when b ≥ 4, in which case the coverage ratio is r = q / b. A ratio r < 0.5, meaning error bars sit on fewer than half the bars, contributes 1.0 at warning severity.
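As a rough sketch of the coverage rule, assuming each error bar carries the index of its parent bar, the ratio and contribution could be computed as follows. The function name and the choice to return None for the ratio when the chart has too few bars are illustrative assumptions.

```python
from typing import Iterable, Optional, Tuple


def coverage_signal(
    n_bars: int,
    parent_indices: Iterable[int],
    min_bars: int = 4,
    min_ratio: float = 0.5,
) -> Tuple[Optional[float], float]:
    """Return (coverage ratio, contribution) for the coverage test."""
    if n_bars < min_bars:
        # Too few bars to judge coverage; the test is skipped.
        return None, 0.0
    # q is the number of DISTINCT bars that have at least one error bar.
    ratio = len(set(parent_indices)) / n_bars
    return ratio, (1.0 if ratio < min_ratio else 0.0)
```

Using a set matters when two error-bar segments link to the same bar: they count once toward q. A chart with error bars on exactly half its bars is not flagged, because the comparison is strict.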

The four contributions are summed and reported as min(5.0, total). The metadata records the bar and error-bar counts, the centering-issue count, the length CV, the pixel-identical, all-symmetric, and template flags, and the coverage ratio.
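The final aggregation is a capped sum. A short sketch, with the hypothetical function name `g4_total`:

```python
def g4_total(
    centering: float, uniformity: float, template: float, coverage: float
) -> float:
    """Sum the four signal contributions and cap the result at 5.0."""
    return min(5.0, centering + uniformity + template + coverage)
```

Given the per-signal maxima stated above (2.0, 2.0, 1.0, and 1.0), the raw sum can reach 6.0, so the cap only binds when all four signals fire at full strength.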

Score thresholds

Score     Meaning
0 to 1    Error bars of varying length, centered on their bar tops, present on every bar. Consistent with whiskers computed from data.
2 to 3    One signal present: off-center error bars, near-identical lengths, a uniform symmetric template, or error bars on only some bars.
4 to 5    Several signals together, such as a pixel-identical symmetric template that is also off-center or partial. Consistent with decorative or fabricated error bars.

Why this matters

Error bars on bar charts are a known weak point in the scientific record, on both the honesty and the integrity axes. A systematic review of 703 articles showed that bar graphs with error bars hide the underlying distribution, since many different datasets produce the same bar and the same whisker, so the error bar carries little information and is easy to draw without reference to data [1]. That weakness is compounded by widespread misreporting: an audit of three cardiovascular journals found that 64 percent of articles misused the SEM, routinely choosing the variability measure that yields the smallest, most reassuring error bar [3]. Against that backdrop, error bars that are decorative or fabricated leave geometric traces. Forensic-statistics reviews of how to detect fabricated data center on excessive uniformity, the observation that invented numbers and summary statistics are implausibly regular because people and templates do not reproduce the natural variation of real measurements [2]. G4 reads that uniformity directly off the figure: identical or pixel-identical whisker lengths, a symmetric stamped template repeated across conditions, error bars floating off their bar tops, or error bars present on only some bars are each a way the drawing departs from whiskers that were computed per condition. None of these requires reading the numbers; they are properties of how the error bars were placed.

Limitations

G4 analyses a chart only when at least two error bars are detected, and its signals depend on the bar and error-bar detection recovering the geometry; charts whose error bars are too faint, too short, or occluded are not scored. The thresholds are directional rather than exact. Some flagged patterns can occur honestly: experiments with genuinely similar variability across conditions produce similar error-bar lengths, and a chart may legitimately omit error bars on a reference or control bar, so the uniformity and coverage signals are screening cues that warrant review rather than proof. The symmetry signal is deliberately confined to the uniform case to avoid flagging the ordinary symmetric error bar. Whether the reported mean and SD are arithmetically possible for the stated sample size, and whether a bar's drawn height matches its printed value, live in sibling chart indicators, so G4 stays on the error-bar geometry to avoid duplicating them.

Theoretical background

G4 formalises the difference between a computed error bar and a drawn one. A computed whisker inherits the variability of its condition, so across a chart the lengths vary, each whisker is centered on its mean, and every measured bar carries one. A decorative or fabricated whisker is placed by hand or copied as a graphic, so it tends toward uniformity, toward a symmetric template reused unchanged, toward misalignment with the bar top, and toward inconsistent presence. Each signal is a structural property of placement rather than a learned fingerprint, which keeps the screen robust and free of dependence on any particular chart tool. The uniformity signal is the strongest, grounded in the forensic-statistics principle that fabricated quantities are too regular; the centering, template, and coverage signals add independent geometric evidence of decoration.

References

  1. Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biology. 2015;13(4):e1002128. DOI: 10.1371/journal.pbio.1002128
  2. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025. https://pmc.ncbi.nlm.nih.gov/articles/PMC12121900/
  3. Wullschleger M, Aghlmandi S, Egger M, Zwahlen M. High Incorrect Use of the Standard Error of the Mean (SEM) in Original Articles in Three Cardiovascular Journals Evaluated for 2012. PLoS One. 2014;9(10):e110364. DOI: 10.1371/journal.pone.0110364