ResAIKit
Research Integrity Toolkit
S2 | Statistical analysis | Statistical Consistency | Layer 1 (Deterministic)

CI-Effect Consistency

Checks arithmetic relationships among reported statistics. A confidence interval must contain the effect estimate it accompanies, and for a ratio measure the estimate must sit at the geometric centre of its interval. Separately, the standard error must equal the standard deviation over the square root of the sample size. An interval that does not bracket its estimate, a ratio interval not centred on its estimate, or a standard error inconsistent with its SD and N is flagged. It works on the reported numbers alone, with no model.

Technical description

S2 is a deterministic, generator-agnostic screen for three exact relationships that genuine statistics obey. (1) Containment: a confidence interval (CI) is the estimate plus or minus a multiplier times its standard error, so it brackets the estimate; an interval that excludes its nearby effect size is a reporting error. (2) Log-symmetry of ratio intervals: the CI of a ratio measure, such as an odds ratio (OR), relative risk (RR), or hazard ratio (HR), is symmetric on the log scale, so the estimate equals the geometric mean of the confidence limits; an estimate far off that geometric centre, even when contained, is inconsistent. (3) Standard-error identity: the standard error (SE) of a mean is the standard deviation (SD) over the square root of the sample size (N). S2 scans the text for CIs written in bracket, colon, and 'to' styles and finds the nearest effect size of the common types within a proximity window. It then checks containment, applies the log-symmetry check to ratio measures, and separately checks the square-root identity on SE, SD, and N triplets. The markers are matched as whole words, and the confidence level is read from the text rather than fixed at 95 percent.
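The three relationships can be illustrated on numbers that have already been extracted. The following is a minimal sketch, not the S2 implementation; the function names `contains`, `log_offset`, and `expected_se` are hypothetical and chosen only for this example.

```python
import math

def contains(low, estimate, high):
    """Containment: the closed interval [low, high] must bracket the estimate."""
    return low <= estimate <= high

def log_offset(low, estimate, high):
    """Absolute log distance between a ratio estimate and the geometric centre."""
    geo = math.sqrt(low * high)  # geometric mean of the two limits
    return abs(math.log(estimate / geo))

def expected_se(sd, n):
    """Standard error of a mean implied by a reported SD and sample size."""
    return sd / math.sqrt(n)

# Illustrative values: OR = 2.0 with a CI of 1.0 to 4.0; SD = 10.0 with N = 25.
print(contains(1.0, 2.0, 4.0))    # True
print(log_offset(1.0, 2.0, 4.0))  # 0.0
print(expected_se(10.0, 25))      # 2.0
```

An odds ratio of 2.0 sits exactly at the geometric centre of the interval 1.0 to 4.0, so its log offset is zero, whereas the arithmetic midpoint of that interval would be 2.5.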

How it works

A confidence interval is matched by a confidence level (regex \b\d{2}(?:\.\d+)?%) followed by CI and two bounds. For each interval the nearest effect size within 200 characters is found among the whole-word markers \bOR\b, \bRR\b, \bHR\b (ratio measures) and \bd\b (a difference), and the interval is flagged (severity error) when the effect falls outside the closed interval low <= effect <= high. When a ratio effect is contained, a second check computes the geometric centre geo = sqrt(low * high) and flags (severity warning) when abs(log(effect / geo)) > log(1.5), that is when the estimate is more than a factor of 1.5 off the geometric centre of its own interval. Cohen's d is a difference, symmetric on the linear scale, so the log-symmetry check is not applied to it.
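The interval checks described above might be sketched as follows. Only the confidence-level pattern, the whole-word markers, the 200-character window, and the factor 1.5 come from the description; the bound separators accepted by `CI_RE`, the number pattern `NUM`, the use of match start positions to measure distance, and the shape of the returned tuples are assumptions made for illustration.

```python
import math
import re

NUM = r"-?\d+(?:\.\d+)?"  # assumed number format
# Confidence level, then "CI", then two bounds (separator handling is assumed).
CI_RE = re.compile(
    r"\b(\d{2}(?:\.\d+)?)%\s*CI\b[\s:=,]*[\[(]?\s*(" + NUM + r")\s*(?:,|to|-)\s*(" + NUM + r")"
)
# Whole-word effect markers followed by a value.
EFFECT_RE = re.compile(r"\b(OR|RR|HR|d)\b\s*[=:]?\s*(" + NUM + r")")
RATIO = {"OR", "RR", "HR"}
WINDOW = 200
LOG_TOL = math.log(1.5)

def check_intervals(text):
    """Return (severity, kind, marker, effect, low, high) tuples for flagged CIs."""
    effects = [(m.start(), m.group(1), float(m.group(2)))
               for m in EFFECT_RE.finditer(text)]
    flags = []
    for ci in CI_RE.finditer(text):
        low, high = float(ci.group(2)), float(ci.group(3))
        # Distance is measured between match start positions (an assumption).
        near = [e for e in effects if abs(e[0] - ci.start()) <= WINDOW]
        if not near:
            continue
        _, marker, value = min(near, key=lambda e: abs(e[0] - ci.start()))
        if not (low <= value <= high):
            flags.append(("error", "non-containment", marker, value, low, high))
        elif marker in RATIO and low > 0 and value > 0:
            geo = math.sqrt(low * high)
            if abs(math.log(value / geo)) > LOG_TOL:
                flags.append(("warning", "log-asymmetry", marker, value, low, high))
    return flags

print(check_intervals("OR = 2.10 (95% CI 1.20 to 1.90)"))
print(check_intervals("HR 1.05 (95% CI: 1.00, 9.00)"))
print(check_intervals("d = 0.50, 95% CI [0.10, 0.90]"))
```

The first string yields a non-containment error, the second a log-asymmetry warning (the geometric centre of 1.00 to 9.00 is 3.0), and the third no flag, since Cohen's d is contained and is exempt from the log-symmetry check.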

Separately, every standard error marked by \bSE\b = value is paired with its nearest \bSD\b and \b[Nn]\b within 200 characters. The expected value SD / sqrt(N) is computed, and the triplet is flagged (severity warning) when abs(SE - SD / sqrt(N)) >= 0.25 * SE. The total number of flags (non-containment, log-asymmetry, and SE inconsistency) sets the score: zero flags give 0.0, one flag gives 2.5, and two or more give 4.0. The metadata records cis_found, ci_effect_ok, ci_effect_flags, log_asym_flags, se_sd_flags, and total_flags.
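A minimal sketch of the standard-error check and the score mapping, under stated assumptions: the markers, the 200-character window, the 25 percent tolerance, and the score values follow the description above, while the helper names (`nearest`, `se_flags`, `score`), the number pattern, the requirement that SD and N also use an equals sign, and the distance measure are hypothetical.

```python
import math
import re

NUM = r"\d+(?:\.\d+)?"  # assumed number format
SE_RE = re.compile(r"\bSE\b\s*=\s*(" + NUM + r")")
SD_RE = re.compile(r"\bSD\b\s*=\s*(" + NUM + r")")
N_RE = re.compile(r"\b[Nn]\b\s*=\s*(\d+)")
WINDOW = 200

def nearest(pattern, text, pos):
    """Value of the closest match within WINDOW characters of pos, else None."""
    best = None
    for m in pattern.finditer(text):
        dist = abs(m.start() - pos)
        if dist <= WINDOW and (best is None or dist < best[0]):
            best = (dist, float(m.group(1)))
    return None if best is None else best[1]

def se_flags(text):
    """Count SE values inconsistent with the nearest SD and N."""
    count = 0
    for m in SE_RE.finditer(text):
        se = float(m.group(1))
        sd = nearest(SD_RE, text, m.start())
        n = nearest(N_RE, text, m.start())
        if sd is None or n is None or n <= 0:
            continue  # incomplete triplet: nothing to check
        if abs(se - sd / math.sqrt(n)) >= 0.25 * se:
            count += 1
    return count

def score(total_flags):
    """Map the total flag count to the indicator score."""
    if total_flags == 0:
        return 0.0
    return 2.5 if total_flags == 1 else 4.0

consistent = "Mean 12.0 (SD = 10.0, SE = 2.0, n = 25)"
inconsistent = "Mean 12.0 (SD = 10.0, SE = 5.0, N = 25)"
print(se_flags(consistent), score(se_flags(consistent)))
print(se_flags(inconsistent), score(se_flags(inconsistent)))
```

In the second string the expected standard error is 10.0 / sqrt(25) = 2.0, which differs from the reported 5.0 by more than 25 percent of the reported value, so the triplet is flagged.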

Score thresholds

Score | Meaning
0 to 1 | Confidence intervals bracket their estimates, ratio intervals are centred on them, and standard errors match the square-root identity.
2 to 3 | One inconsistency: an interval excluding its estimate, a ratio interval off its geometric centre, or a standard error inconsistent with its SD and N.
4 to 5 | Two or more such inconsistencies, consistent with reporting errors or fabricated statistics.

Why this matters

These relationships are arithmetic identities, not tendencies, so a violation is an unambiguous error. A confidence interval is constructed as the estimate plus or minus a multiplier times its standard error, the arithmetic Altman and Bland set out in showing how a CI is recovered from an estimate and a p-value [3]. The companion identity, SE equals SD over the square root of N, ties the three quantities together; the same authors stress the distinction between SD and SE because the two are routinely confused [1]. For ratio measures the interval is symmetric on the log scale, so the estimate is the geometric mean of its limits. Confidence intervals are widely recommended yet often misread, which is why machine checking of their internal consistency is useful [2]. Automated recomputation of reported statistics has repeatedly exposed internal inconsistencies in a large share of articles [4], and the maintained statcheck toolkit now performs such checks at scale [6]. Scoping reviews catalogue statistical-consistency screening among misconduct-detection methods [5], and broader surveys of the data-anomaly toolkit place these checks among the standard first passes [7]. A confidence interval that excludes its estimate, a ratio interval not centred on it, or a standard error inconsistent with its SD and N is exactly such a recomputable error, and the same logic underlies forensic screening of trial data for fabrication [8].

Limitations

The check depends on extracting intervals, effects, and triplets correctly from free text, so unusual notation and effects reported far from their intervals are missed, and the nearest-within-200-characters heuristic can pair an interval with the wrong neighbouring effect. The log-symmetry check uses a generous factor of 1.5, so only grossly off-centre ratio intervals are flagged and the ordinary asymmetry of rounding passes; it is applied only to ratio measures, since a difference such as Cohen's d is symmetric on the linear scale. The standard-error check assumes the value is the standard error of a mean, so an SE of a proportion or a model coefficient can be mismatched. Markers are matched as whole words, so a nonstandard abbreviation is not recognised, and only two-digit confidence levels are detected. Granularity and arithmetic-sum checks live in sibling indicators S1, S3, and S4.
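How much slack the factor of 1.5 leaves can be worked out directly. The sketch below is illustrative only; `pass_band` and `can_trigger` are hypothetical helper names, and positive limits are assumed.

```python
import math

def pass_band(low, high, factor=1.5):
    """Range of ratio estimates that pass the log-symmetry check."""
    geo = math.sqrt(low * high)
    return geo / factor, geo * factor

def can_trigger(low, high, factor=1.5):
    """Whether any contained estimate could exceed the tolerance at all."""
    return high / low > factor ** 2

band_low, band_high = pass_band(1.0, 4.0)
print(round(band_low, 3), round(band_high, 3))
print(can_trigger(1.0, 4.0), can_trigger(1.2, 1.9))
```

For the interval 1.0 to 4.0 any estimate between roughly 1.33 and 3.0 passes. Because a contained estimate is never further from the geometric centre than the limits themselves, the warning can only fire when the upper limit is more than 2.25 times the lower limit; for a narrow interval such as 1.2 to 1.9 every contained estimate passes the log-symmetry check.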

Theoretical background

A confidence interval at any level is the estimate plus or minus a critical value times the standard error, so on the scale of the analysis it is centred on the estimate and must contain it. For an effect estimated on the additive scale, such as a mean difference or Cohen's d, that centring is arithmetic on the reported values directly. For a ratio measure the estimation is done on the log scale, where the interval is symmetric, so back on the natural scale the estimate is the geometric mean of the limits rather than their arithmetic mean; checking that the estimate equals the square root of the product of the limits therefore tests a genuine constraint that a fabricated or mis-transcribed interval can violate while still bracketing the estimate. The standard error of a mean is the standard deviation divided by the square root of the sample size, which follows from the variance of a sample mean being the population variance over the sample size; the three reported quantities are thus bound by one equation. Because all three relationships are exact, a clear violation cannot be sampling variation and points to a transcription error, a miscalculation, or fabrication. The tolerances and proximity windows are practical concessions to free-text reporting.
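The two less obvious identities can be written out in a few lines. The symbols are notation assumed for this sketch: \hat\theta is the ratio estimate, L and U are the lower and upper confidence limits, z is the critical value, SE_log is the standard error on the log scale, and \sigma is the population standard deviation. The last line assumes independent observations with a common variance.

```latex
\begin{align*}
\text{Log-scale interval:}\quad
  & \log L = \log\hat\theta - z\,\mathrm{SE}_{\log}, \qquad
    \log U = \log\hat\theta + z\,\mathrm{SE}_{\log} \\
\text{Adding the limits:}\quad
  & \log L + \log U = 2\log\hat\theta
    \;\Longrightarrow\; \hat\theta = \sqrt{L\,U} \\
\text{Standard error of a mean:}\quad
  & \operatorname{Var}(\bar{x}) = \frac{\sigma^{2}}{N}
    \;\Longrightarrow\; \mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{N}}
    \approx \frac{\mathrm{SD}}{\sqrt{N}}
\end{align*}
```

The critical value z cancels when the two log limits are added, so the geometric-centre relationship holds at any confidence level.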

References

  1. Altman DG, Bland JM. Standard deviations and standard errors. BMJ. 2005;331(7521):903. DOI: 10.1136/bmj.331.7521.903
  2. Cumming G, Finch S. Inference by eye: confidence intervals and how to read pictures of data. American Psychologist. 2005;60(2):170-180. DOI: 10.1037/0003-066X.60.2.170
  3. Altman DG, Bland JM. How to obtain the confidence interval from a P value. BMJ. 2011;343:d2090. DOI: 10.1136/bmj.d2090
  4. Nuijten MB, Polanin JR. "statcheck": Automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods. 2020;11(5):574-579. DOI: 10.1002/jrsm.1408
  5. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
  6. Nuijten MB, Epskamp S. statcheck: Extract Statistics from Articles and Recompute P-Values. R package version 1.5.0. 2024. DOI: 10.32614/CRAN.package.statcheck
  7. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
  8. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938