ResAIKit
Research Integrity Toolkit
R9 | Statistical analysis | Methodological Coherence | Layer 2 (Contextual)

Clinical Significance

Checks whether the paper interprets its results beyond the p-value by reporting an effect size or a confidence interval that shows how large and how precisely estimated an effect is. A p-value indicates only how surprising the data would be if the null hypothesis were true, not whether the effect matters, so results presented as p-values alone, or a marginal p-value described as "no effect" without an interval, can mislead about practical importance. The indicator pairs each p-value with nearby effect-size and interval reporting and flags those left unsupported. It reads the extracted p-values and the article text.

Technical description

R9 is a contextual check that statistical significance is accompanied by an indication of magnitude and precision. It scans the text for effect-size measures, such as Cohen's d, odds ratio, risk ratio, hazard ratio, number needed to treat, mean difference, and eta squared, and for confidence-interval reporting at any confidence level. For each extracted p-value it checks whether an effect size or a confidence interval appears within about two hundred characters; a p-value with neither nearby is counted as reported in isolation. Two further checks target misreadings of non-significant results. The first flags a marginal p-value, between 0.04 and 0.10, that sits near a phrase such as "not significant" or "no significant difference" without a confidence interval, because dismissing a marginal result as a null effect without an interval misrepresents what the data show. The second flags the stronger null-acceptance misreading: an explicit claim of no effect or no difference near a non-significant p-value (at or above 0.05) with no nearby confidence interval, which is the error of treating failure to reject the null as evidence for it. The score rises with the share of isolated p-values and with the number of misreadings of either kind.
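The proximity pairing described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the regular expressions, the names EFFECT_SIZE_RE, CI_RE, P_VALUE_RE and find_isolated_p_values, and the choice to measure distance between match start positions are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately small vocabulary built from the measures the
# description lists; the real indicator's patterns are not shown in the source.
EFFECT_SIZE_RE = re.compile(
    r"cohen'?s\s+d|odds\s+ratio|risk\s+ratio|hazard\s+ratio"
    r"|number\s+needed\s+to\s+treat|mean\s+difference|eta\s+squared",
    re.IGNORECASE,
)
# Accepts any confidence level, e.g. "95% CI" or "90 % confidence interval".
CI_RE = re.compile(
    r"\d{1,2}(?:\.\d+)?\s*%\s*(?:CI\b|confidence\s+interval)", re.IGNORECASE
)
P_VALUE_RE = re.compile(r"\bp\s*[=<>]\s*(0?\.\d+)", re.IGNORECASE)

WINDOW = 200  # "about two hundred characters"; the exact value is assumed


def find_isolated_p_values(text, window=WINDOW):
    """Return (position, value) for p-values with no effect size or CI nearby."""
    context_positions = [m.start() for m in EFFECT_SIZE_RE.finditer(text)]
    context_positions += [m.start() for m in CI_RE.finditer(text)]
    isolated = []
    for m in P_VALUE_RE.finditer(text):
        covered = any(abs(m.start() - pos) <= window for pos in context_positions)
        if not covered:
            isolated.append((m.start(), float(m.group(1))))
    return isolated


sample = (
    "The odds ratio was 1.8 (95% CI 1.2 to 2.7, p = 0.004). "
    + "x" * 300  # filler standing in for unrelated text
    + " Pain scores improved (p = 0.03)."
)
isolated = find_isolated_p_values(sample)
```

In the sample, the first p-value sits next to an odds ratio and an interval and is covered, while the second lies more than two hundred characters from any such mention and is returned as isolated.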

How it works

The positions of all effect-size and confidence-interval mentions are collected, with the confidence-interval pattern recognising any confidence level rather than only the most common ones. Each p-value's recorded location is compared against these positions: if neither an effect size nor an interval lies within the proximity window, the p-value is isolated; otherwise it is covered. The base score is 0.0 when no p-value is isolated, 2.0 when some but not all are, and 4.0 when every p-value is isolated. Each marginal p-value near a non-significance phrase without a nearby interval adds 0.5 and a finding. Isolated p-values produce individual findings up to a small cap, with a summary finding for the remainder. The score is capped at 5.0. The metadata records the counts of effect sizes and intervals found, the number of isolated p-values, the number of marginal misinterpretations, and the share of p-values accompanied by an effect size or interval (the coverage rate). Separately, for each non-significant p-value (at or above 0.05) the indicator searches for an explicit no-effect or no-difference claim within the proximity window; if one appears with no nearby confidence interval, it counts a null-acceptance misreading, adds a finding, and adds 0.5 to the score, with the count recorded in the metadata.
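The scoring arithmetic can be sketched as a small function. The function name r9_score and its parameters are hypothetical, and one point is an assumption rather than something the description states: the 5.0 cap is applied here after the null-acceptance increments as well as the marginal ones.

```python
def r9_score(n_p_values, n_isolated, n_marginal_misread=0, n_null_acceptance=0):
    """Return (score, coverage_rate) from the counts described above."""
    if n_p_values == 0 or n_isolated == 0:
        base = 0.0  # no p-values, or every p-value is covered
    elif n_isolated < n_p_values:
        base = 2.0  # some, but not all, p-values are isolated
    else:
        base = 4.0  # every p-value is isolated

    # Each misreading of either kind adds 0.5.
    score = base + 0.5 * n_marginal_misread + 0.5 * n_null_acceptance

    # Coverage rate: share of p-values with a nearby effect size or interval.
    # Undefined when there are no p-values, so None is returned (assumed).
    coverage = None if n_p_values == 0 else (n_p_values - n_isolated) / n_p_values

    # Assumption: the cap is applied once, after all increments.
    return min(score, 5.0), coverage
```

For instance, four p-values with two isolated and one marginal misreading give a score of 2.5 and a coverage rate of 0.5, while three p-values that are all isolated, with two marginal misreadings and one null-acceptance misreading, reach 5.5 before the cap and are reported as 5.0.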

Score thresholds

Score     Meaning
0         Effect sizes or confidence intervals accompany the p-values, or there are no p-values.
2         Some p-values are reported without a nearby effect size or interval.
4 to 5    All p-values are reported in isolation, or several marginal results are misread as null.

Why this matters

A p-value answers only whether an observed effect is surprising under a null hypothesis, not how large or how important it is, and treating it as the whole story is a long-criticised practice. The American Statistical Association's statement on p-values, by Wasserstein and Lazar, states explicitly that a p-value does not measure the size of an effect or the importance of a result and that scientific conclusions should not be based on whether a p-value passes a threshold, which is the over-reliance this indicator detects [1]. Sullivan and Feinn argue that the effect size is the main finding of a quantitative study and that reporting only a significant p-value is inadequate for a reader to understand the result, motivating the requirement that each p-value be accompanied by a measure of magnitude [2]. Gardner and Altman made the parallel case for confidence intervals, showing that an interval conveys both the estimate and its precision and supports estimation rather than a bare accept-reject decision, which is why the indicator credits a nearby interval as sufficient context [3]. The marginal-result check targets a specific and common misreading, in which a p-value just above the threshold is reported as proof of no effect, when only an interval can distinguish a precisely estimated null from an inconclusive one. Greenland and colleagues catalogued the specific ways p-values, intervals, and power are misread, including this very error of equating a non-significant result with no effect, which grounds the marginal check [4]. Modern statistical-reporting checklists ask reviewers to confirm that effect sizes and intervals accompany p-values [5], and research-integrity screening, through expert-derived warning signs [6], audits of fabricated trials [7], the INSPECT-SR instrument [8], and reviews of the data-detective toolkit [9], treats significance reported without magnitude or precision as a quality signal.

Limitations

The pairing of a p-value with an effect size or interval relies on textual proximity, so a genuinely related measure reported far away, in a table or a different sentence, can be missed and the p-value wrongly counted as isolated, while an unrelated measure that happens to sit nearby can wrongly count as support. Detection depends on the vocabulary of effect sizes and interval notations recognised, so a measure phrased outside that vocabulary is not credited. The marginal-result check keys on a fixed set of non-significance phrases and a fixed p-value band, so other phrasings or boundary values are missed. The indicator confirms that a magnitude or interval is reported, not that it is correct or correctly interpreted. Whether a reported p-value is consistent with its test statistic is the concern of the recomputation indicators, and whether the chosen test fits the design is the concern of indicator R1; R9 focuses only on whether statistical significance is reported with the context needed to judge its importance.
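The phrase and band limitation can be made concrete with a small sketch of the marginal-result check. Everything here is illustrative: the function name marginal_misreadings, the two-phrase list (taken from the examples in the technical description), the inclusive treatment of the band edges, and the window measured from the p-value match are assumptions, not the indicator's actual implementation.

```python
import re

# Hypothetical phrase list: only the two example phrases named in the
# technical description are included.
NON_SIG_RE = re.compile(
    r"not\s+significant|no\s+significant\s+difference", re.IGNORECASE
)
CI_RE = re.compile(
    r"\d{1,2}(?:\.\d+)?\s*%\s*(?:CI\b|confidence\s+interval)", re.IGNORECASE
)
P_VALUE_RE = re.compile(r"\bp\s*[=<>]\s*(0?\.\d+)", re.IGNORECASE)


def marginal_misreadings(text, low=0.04, high=0.10, window=200):
    """Return marginal p-values near a non-significance phrase with no CI nearby."""
    hits = []
    for m in P_VALUE_RE.finditer(text):
        p = float(m.group(1))
        if not (low <= p <= high):  # band edges treated as inclusive (assumed)
            continue
        nearby = text[max(0, m.start() - window): m.end() + window]
        if NON_SIG_RE.search(nearby) and not CI_RE.search(nearby):
            hits.append(p)
    return hits


flagged = marginal_misreadings("The difference was not significant (p = 0.07).")
with_ci = marginal_misreadings(
    "The difference was not significant (p = 0.07, 95% CI -0.1 to 2.3)."
)
paraphrased = marginal_misreadings("The groups did not differ (p = 0.07).")
outside_band = marginal_misreadings("The difference was not significant (p = 0.11).")
```

Under these assumptions the first sentence is flagged and the second is not, because an interval is present. The third and fourth are not flagged either, although they make the same dismissive claim: one uses a phrasing outside the list and the other has a p-value just outside the band.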

Theoretical background

R9 rests on the distinction between statistical and practical significance. A hypothesis test summarises evidence against a null into a single tail probability, which depends jointly on the effect size and the sample size, so a small and unimportant effect can yield a small p-value in a large sample, and a large and important effect can yield a non-significant p-value in a small one. The p-value alone therefore cannot convey importance, and two further quantities are needed: an effect size, which expresses the magnitude of the effect on an interpretable scale, and a confidence interval, which expresses the range of effects compatible with the data and so the precision of the estimate. Reporting both turns a binary verdict into an estimate with uncertainty, which is what allows a reader to judge clinical or practical relevance. The marginal case is the sharpest illustration of the gap: a p-value of, say, 0.07 reported as evidence of no effect conflates failing to reject the null with accepting it, an error a confidence interval resolves by showing whether the plausible effects are uniformly negligible or include important values. The indicator operationalises these principles by treating an adjacent effect size or interval as the minimal context a p-value requires, and by singling out the marginal misreading as the most consequential failure of that context. The null-acceptance check extends this to the broader error catalogued by Greenland and colleagues, in which any non-significant result is reported as proof of no effect; because only an interval can show whether the compatible effects are uniformly negligible, an unqualified no-effect claim without one is treated as a distinct misreading.
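The dependence of the p-value on both effect size and sample size can be illustrated numerically. The sketch below assumes a two-group comparison of means with equal group sizes and a known standard deviation of 1, so that the standardised difference d has standard error sqrt(2/n) and a normal approximation applies; the function names and the example values are hypothetical.

```python
import math


def two_sided_p(d, n_per_group):
    """Two-sided p-value for a standardised mean difference d (known SD = 1)."""
    z = d * math.sqrt(n_per_group / 2)  # d divided by its standard error
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for standard normal Z


def ci95(d, n_per_group):
    """Approximate 95% confidence interval for d under the same assumptions."""
    half_width = 1.96 * math.sqrt(2 / n_per_group)
    return d - half_width, d + half_width


# A negligible effect in a very large sample.
tiny_p = two_sided_p(0.05, 20000)
tiny_ci = ci95(0.05, 20000)

# A sizeable effect in a small sample.
large_p = two_sided_p(0.6, 16)
large_ci = ci95(0.6, 16)
```

Under these assumptions the negligible effect (d = 0.05) gives a p-value far below 0.001 with an interval that stays below 0.1, while the sizeable effect (d = 0.6) gives a p-value of about 0.09 with an interval running from slightly below zero to above 1. The second result is non-significant, yet its interval is compatible with large effects, which is the information a bare p-value hides.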

References

  1. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. The American Statistician. 2016;70(2):129-133. DOI: 10.1080/00031305.2016.1154108
  2. Sullivan GM, Feinn R. Using effect size: or why the P value is not enough. Journal of Graduate Medical Education. 2012;4(3):279-282. DOI: 10.4300/JGME-D-12-00156.1
  3. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ. 1986;292(6522):746-750. DOI: 10.1136/bmj.292.6522.746
  4. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology. 2016;31(4):337-350. DOI: 10.1007/s10654-016-0149-3
  5. Mansournia MA, Collins GS, Nielsen RO, et al. CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine. 2021;55(18):1002-1003. DOI: 10.1136/bjsports-2020-103651
  6. Parker L, Boughton S, Lawrence R, Bero L. Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology. 2022;151:1-17. DOI: 10.1016/j.jclinepi.2022.07.006
  7. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Crone G, Green CD. Tools of the data detective: a review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861