Clinical Significance
Checks whether the paper discusses the practical or clinical importance of results beyond just statistical significance, as relying solely on p-values is a known methodological weakness.
Technical description
R9 checks that statistical significance is accompanied by an indication of magnitude and precision. It scans the text for effect-size measures (Cohen's d, odds ratio, risk ratio, hazard ratio, number needed to treat, mean difference, eta squared) and for confidence intervals reported at any confidence level. For each extracted p-value it checks whether an effect size or interval appears within about two hundred characters; a p-value with neither is counted as reported in isolation. Two misreadings are flagged as well. The first is a marginal p-value (0.04 to 0.10) near a phrase such as 'not significant' with no confidence interval nearby, because dismissing a marginal result as null without an interval misrepresents what the data show. The second is the stronger null-acceptance misreading (Greenland 2016): an explicit no-effect or no-difference claim near a non-significant p-value (at or above 0.05) with no nearby confidence interval. The score rises with the share of isolated p-values and with the number of such misreadings.
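The proximity check described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the three regular expressions, the WINDOW constant, and the function name isolated_p_values are all assumptions made for the example, and the real vocabulary of effect sizes and interval notations is likely broader.

```python
import re

# Hypothetical patterns; the indicator's real vocabulary is not given here.
EFFECT_SIZE = re.compile(
    r"cohen'?s\s+d|odds\s+ratio|risk\s+ratio|hazard\s+ratio|"
    r"number\s+needed\s+to\s+treat|mean\s+difference|eta[\s-]squared",
    re.IGNORECASE,
)
# Accepts any confidence level, e.g. "95% CI" or "90% confidence interval".
CONF_INTERVAL = re.compile(
    r"\d{1,2}(?:\.\d+)?\s*%\s*(?:CI\b|confidence\s+interval)",
    re.IGNORECASE,
)
# Matches forms such as "p = 0.03", "P<.05", "p > 0.2".
P_VALUE = re.compile(r"\bp\s*([<>=])\s*(0?\.\d+)", re.IGNORECASE)

WINDOW = 200  # assumed reading of "about two hundred characters"


def isolated_p_values(text):
    """Return the p-value strings with no effect size or interval nearby."""
    # Character offsets of every effect-size and interval mention.
    support = [m.start() for m in EFFECT_SIZE.finditer(text)]
    support += [m.start() for m in CONF_INTERVAL.finditer(text)]

    isolated = []
    for m in P_VALUE.finditer(text):
        # Covered if any supporting mention starts within WINDOW characters.
        if not any(abs(m.start() - pos) <= WINDOW for pos in support):
            isolated.append(m.group(0))
    return isolated
```

Under these assumptions, a p-value in the same sentence as "odds ratio" or "95% CI" is treated as covered, while one several sentences away from any such mention is returned as isolated.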
How it works
Layer 2 (contextual): the positions of all effect-size and confidence-interval mentions are collected, with the interval pattern recognising any confidence level rather than only the default ones. Each p-value's location is compared against these positions: a p-value with neither an effect size nor an interval within the proximity window is marked isolated, and any other p-value is marked covered. The base score is 0.0 when no p-values are isolated, 2.0 when some but not all are, and 4.0 when all are. Each marginal p-value near a non-significance phrase without a nearby interval adds 0.5 and a finding. Each isolated p-value produces a finding up to a small cap, and the remainder are reported in a single summary. The score is capped at 5.0. The metadata records the effect-size and interval counts, the isolated-p-value count, the marginal-misinterpretation count, and the coverage rate (the share of p-values with a nearby effect size or interval). For each non-significant p-value (at or above 0.05) the indicator also searches for an explicit no-effect claim with no nearby interval; each match is counted as a null-acceptance misreading, adds a finding, and adds 0.5 to the score.
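The scoring rule above can be written out as a short sketch. The function name score_r9 and its parameters are hypothetical. Two points are assumptions rather than documented behaviour: that the 5.0 cap is applied after both kinds of 0.5 increments, and that the coverage rate is reported as 1.0 when the text contains no p-values.

```python
def score_r9(n_p_values, n_isolated, n_marginal, n_null_acceptance):
    """Return (score, coverage_rate) from the counts collected by the scan."""
    # Base score from the share of isolated p-values.
    if n_p_values == 0 or n_isolated == 0:
        base = 0.0
    elif n_isolated < n_p_values:
        base = 2.0
    else:
        base = 4.0

    # Each flagged misreading adds 0.5 (assumption: the cap comes last).
    score = min(base + 0.5 * n_marginal + 0.5 * n_null_acceptance, 5.0)

    # Share of p-values with a nearby effect size or interval
    # (assumption: 1.0 when there are no p-values at all).
    if n_p_values == 0:
        coverage = 1.0
    else:
        coverage = (n_p_values - n_isolated) / n_p_values
    return score, coverage
```

For example, four p-values of which one is isolated, plus one marginal misreading, would give a score of 2.5 and a coverage rate of 0.75 under these assumptions.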
Why this matters
A p-value answers only whether an effect is surprising under a null, not how large or important it is. The ASA's statement on p-values says explicitly that a p-value does not measure effect size or importance and that conclusions should not rest on whether it passes a threshold. The effect size is the main finding of a quantitative study, and a confidence interval conveys both the estimate and its precision, supporting estimation rather than a bare accept-reject decision. The marginal-result check targets the common misreading in which a p-value just above the threshold is reported as proof of no effect, which only an interval can resolve.
Score thresholds
- 0: Effect sizes or confidence intervals accompany the p-values, or there are no p-values
- 2: Some p-values are reported without a nearby effect size or interval
- 4-5: All p-values are reported in isolation, or several marginal results are misread as null
Limitations
Pairing a p-value with an effect size or interval is by textual proximity, so a related measure reported far away (in a table or another sentence) can be missed and the p-value wrongly counted as isolated, while an unrelated measure nearby can wrongly count as support. Detection depends on the recognised vocabulary of effect sizes and interval notations, so a measure phrased outside it is not credited. The marginal check keys on a fixed set of phrases and a fixed p-value band, so it misses other phrasings and values just outside the band. The indicator confirms that a magnitude or interval is reported, not that it is correct or correctly interpreted. P-value recomputation is handled by the recomputation indicators and design-test fit by indicator R1; R9 focuses on whether significance is reported with the context needed to judge importance.
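The phrase-list limitation can be seen in a small sketch of the marginal check. Everything here is illustrative: the NULL_PHRASES list, the reuse of a 200-character window for this check, the inclusive band endpoints, and the function name marginal_misreadings are assumptions, not the indicator's actual settings.

```python
import re

P_EXACT = re.compile(r"\bp\s*=\s*(0?\.\d+)", re.IGNORECASE)
# Hypothetical phrase list; the real fixed set is not given here.
NULL_PHRASES = re.compile(
    r"not\s+significant|non-?significant|no\s+significant",
    re.IGNORECASE,
)
CONF_INTERVAL = re.compile(
    r"\d{1,2}(?:\.\d+)?\s*%\s*(?:CI\b|confidence\s+interval)",
    re.IGNORECASE,
)
WINDOW = 200  # assumed to match the main proximity window


def marginal_misreadings(text):
    """Return marginal p-values called non-significant with no interval nearby."""
    hits = []
    for m in P_EXACT.finditer(text):
        value = float(m.group(1))
        # Band of 0.04 to 0.10, endpoints assumed inclusive.
        if not 0.04 <= value <= 0.10:
            continue
        nearby = text[max(0, m.start() - WINDOW): m.end() + WINDOW]
        if NULL_PHRASES.search(nearby) and not CONF_INTERVAL.search(nearby):
            hits.append(m.group(0))
    return hits
```

With this phrase list, "was not significant (p = 0.06)" is flagged, but "did not reach significance (p = 0.06)" is not, because that wording falls outside the recognised phrases. The same sentence with a "95% CI" nearby is not flagged either, since the interval supplies the missing context.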
References
- Wasserstein RL, Lazar NA. (2016). The ASA's statement on p-values: context, process, and purpose. The American Statistician
- Sullivan GM, Feinn R. (2012). Using effect size: or why the P value is not enough. Journal of Graduate Medical Education
- Gardner MJ, Altman DG. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ
- Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31(4):337-350
- Mansournia MA, Collins GS, Nielsen RO, Nazemipour M, Jewell NP, Altman DG, Campbell MJ. (2021). CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine 55(18):1002-1003
- Parker L, Boughton S, Lawrence R, Bero L. (2022). Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology 151:1-17
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia 76(4):472-479
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380