ResAIKit
Research Integrity Toolkit
R5 | Statistical analysis | Methodological Coherence | Layer 2 (Contextual)

Power Calculation

Checks whether the paper justifies its sample size with a power analysis and, when it does, whether the reported parameters are sensible. A study that reports no sample-size calculation, especially one with a small sample, may be underpowered. Parameters outside convention, such as a significance level above the usual 0.05 or a target power below 80 percent, weaken the justification. The indicator looks for power-analysis keywords and extracts the significance level, power, effect size, and sample size to check them. It reads the article text, preferring the Methods section.

Technical description

R5 is a contextual check for the presence and plausibility of a sample-size justification. It searches the text, favouring the Methods section, for power-analysis cues such as "power analysis", "sample size calculation", "power calculation", and "a priori power". It also treats two or more supportive cues ("effect size", "Cohen's d", "Type I error") together with a power percentage as equivalent evidence. When a calculation is present, it extracts the significance level alpha, the target power as a percentage, the effect size, and the calculated sample size, and checks them: a significance level above the conventional 0.05 inflates the false-positive rate, and a target power below 80 percent falls short of convention. When a calculation is absent, it judges the study by the largest sample size found, treating the omission as serious when the sample is small and as a moderate concern when the sample is adequate. Where regression is mentioned with a stated number of predictors, it applies the rule that the sample should be at least ten times the number of predictors. Independently of the score, it classifies the sample-size justification the paper offers according to the Lakens taxonomy: a power analysis, an accuracy or interval-width argument, a heuristic (a rule of thumb or similarity to prior work), a resource constraint, or none.
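The presence test described above can be sketched roughly as follows. This is a minimal illustration rather than the toolkit's actual code: the regular expressions, the constant names, and the `has_power_calculation` function are all assumptions made for the example, and the real patterns may be broader or stricter.

```python
import re

# Primary cues named in the description; the exact patterns are assumed.
POWER_CUES = re.compile(
    r"power\s+analys[ie]s"
    r"|sample[\s-]+size\s+calculation"
    r"|power\s+calculation"
    r"|a\s+priori\s+power",
    re.IGNORECASE,
)

# Supportive cues; each is counted at most once (assumed patterns).
SUPPORTIVE_CUES = [
    re.compile(r"effect\s+size", re.IGNORECASE),
    re.compile(r"cohen['’]?s\s+d\b", re.IGNORECASE),
    re.compile(r"type\s+(?:I|1)\s+error", re.IGNORECASE),
]

# A percentage next to the word "power", e.g. "80% power" or "power of 80%".
POWER_PERCENT = re.compile(
    r"\d{2,3}\s*%\s+power|power\s+(?:of\s+)?\d{2,3}\s*%",
    re.IGNORECASE,
)


def has_power_calculation(text: str) -> bool:
    """Return True if the text shows a power-analysis cue, or two or more
    supportive cues together with a power percentage."""
    if POWER_CUES.search(text):
        return True
    supportive = sum(1 for pattern in SUPPORTIVE_CUES if pattern.search(text))
    return supportive >= 2 and POWER_PERCENT.search(text) is not None
```

For instance, a sentence that mentions an effect size, Cohen's d, and "80% power" would count as a calculation under this sketch even though it never uses the phrase "power analysis", whereas a lone mention of effect size would not.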

How it works

Power cues and supportive cues are matched by regular expression over the search text. If a calculation is found, four values are extracted: alpha from an expression such as alpha equals a decimal, the target power from a percentage adjacent to the word power, the effect size from a d or effect-size expression, and the calculated N. A significance level above 0.05 adds 1.0 with a finding, whereas a stricter level below 0.05 is conservative and is not penalised. A target power below 80 percent adds 1.0. An assumed effect size at or above Cohen's d of 1.5, far beyond the conventional large-effect benchmark of 0.8, adds 1.0, since an inflated effect understates the sample size required (the sample-size samba). If no calculation is found, the largest N in the text or the statistical triplets is taken: a maximum N below 30 sets the score to 4.0 with a finding, and an adequate or unknown N sets it to 2.0 as a softer note. In either branch, if regression is mentioned and the maximum N is below ten times the stated number of predictors, 1.0 is added. The score is capped at 5.0. The metadata records whether a calculation was found, the extracted alpha, power, and effect size, whether that effect size is implausibly large, the maximum N and whether it reaches 30, the regression predictor count, and the Lakens justification type (power analysis, accuracy, heuristic, resource constraint, or none). The justification type is classified from the cues present as a diagnostic and does not change the score.
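The scoring rules above can be sketched as a small function. This is an illustrative reading of the rules, not the toolkit's implementation: the `Extracted` dataclass, its field names, and `r5_score` are hypothetical, and skipping the regression rule when no N is known is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Extracted:
    """Values pulled from the text (hypothetical container for this sketch)."""
    calculation_found: bool
    alpha: Optional[float] = None          # e.g. 0.05
    power_pct: Optional[float] = None      # e.g. 80 for 80 percent
    effect_size_d: Optional[float] = None  # assumed Cohen's d
    max_n: Optional[int] = None            # largest N found in the text
    regression_mentioned: bool = False
    n_predictors: Optional[int] = None


def r5_score(x: Extracted) -> float:
    score = 0.0
    if x.calculation_found:
        # Only a liberal alpha is penalised; a stricter one is not.
        if x.alpha is not None and x.alpha > 0.05:
            score += 1.0
        if x.power_pct is not None and x.power_pct < 80:
            score += 1.0
        # Implausibly large assumed effect (d at or above 1.5).
        if x.effect_size_d is not None and x.effect_size_d >= 1.5:
            score += 1.0
    elif x.max_n is not None and x.max_n < 30:
        score = 4.0  # no justification and a small sample
    else:
        score = 2.0  # no justification, adequate or unknown N

    # Ten-observations-per-predictor rule, applied in either branch.
    # Assumption: the rule is skipped when N or the predictor count is unknown.
    if (
        x.regression_mentioned
        and x.n_predictors is not None
        and x.max_n is not None
        and x.max_n < 10 * x.n_predictors
    ):
        score += 1.0

    return min(score, 5.0)
```

Under this reading, a paper with no calculation, a largest N of 20, and a regression with five stated predictors reaches the cap of 5.0, while a paper reporting alpha 0.05, 80 percent power, and d of 0.5 scores 0.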

Score thresholds

Score     Meaning
0         A power analysis is present with sensible parameters.
2 to 3    No sample-size justification with an adequate sample, or a present calculation with a questionable parameter.
4 to 5    No justification with a small sample, or several parameter problems together.

Why this matters

Justifying the sample size is a basic expectation of sound design, and getting the justification right matters as much as having one. Cohen established the framework of power analysis and the conventional target of 80 percent power, along with the effect-size benchmarks the calculation depends on, so a stated power well below that convention signals a study planned to miss real effects [1]. Button and colleagues showed that low power not only reduces the chance of detecting a true effect but also lowers the probability that a statistically significant finding is real and inflates the apparent effect size, which is why an unjustified small sample is a substantive reliability concern rather than a formality [2]. The reporting standards make the justification mandatory: CONSORT requires authors to explain how the sample size was determined, so its absence is a documented reporting gap [3]. Flagging a significance level only when it exceeds 0.05 reflects the asymmetry of the risk, since a liberal level raises false positives while a stricter level is a conservative choice that does not. Schulz and Grimes warned that investigators often work backward from a feasible sample to an optimistically large assumed effect, the sample-size samba, which lets an underpowered study appear adequately powered, so an assumed effect far above the large-effect benchmark deserves scrutiny [4]. Lakens set out the modern menu of defensible ways to justify a sample size, against which a bare or convention-only justification can be judged [5], and the analysis is reinforced by statistical-reporting checklists [6] and research-integrity screening that examines sample-size justification among its trustworthiness items [7,8,9].
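The link between power and the credibility of a significant result can be made concrete with the positive predictive value formula used by Button and colleagues [2], PPV = (power × R) / (power × R + alpha), where R is the pre-study odds that a probed effect is real. The sketch below is illustrative only: the function name and the chosen value of R are assumptions for the example and are not part of the indicator.

```python
def positive_predictive_value(power: float, alpha: float, prior_odds: float) -> float:
    """Probability that a statistically significant finding reflects a true
    effect, given the study's power, its significance level, and the
    pre-study odds R that the effect is real."""
    true_positives = power * prior_odds
    return true_positives / (true_positives + alpha)


# Illustrative pre-study odds of 1 true effect for every 4 null effects.
R = 0.25
well_powered = positive_predictive_value(power=0.80, alpha=0.05, prior_odds=R)
underpowered = positive_predictive_value(power=0.20, alpha=0.05, prior_odds=R)
print(f"PPV at 80% power: {well_powered:.2f}")
print(f"PPV at 20% power: {underpowered:.2f}")
```

With these inputs the value falls from about 0.80 to about 0.50, so at 20 percent power a significant finding is no better than a coin flip even though alpha is unchanged.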

Limitations

Detection is keyword-based, so a sample-size justification phrased without the expected cues is missed and the study is wrongly treated as lacking one, while a passing mention of effect size can be read as a calculation. The extracted parameters are the first matches found, which can belong to a different analysis, and the regression check relies on a stated predictor count and the largest N, neither of which is always the relevant figure. The thresholds (80 percent power, an alpha of 0.05, and ten observations per predictor) are conventions rather than hard limits, so a defensible departure will still be flagged; a finding is a prompt to read the justification rather than a verdict. The indicator does not recompute the sample size from the stated parameters, so it cannot detect an internally wrong calculation, only a missing one or out-of-convention inputs. Consistency of the reported sample size across sections is covered by indicator R2, so R5 focuses on the presence and plausibility of the power justification.

Theoretical background

R5 rests on the role of statistical power in the reliability of a result. Power is the probability that a study detects an effect of a given size when it is real, and it rises with the sample size, the effect size, and the significance level. Fixing a target power, a significance level, and an expected effect size therefore determines the sample size needed, which is what a power calculation reports. A study that omits this step has not shown that it could detect the effect it sought, and when its sample is small the omission is consequential, because low power both misses true effects and, as the post-study odds make clear, lowers the chance that a significant finding is true while exaggerating its magnitude. The parameter checks encode the conventions that make a calculation credible: the significance level bounds the false-positive rate, so a level above 0.05 loosens that bound and is the direction worth flagging, whereas a stricter level tightens it and is unobjectionable; the 80 percent power convention marks the accepted floor for adequate sensitivity; and the ten-observations-per-predictor heuristic guards regression against overfitting that a nominal power calculation might not capture. By reading parameters rather than recomputing them, the indicator targets the checkable surface of the justification, namely the presence of the calculation and the sanity of its inputs, which is where most reporting failures lie. The effect-size check follows the same logic in reverse: because the required sample falls as the assumed effect grows, an assumed effect far above the large-effect benchmark is a way to make a small sample appear sufficient, so it is flagged for justification rather than accepted at face value. The justification-type classification reflects Lakens's argument that a defensible sample size can rest on grounds other than power, so the indicator records which grounds are offered rather than treating the absence of a power analysis as the absence of any justification.
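How strongly the required sample depends on the assumed effect can be seen from the standard normal-approximation formula for comparing two group means, n per group = 2 (z_{1 - alpha/2} + z_{power})^2 / d^2. The sketch below is a worked illustration only: the indicator itself does not recompute sample sizes, the function name is hypothetical, and the approximation gives results slightly below exact t-based calculations.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group for a two-sided,
    two-sample comparison of means with standardised effect size d
    (normal approximation; illustrative, not the indicator's method)."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for d in (0.5, 0.8, 1.5):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

Under this approximation a medium effect of d = 0.5 needs about 63 participants per group, the large-effect benchmark of 0.8 needs about 25, and an assumed d of 1.5 needs only about 7, which is why an inflated assumed effect can make a very small study look adequately powered.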

References

  1. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988. https://www.worldcat.org/oclc/17877467
  2. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, Munafò MR. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. 2013;14(5):365-376. DOI: 10.1038/nrn3475
  3. Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c332. DOI: 10.1136/bmj.c332
  4. Schulz KF, Grimes DA. Sample size calculations in randomised trials: mandatory and mystical. The Lancet. 2005;365(9467):1348-1353. DOI: 10.1016/S0140-6736(05)61034-3
  5. Lakens D. Sample size justification. Collabra: Psychology. 2022;8(1):33267. DOI: 10.1525/collabra.33267
  6. Mansournia MA, Collins GS, Nielsen RO, et al. CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine. 2021;55(18):1002-1003. DOI: 10.1136/bjsports-2020-103651
  7. Parker L, Boughton S, Lawrence R, Bero L. Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology. 2022;151:1-17. DOI: 10.1016/j.jclinepi.2022.07.006
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Crone G, Green CD. Tools of the data detective: a review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861