ResAIKit
Research Integrity Toolkit
R10 · Statistical analysis · Methodological Coherence · Layer 2 (Contextual)

Prespecification

Checks whether the paper states its hypothesis and primary endpoint up front in the Methods, and whether the Results introduce analyses that were never planned. Presenting a finding discovered in the data as if it had been predicted (HARKing), or running unplanned subgroup analyses until something turns up, inflates the chance of a false-positive result. The indicator looks for a declared hypothesis, counts the primary endpoints declared in the Methods, and flags subgroup analyses that appear only in the Results. It reads the text of the Methods and Results sections and also looks for a registration reference anywhere in the paper.

Technical description

R10 is a contextual check for pre-specification and against HARKing (hypothesising after the results are known). It requires a Methods section and looks there for a hypothesis or primary-endpoint declaration. A trial registration or pre-registration reference anywhere in the text, such as a registry identifier or a stated registration, is treated as an equivalent pre-specification signal. The check counts the primary-endpoint declarations in the Methods and treats more than three as a sign that the pre-specification has been diluted into multiple primaries. It compares subgroup mentions between Methods and Results and flags a subgroup analysis that appears in the Results but not in the Methods as possible subgroup fishing. It also notes post-hoc or exploratory analyses, which are acceptable when openly labelled but contribute a mild signal. From low to high, the score reflects a clearly pre-specified study, a declared exploratory extension, and the clear HARKing markers of many primaries or undeclared subgroups.
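The signals described above can be sketched as a handful of regular expressions. This is a minimal illustration, not the toolkit's implementation: the pattern definitions, the function name detect_signals, the dictionary keys, and the choice to count every "primary endpoint" or "primary outcome" mention in the Methods as one declaration are all assumptions made for the example. The cue words themselves come from the How it works section.

```python
import re

# All patterns are illustrative assumptions, not ResAIKit's actual rules.
HYPOTHESIS = re.compile(
    r"\b(?:hypothes[ie]s|hypothesi[sz]ed"
    r"|primary\s+(?:endpoint|outcome|objective)s?"
    r"|we\s+aimed\s+to)\b",
    re.IGNORECASE,
)
# NCT identifiers are "NCT" plus eight digits; the other cues are keywords.
REGISTRATION = re.compile(
    r"\bNCT\d{8}\b|\bISRCTN\d+\b|clinicaltrials\.gov|\bPROSPERO\b"
    r"|\b(?:pre-?)?regist(?:ered|ration)\b",
    re.IGNORECASE,
)
PRIMARY = re.compile(r"\bprimary\s+(?:endpoint|outcome)s?\b", re.IGNORECASE)
SUBGROUP = re.compile(r"\bsub-?groups?\b", re.IGNORECASE)
POST_HOC = re.compile(r"\bpost[\s-]?hoc\b|\bexploratory\b", re.IGNORECASE)


def detect_signals(methods: str, results: str, full_text: str) -> dict:
    """Return the five keyword-level signals for one paper (hypothetical helper)."""
    return {
        # Hypothesis cues are searched in the Methods only.
        "hypothesis_found": bool(HYPOTHESIS.search(methods)),
        # A registration reference counts wherever it appears in the text.
        "registration_found": bool(REGISTRATION.search(full_text)),
        # Assumption: each mention in the Methods is one declaration.
        "primary_endpoints_in_methods": len(PRIMARY.findall(methods)),
        # Subgroup named in the Results but never in the Methods.
        "subgroup_fishing": bool(SUBGROUP.search(results))
        and not SUBGROUP.search(methods),
        "post_hoc_declared": bool(POST_HOC.search(results)),
    }
```

Because the matching is purely literal, a phrase such as "registered nurse" would also trip the registration pattern in this sketch, which is the kind of keyword-level error the Limitations section describes.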

How it works

The Methods section is searched for hypothesis cues such as hypothesis, hypothesised, primary endpoint, primary outcome, primary objective, and we aimed to. If none is found, and the text also contains no trial registration or pre-registration reference (a registry identifier such as an NCT or ISRCTN number, clinicaltrials.gov, PROSPERO, or a stated registration), the score is raised to at least 2.0 and a finding is recorded. A registration reference counts as pre-specification because a registered study has lodged its plan in advance. Primary-endpoint declarations are counted within the Methods section rather than across the whole paper, so that a single endpoint restated in the abstract, results, and discussion is not mistaken for several; more than three declared primaries raises the score to at least 4.0 and records a finding. A subgroup mention present in the Results but absent from the Methods is flagged as subgroup fishing and raises the score to at least 4.0. A post-hoc or exploratory mention in the Results raises the score to at least 2.0, reflecting that the analysis is exploratory even when properly labelled. The score is capped at 5.0. The metadata records whether a hypothesis was found, whether a registration reference was found, the number of primary endpoints declared in the Methods, whether subgroup fishing was detected, and whether a post-hoc or exploratory analysis was declared.
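The scoring rules above read as a set of floors applied to a score that starts at zero. The sketch below follows that reading; it is an illustration under stated assumptions rather than the toolkit's code. The class R10Signals, the function r10_score, the field names, and the wording of the findings are hypothetical, and the signals are taken as already detected.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class R10Signals:
    """Metadata fields named in the text; the field names are assumptions."""
    hypothesis_found: bool
    registration_found: bool
    primary_endpoints_in_methods: int
    subgroup_fishing: bool
    post_hoc_declared: bool


def r10_score(s: R10Signals) -> Tuple[float, List[str]]:
    """Apply each rule as a floor ("at least X"), then cap at 5.0."""
    score = 0.0
    findings: List[str] = []

    # No hypothesis cue in the Methods AND no registration reference anywhere.
    if not s.hypothesis_found and not s.registration_found:
        score = max(score, 2.0)
        findings.append("no hypothesis declaration or registration reference")

    # More than three primaries declared in the Methods.
    if s.primary_endpoints_in_methods > 3:
        score = max(score, 4.0)
        findings.append("more than three primary endpoints declared")

    # Subgroup analysis in the Results that the Methods never mention.
    if s.subgroup_fishing:
        score = max(score, 4.0)
        findings.append("subgroup analysis appears only in the Results")

    # Openly labelled post-hoc or exploratory analysis: mild signal only.
    if s.post_hoc_declared:
        score = max(score, 2.0)

    return min(score, 5.0), findings
```

With floors alone this sketch never exceeds 4.0. The text states a cap of 5.0 but does not say which combination of signals reaches it, so the cap is kept here only for fidelity to the description.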

Score thresholds

Score | Meaning
0 | A hypothesis and a single primary endpoint are pre-specified, with no undeclared analyses.
2 | No clear hypothesis declaration and no registration reference, or an openly labelled exploratory analysis.
4 to 5 | More than three declared primary endpoints, or a subgroup analysis that appears only in the Results.
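
A consumer of the score might map it back to these bands as follows. This is a hypothetical helper: the name interpret_r10 and the short band labels are assumptions, as is the treatment of values between the listed scores, which the table does not define and which are placed in the band below.

```python
def interpret_r10(score: float) -> str:
    """Map an R10 score to a short band label (labels are illustrative)."""
    if not 0.0 <= score <= 5.0:
        raise ValueError("R10 scores are expected to lie between 0 and 5")
    if score >= 4.0:
        return "HARKing markers: many primaries or undeclared subgroup"
    if score >= 2.0:
        return "mild signal: no declared hypothesis, or labelled exploratory analysis"
    return "pre-specified: no undeclared analyses detected"
```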

Why this matters

Pre-specification is what separates a confirmatory test from an exploratory search, and abandoning it undermines the meaning of a p-value. Kerr named and characterised HARKing (presenting a hypothesis formed after seeing the data as though it had been stated in advance) and showed both that the practice is common and that it converts what is really an exploratory finding into a spurious confirmation [1]. Chan and colleagues demonstrated the related phenomenon empirically, finding that the outcomes emphasised in published trials frequently differ from those in their protocols and that the changes favour statistical significance, so a primary endpoint that multiplies or shifts between plan and report is a documented marker of selective presentation [2]. Simmons and colleagues showed how the flexibility to add unplanned analyses, including subgroup comparisons run after the main result, inflates the false-positive rate, which is why a subgroup analysis surfacing only in the Results is treated as fishing unless it was pre-specified [3]. The indicator targets the observable traces of these practices: a missing hypothesis, a proliferation of primaries, and undeclared subgroup analyses. Hardwicke and Wagenmakers set out how preregistration calibrates confidence by fixing the hypotheses and analyses in advance, which is the pre-specification the indicator credits when it finds a registration reference [4]. The COMPare project documented how often reported outcomes diverge from the registered protocol in practice [5]. Statistical-reporting checklists ask reviewers to confirm pre-specification and registration [6]. Research-integrity screening treats undeclared analyses and outcome switching as trustworthiness signals, whether through expert-derived warning signs [7], the INSPECT-SR instrument [8], or reviews of the data-detective toolkit [9].

Limitations

The check operates on the text of the Methods and Results, so it depends on those sections being identified and on the relevant statements appearing in them; a hypothesis or pre-specified subgroup stated in a protocol or supplement but not the main text will be missed, producing a false flag. Detection is keyword-based, so an unconventionally phrased hypothesis is not recognised, and a subgroup or post-hoc mention is matched literally without understanding its role. Counting primary endpoints in the Methods reduces but does not remove ambiguity, since co-primary endpoints can be legitimate when declared with multiplicity control, which the indicator does not assess. A genuinely pre-specified subgroup analysis described only in the Results will be flagged. The indicator detects the markers of HARKing and outcome multiplication, not the intent behind them, so a flag is a prompt to compare the paper against its protocol rather than proof of misconduct.

Theoretical background

R10 rests on the logic of confirmatory inference. A hypothesis test is interpretable only when the hypothesis and the outcome were fixed before the data were seen, because the error rate a p-value claims to control is defined over the single pre-specified comparison. Once the hypothesis is chosen after inspecting the results, or the outcome is selected from many, the effective number of implicit comparisons grows and the nominal error rate no longer holds. HARKing and outcome switching exploit exactly this by presenting the most striking of many possible findings as if it were the one planned, which is why pre-specification in the Methods is the property the indicator seeks. Multiple declared primary endpoints dilute the single pre-committed test into several, raising the family-wise error rate unless the analysis adjusts for multiplicity, and an undeclared subgroup analysis is a comparison drawn from a large unstated set, the classic route to a false positive. The indicator cannot read the registry entry itself, but it does detect a registration reference in the text and treats it as the strongest available evidence of pre-specification. It otherwise falls back on the Methods section as the record of what was planned, and treats divergence between that record and the Results, or an implausible multiplicity within the plan itself, as the signature of a departure from pre-specification. Counting primaries within the Methods rather than the whole paper reflects that pre-specification is a property of the plan, not of how often the plan is later referenced.
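
The multiplicity argument can be made concrete with a short worked calculation. It assumes k independent tests, each run at level alpha; independence is a simplifying assumption made for the illustration, and correlated endpoints inflate the error rate less. The case k = 4 is chosen because it is the smallest count the indicator flags as "more than three" primaries.

```latex
% Family-wise error rate for k independent tests at level \alpha (assumed)
\[
  \mathrm{FWER} = 1 - (1 - \alpha)^{k}
\]
% Four primary endpoints, each tested at \alpha = 0.05
\[
  k = 4,\ \alpha = 0.05:\qquad
  \mathrm{FWER} = 1 - 0.95^{4} \approx 0.185
\]
% Bonferroni adjustment keeps the family-wise rate at or below 0.05
\[
  \alpha_{\text{adj}} = \frac{\alpha}{k} = \frac{0.05}{4} = 0.0125
\]
```

Under these assumptions, four unadjusted primaries give roughly an 18.5% chance of at least one false positive instead of the nominal 5%, which is the dilution the paragraph above describes.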

References

  1. Kerr NL. HARKing: hypothesizing after the results are known. Personality and Social Psychology Review. 1998;2(3):196-217. DOI: 10.1207/s15327957pspr0203_4
  2. Chan AW, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA. 2004;291(20):2457-2465. DOI: 10.1001/jama.291.20.2457
  3. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science. 2011;22(11):1359-1366. DOI: 10.1177/0956797611417632
  4. Hardwicke TE, Wagenmakers EJ. Reducing bias, increasing transparency and calibrating confidence with preregistration. Nature Human Behaviour. 2023;7(1):15-26. DOI: 10.1038/s41562-022-01497-2
  5. Goldacre B, Drysdale H, Dale A, et al. COMPare: a prospective cohort study correcting and monitoring 58 misreported trials in real time. Trials. 2019;20(1):118. DOI: 10.1186/s13063-019-3173-2
  6. Mansournia MA, Collins GS, Nielsen RO, et al. CHecklist for statistical Assessment of Medical Papers: the CHAMP statement. British Journal of Sports Medicine. 2021;55(18):1002-1003. DOI: 10.1136/bjsports-2020-103651
  7. Parker L, Boughton S, Lawrence R, Bero L. Experts identified warning signs of fraudulent research: a qualitative study to inform a screening tool. Journal of Clinical Epidemiology. 2022;151:1-17. DOI: 10.1016/j.jclinepi.2022.07.006
  8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
  9. Crone G, Green CD. Tools of the data detective: a review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861