Uniformly Positive
Checks whether almost every statistical test in a paper came out significant and whether the paper says anything about safety or harms. Real research reports a mix of results and almost always discusses adverse events. A paper where nearly everything works and nothing goes wrong is a red flag for selective reporting or fabrication. The indicator measures the proportion of significant p-values and looks for safety language, raising the score when results are uniformly positive and safety is unmentioned.
Technical description
A contextual screen for the uniformly-positive pattern combined with absent safety reporting. It counts reported p-values and the proportion significant at 0.05, and scores only when there are more than five comparisons. The proportion maps to a base score: at or below 0.70 gives 0; above 0.70 to 0.90 gives 2.0 (predominantly positive); 0.90 to below 1.00 gives 3.0 (suspiciously positive); exactly 1.00 gives 4.0 (all significant). It then scans the text for adverse-event and safety vocabulary (adverse events/reactions, side effects, complications, tolerability, toxicity, withdrawals, discontinuations, harms), using word boundaries so that a stem like harm does not match inside an unrelated word. When no such term is present and the base score is positive, it adds half a point. A further half point can come from the binomial comparison with the literature base rate described under How it works. The total is capped at 5.0.
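The band mapping and safety scan described above can be sketched roughly as follows. This is a minimal illustration, not the indicator's actual code: the function names, the exact regular expression, and the choice to place a proportion of exactly 0.90 in the lower band are all assumptions, since the description does not fix them.

```python
import re

# Hypothetical vocabulary pattern built from the terms listed above.
# \b on both sides keeps "harm" from matching inside "pharmacology" or "harmless".
SAFETY_PATTERN = re.compile(
    r"\b(?:adverse\s+(?:events?|reactions?)|side[\s-]+effects?|complications?"
    r"|tolerability|toxicity|withdrawals?|discontinuations?|harms?)\b",
    re.IGNORECASE,
)


def base_score(significant: int, total: int) -> float:
    """Map the proportion of significant p-values to a base score."""
    if total <= 5:  # five or fewer comparisons: neutral
        return 0.0
    if significant == total:  # exactly 1.00
        return 4.0
    proportion = significant / total
    if proportion > 0.90:  # assumption: exactly 0.90 stays in the lower band
        return 3.0
    if proportion > 0.70:
        return 2.0
    return 0.0


def mentions_safety(text: str) -> bool:
    """True when any safety or adverse-event term appears in the text."""
    return SAFETY_PATTERN.search(text) is not None


def score_with_safety(significant: int, total: int, text: str) -> float:
    """Base score plus half a point when safety language is absent."""
    score = base_score(significant, total)
    if score > 0 and not mentions_safety(text):
        score += 0.5
    return min(score, 5.0)  # cap from the description
```

Under these assumptions, ten significant results out of ten in a text with no safety vocabulary would score 4.5, while the same results in a text mentioning adverse events would stay at 4.0.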
How it works
Layer 2 (contextual): reported p-values are counted, and the number significant at 0.05 gives the proportion positive. With five or fewer comparisons, or none, the score is neutral. Otherwise the proportion sets the base score through the bands. A case-insensitive safety-keyword scan (word-bounded stems) adds half a point only when no safety language is found and the base score is positive. The significant count is also compared with the literature base rate using a binomial test: when the upper-tail probability of the observed count under the 0.84 rate falls below five percent (significantly more positive than the field), a further half point is added. A single finding summarises the proportion and whether safety language was present, with severity rising in the upper band. Metadata records total_comparisons, significant_count, proportion_positive, ae_mentioned, literature_positive_rate (the about-84-percent empirical base rate of positive results, Fanelli 2010, against which the proportion is read), and positive_binomial_p.
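The binomial comparison can be illustrated with a short standard-library sketch. The metadata keys are the ones named above; the helper names, the inclusive tail P(X >= k), and the handling of zero comparisons are assumptions made for illustration only.

```python
from math import comb

LITERATURE_POSITIVE_RATE = 0.84  # base rate quoted above (Fanelli 2010)


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p); the inclusive tail is an assumption."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def binomial_bonus(tail_p: float, alpha: float = 0.05) -> float:
    """Half a point when the count is significantly above the base rate."""
    return 0.5 if tail_p < alpha else 0.0


def build_metadata(total: int, significant: int, ae_mentioned: bool) -> dict:
    """Assemble the metadata fields listed in the description."""
    return {
        "total_comparisons": total,
        "significant_count": significant,
        # Assumption: report 0.0 when there are no comparisons at all.
        "proportion_positive": significant / total if total else 0.0,
        "ae_mentioned": ae_mentioned,
        "literature_positive_rate": LITERATURE_POSITIVE_RATE,
        "positive_binomial_p": binomial_upper_tail(
            significant, total, LITERATURE_POSITIVE_RATE
        ),
    }
```

With a base rate this high, the test has little power on small result sets: ten significant results out of ten give a tail probability of 0.84 to the tenth power, roughly 0.17, so no bonus is added. In this sketch an all-significant paper needs about eighteen comparisons before the tail drops below five percent.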
Why this matters
Genuine studies do not uniformly succeed, and they report observed harms, so an all-positive paper that is silent on safety departs from how real research reads. Chan and colleagues (2004), comparing protocols to publications, found that significant outcomes were far more likely to be fully reported than non-significant ones, so a published all-positive result set is often the visible tip of a larger, partly hidden analysis. The argument that many published findings are unreliable (Ioannidis 2005) rests partly on this excess of positive results relative to power, and the test of excess significance (Ioannidis and Trikalinos 2007) formalises that intuition. The safety dimension reflects that fabricated or heavily selected studies often omit the adverse events any real intervention produces, so the joint pattern (everything significant, nothing harmful) is more suspicious than either alone.
Score thresholds
- 0: A natural mix of significant and non-significant results.
- 2-3: Predominantly or suspiciously positive results, the higher end reflecting absent safety reporting.
- 4-5: Every reported test significant, the top reached when safety and adverse events are never mentioned.
Limitations
The indicator counts all reported p-values without distinguishing primary efficacy tests from baseline or safety analyses, where significance means something different, though including such tests generally lowers the proportion and makes the indicator more conservative. It cannot tell whether a significant result is beneficial or harmful in direction. The safety scan detects the presence of safety vocabulary, not whether safety data were analysed, so a passing mention avoids the bonus while harms reported in unusual terms may be missed; the vocabulary is broad but finite. The proportion thresholds and half-point bonuses are directional heuristics rather than calibrated cut-offs. It overlaps in spirit with the p-value clustering and excess-significance checks (S11, S12), which assess significance statistically; D7 (this indicator) adds the safety dimension and a simple proportion view and is best read alongside them.
References
- Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, Altman DG. (2004). Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials: Comparison of Protocols to Published Articles. JAMA 291(20):2457-2465
- Ioannidis JPA. (2005). Why most published research findings are false. PLoS Medicine 2(8):e124
- Ioannidis JPA, Trikalinos TA. (2007). An exploratory test for an excess of significant findings. Clinical Trials 4(3):245-253
- Fanelli D. (2010). "Positive" Results Increase Down the Hierarchy of the Sciences. PLoS ONE 5(4):e10068
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380