D7 | Statistical analysis | Fabrication Detection | Layer 2 (Contextual)

Uniformly Positive

Checks whether almost every statistical test in a paper came out significant and whether the paper says anything about safety or harms. Real research reports a mix of results, with some endpoints reaching significance and others not, and it almost always discusses adverse events. A paper where nearly everything works and nothing goes wrong is a red flag for selective reporting or fabrication. The indicator measures the proportion of significant p-values and looks for safety language, raising the score when results are uniformly positive and safety is unmentioned. It works on the reported p-values and the article text.

Technical description

D7 is a contextual screen for the uniformly positive pattern typical of selectively reported or fabricated studies, combined with the absence of safety reporting. It counts the reported p-values and computes the proportion that are significant at the 0.05 level, requiring more than five comparisons before judging. The proportion maps to a base score: at or below 0.70 it scores 0, between 0.70 and 0.90 it scores 2.0 as predominantly positive, between 0.90 and 1.00 it scores 3.0 as suspiciously positive, and exactly 1.00 scores 4.0 as all significant. These bands are read against the empirical literature base rate of positive results, about 84 percent across disciplines and higher in the softer sciences [4], so a proportion well above that already-inflated baseline is the signal. The indicator then scans the article text for adverse-event and safety language, using a vocabulary that covers adverse events and reactions, side effects, complications, tolerability, toxicity, withdrawals, discontinuations, and harms. When no such language is present and the base score is positive, it adds half a point. The total is capped at 5.0.
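The band mapping and the no-safety bonus described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: the function names are hypothetical, and three details the description leaves open are assumptions here, namely that significance means p < 0.05 (strict), that each band is closed at its upper edge (so 0.90 scores 2.0), and that the neutral result is represented as None.

```python
from typing import Optional, Sequence

ALPHA = 0.05            # significance level named in the description
MIN_COMPARISONS = 5     # more than five comparisons are required
NO_SAFETY_BONUS = 0.5   # added when no safety language is found
SCORE_CAP = 5.0


def base_score(proportion: float) -> float:
    """Map the proportion of significant p-values to the base score.

    Assumption: bands are closed at the upper edge, so exactly 0.90
    scores 2.0 and exactly 0.70 scores 0.
    """
    if proportion >= 1.0:
        return 4.0  # all significant
    if proportion > 0.90:
        return 3.0  # suspiciously positive
    if proportion > 0.70:
        return 2.0  # predominantly positive
    return 0.0      # natural mix


def d7_band_score(p_values: Sequence[float],
                  has_safety_language: bool) -> Optional[float]:
    """Band score plus no-safety bonus, or None when too few comparisons."""
    n = len(p_values)
    if n <= MIN_COMPARISONS:
        return None  # assumed representation of the neutral score
    significant = sum(1 for p in p_values if p < ALPHA)  # strict: assumption
    score = base_score(significant / n)
    # The bonus applies only when the base score is already positive.
    if score > 0 and not has_safety_language:
        score += NO_SAFETY_BONUS
    return min(score, SCORE_CAP)
```

Under these assumptions, ten significant p-values with no safety language give 4.5, and eight significant out of ten give 2.5.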

How it works

The reported p-values are counted, and the number significant at 0.05 gives the proportion positive. With five or fewer comparisons, or no p-values at all, the indicator returns a neutral score, since a handful of significant results is unremarkable. Otherwise the proportion sets the base score through the bands above. The safety scan applies a case-insensitive set of keywords, with word boundaries on the stems so that, for example, harm is not matched inside an unrelated word. Only when no keyword is found and the base score is positive is the no-safety bonus of half a point added. Beyond the proportion bands, the significant count is tested directly against the base rate. Under a binomial model with success probability equal to the 84 percent literature rate, the indicator computes the upper-tail probability of observing at least this many significant results. When that probability falls below five percent, meaning the article is significantly more positive than the field, a further half point is added, which formalises the excess-significance idea the bands approximate [3, 4]. A single finding summarises the proportion of significant tests and whether safety language was present, with severity rising once the score reaches the upper band. The metadata records the total comparisons, the significant count, the proportion positive, whether adverse-event language was mentioned, the literature base rate of positive results (about 84 percent, Fanelli 2010), and the binomial upper-tail probability of the significant count against that rate.
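The safety scan and the binomial check can be sketched as below. The keyword pattern is a hypothetical reconstruction from the vocabulary listed in the technical description, not the indicator's actual list, and the function names are illustrative. The upper tail is computed exactly with the standard library.

```python
import re
from math import comb

BASE_RATE = 0.84  # literature rate of positive results cited in the text [4]

# Hypothetical pattern; word boundaries keep "harm" from matching inside
# unrelated words such as "pharmacology" or "harmonic".
SAFETY_PATTERN = re.compile(
    r"\b(?:adverse\s+(?:event|reaction|effect)s?|side[\s-]effects?|"
    r"complications?|tolerab\w*|toxic\w*|withdrawals?|withdrew|"
    r"discontinu\w*|harms?|harmful)\b",
    re.IGNORECASE,
)


def mentions_safety(text: str) -> bool:
    """True when any safety or adverse-event keyword appears in the text."""
    return SAFETY_PATTERN.search(text) is not None


def binomial_upper_tail(k: int, n: int, p: float = BASE_RATE) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def excess_bonus(k: int, n: int, alpha: float = 0.05) -> float:
    """Half a point when k of n significant exceeds the base rate at level alpha."""
    return 0.5 if binomial_upper_tail(k, n) < alpha else 0.0
```

One consequence of testing against an 84 percent base rate is that an all-significant article needs at least 18 tests before the tail drops below five percent: 0.84 to the power 17 is roughly 0.052, while 0.84 to the power 18 is roughly 0.043.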

Score thresholds

Score	Meaning
0	A natural mix of significant and non-significant results.
2 to 3	Predominantly or suspiciously positive results, with the higher end reflecting absent safety reporting.
4 to 5	Every reported test significant, the top reached when safety and adverse events are never mentioned.

Why this matters

Genuine studies do not uniformly succeed, and they report the harms they observe, so a paper that is all positive and silent on safety departs from how real research reads. Selective outcome reporting is a documented and widespread distortion: Chan and colleagues, comparing trial protocols against their publications, found that statistically significant outcomes were far more likely to be fully reported than non-significant ones, so a published all-positive result set is often the visible tip of a larger, partly hidden analysis [1]. The broader argument that a large share of published findings are unreliable rests partly on this excess of positive results relative to what power allows [2], and the test of excess significance formalises the same intuition by comparing observed to expected significant counts [3]. The added safety dimension reflects that fabricated or heavily selected studies often omit the adverse events that any real intervention produces, so the joint pattern, everything significant and nothing harmful, is more suspicious than either alone. The base rate of positive results in the published literature is itself high but not uniform, about 84 percent across disciplines [4], so a single article reporting almost only significant tests sits well above even that inflated baseline, and recent forensic re-analyses, scoping reviews, and trustworthiness instruments treat such uniform positivity as a screening signal [5, 6, 7, 8].

Limitations

The check counts all reported p-values without distinguishing primary efficacy tests from baseline comparisons or safety analyses, where significance carries a different meaning, although including such tests generally lowers the proportion and makes the indicator more conservative. It cannot tell whether a p-value is significant in a beneficial or harmful direction. The safety scan detects the presence of safety vocabulary, not whether safety data were actually analysed, so a paper that merely mentions safety in passing avoids the bonus, while one reporting harms in unusual terms may be missed; the vocabulary is broad but finite. The proportion thresholds and the half-point bonus are directional rather than calibrated. This indicator overlaps in spirit with the p-value clustering and excess-significance checks, indicators S11 and S12, which assess significance patterns statistically; D7 adds the safety-reporting dimension and the simple proportion view, so it is best read alongside them rather than as a substitute.

Theoretical background

D7 rests on two regularities of genuine research. First, statistical power is finite, so even a real and effective intervention does not produce significance on every endpoint and subgroup. The expected number of significant results is the sum of the per-test powers, which for realistic studies leaves a visible minority non-significant, so a proportion approaching one is improbable without selection. Selective outcome reporting produces exactly this inflation, because non-significant outcomes are dropped or demoted between protocol and publication, leaving a record that is more positive than the underlying analysis. Second, any intervention that acts on the body produces adverse effects at some rate, so a faithful report contains safety information; its complete absence, especially alongside uniform efficacy, suggests a narrative curated to show only benefit. D7 combines a quantitative reading of the first regularity, the proportion of significant tests, with a lexical reading of the second, the presence of safety language. It adds the two because the failures reinforce each other: a study that is both all-positive and silent on harm fits the profile of selective reporting or fabrication far better than it fits chance. Reading the significant count as a binomial draw against the literature base rate turns the proportion bands into an explicit test: the probability of at least the observed number of significant results when each test carries the field's average chance of being positive. This is the excess-significance logic of Ioannidis and Trikalinos, so a count whose upper tail is small is flagged as beyond what selection at the ordinary rate would produce [3, 4].
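The first regularity can be made concrete with a small calculation. The sketch below uses hypothetical per-test powers and assumes the tests are independent, which real endpoints often are not; it is an illustration of the argument rather than part of the indicator.

```python
from math import prod
from typing import Sequence


def expected_significant(powers: Sequence[float]) -> float:
    """Expected number of significant results: the sum of per-test powers."""
    return sum(powers)


def prob_all_significant(powers: Sequence[float]) -> float:
    """Probability that every test is significant, assuming independence."""
    return prod(powers)


# Hypothetical study: ten true effects, each tested with 80 percent power.
powers = [0.8] * 10
print(expected_significant(powers))  # about 8 of 10 expected significant
print(prob_all_significant(powers))  # about 0.107
```

Even in this favourable case, where every effect is real and well powered, an all-significant result set would be expected only about one time in nine.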

References

[1] Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, Altman DG. Empirical Evidence for Selective Reporting of Outcomes in Randomized Trials: Comparison of Protocols to Published Articles. JAMA. 2004;291(20):2457-2465. DOI: 10.1001/jama.291.20.2457
[2] Ioannidis JPA. Why most published research findings are false. PLoS Medicine. 2005;2(8):e124. DOI: 10.1371/journal.pmed.0020124
[3] Ioannidis JPA, Trikalinos TA. An exploratory test for an excess of significant findings. Clinical Trials. 2007;4(3):245-253. DOI: 10.1177/1740774507079441
[4] Fanelli D. "Positive" Results Increase Down the Hierarchy of the Sciences. PLoS ONE. 2010;5(4):e10068. DOI: 10.1371/journal.pone.0010068
[5] Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
[6] Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
[7] Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
[8] Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861