Carlisle-Stouffer IPD
Checks whether the two treatment groups of a trial look randomly allocated. When participants are randomised, comparing the groups on each baseline variable, such as age or weight, gives p-values that scatter evenly between 0 and 1, because any differences are pure chance. Fabricated trials betray themselves in two opposite ways: groups drawn from clearly different populations give too many significant differences, and groups whose means were forced to match give p-values bunched near 1. The indicator runs these baseline comparisons on the individual-patient data (IPD), then tests the resulting p-values for the expected even scatter.
Technical description
D31 applies the Carlisle method of baseline-distribution testing, combined with Stouffer's method of pooling p-values, to the individual-patient data (IPD) rather than to the baseline p-values printed in a paper. It locates a group or treatment column by name, splitting the rows into two arms; when no such column is found it splits by row index as a proxy and reduces its confidence. It requires at least five numeric baseline columns and at least ten rows per group, runs a Welch two-sample t-test on each baseline column between the arms, and collects the p-values. Under genuine randomisation these p-values are distributed Uniform[0,1], so the indicator tests that uniformity directly with the Kolmogorov-Smirnov statistic, pools the p-values with Stouffer's combined Z, and checks the proportion that are significant and the mean. It also applies the Cramer-von Mises test against Uniform[0,1], which integrates the squared deviation across the whole distribution and is more sensitive than Kolmogorov-Smirnov to the tail departures fabrication produces, and uses the more significant of the two for scoring. Departures in either direction, an excess of significant differences or a suspicious absence of them, raise the score.
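The arm split and the per-column Welch tests described above can be sketched as follows. This is a minimal illustration, not the indicator's actual code: it assumes the IPD arrives as a pandas DataFrame, and the names `GROUP_KEYWORDS`, `find_group_column`, `split_arms`, and `baseline_pvalues` are hypothetical. The minimum-size checks (five numeric columns, ten rows per arm) are left to the caller.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Keywords taken from the description; the matching rule itself is an assumption.
GROUP_KEYWORDS = ("group", "treatment", "arm", "allocation")


def find_group_column(df):
    """Return the first column whose name contains a group keyword, else None."""
    for col in df.columns:
        if any(key in str(col).lower() for key in GROUP_KEYWORDS):
            return col
    return None


def split_arms(df):
    """Split rows into two arms. Returns (arm_a, arm_b, proxy_used)."""
    col = find_group_column(df)
    if col is not None:
        levels = df[col].dropna().unique()[:2]  # first two levels define the arms
        if len(levels) == 2:
            arm_a = df[df[col] == levels[0]].drop(columns=[col])
            arm_b = df[df[col] == levels[1]].drop(columns=[col])
            return arm_a, arm_b, False
    # No usable group column: halve the rows by index as a proxy.
    half = len(df) // 2
    return df.iloc[:half], df.iloc[half:], True


def baseline_pvalues(arm_a, arm_b):
    """Welch t-test p-value for each numeric column shared by the two arms."""
    pvals = []
    for col in arm_a.select_dtypes(include="number").columns:
        x, y = arm_a[col].dropna(), arm_b[col].dropna()
        if x.nunique() <= 1 and y.nunique() <= 1:
            continue  # constant in both arms, so there is nothing to test
        p = stats.ttest_ind(x, y, equal_var=False).pvalue  # Welch variant
        if np.isfinite(p):
            pvals.append(p)
    return np.array(pvals)
```

Substring matching on names is deliberately naive here, which also illustrates how a baseline variable with a treatment-like name could be picked up as the group column.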
How it works
A group column is detected from name keywords such as group, treatment, arm, or allocation, and the first two of its levels define the arms; absent such a column the rows are halved by index and a one-point proxy penalty is applied at the end. Welch t-tests on the numeric baseline columns yield the p-values, with columns constant in both arms skipped, and at least five p-values are required. The more significant of the Kolmogorov-Smirnov and Cramer-von Mises tests against Uniform[0,1] adds 2.5 when its p-value is below 0.01 or 1.5 when below 0.05. Stouffer's combined Z, the sum of the inverse-normal transforms of the p-values divided by the square root of their count, is computed with the p-values clipped away from 0 and 1; an absolute Z above 3 adds 1.5, since under randomisation Z is standard normal and such a value is strongly directional. A proportion of significant comparisons above thirty percent adds 1.0, a proportion below one tenth of a percent across at least ten tests (in effect, none at all) adds 1.5, and a mean p-value more than 0.20 from one half adds 0.5. The total is capped at 5.0 after any proxy penalty, and each triggered check emits a finding. The metadata records the p-value count, the proportion significant, the proportion of high p-values, the mean, the Kolmogorov-Smirnov statistic and p-value, the Cramer-von Mises statistic and p-value, the Stouffer Z, the group column, and whether the proxy split was used.
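The scoring rules above can be sketched as a single function over the collected p-values. This is a hedged illustration rather than the indicator's implementation: `score_pvalues` is a hypothetical name, the 0.05 significance cut-off and the clipping constant `eps` are assumptions (the text does not state them), and the proxy penalty is assumed to subtract one point from the score before the cap.

```python
import numpy as np
from scipy import stats


def score_pvalues(pvals, proxy_split=False, alpha=0.05, eps=1e-10):
    """Score baseline p-values for departure from Uniform[0,1] (0 to 5)."""
    p = np.asarray(pvals, dtype=float)
    if p.size < 5:
        return None  # too few comparisons to score

    # Uniformity: take the more significant of KS and Cramer-von Mises.
    ks_p = stats.kstest(p, "uniform").pvalue
    cvm_p = stats.cramervonmises(p, "uniform").pvalue
    uniform_p = min(ks_p, cvm_p)

    # Stouffer's Z: sum of inverse-normal transforms over sqrt(count),
    # with p-values clipped away from 0 and 1 to keep the transform finite.
    z = stats.norm.ppf(np.clip(p, eps, 1 - eps)).sum() / np.sqrt(p.size)

    prop_sig = np.mean(p < alpha)  # assumed significance cut-off

    score = 0.0
    if uniform_p < 0.01:
        score += 2.5
    elif uniform_p < 0.05:
        score += 1.5
    if abs(z) > 3:
        score += 1.5
    if prop_sig > 0.30:
        score += 1.0
    if p.size >= 10 and prop_sig < 0.001:
        score += 1.5
    if abs(p.mean() - 0.5) > 0.20:
        score += 0.5
    if proxy_split:
        score -= 1.0  # assumption: the proxy penalty lowers the score
    return min(max(score, 0.0), 5.0)
```

For example, twelve p-values spread between 0.90 and 0.99 trigger the uniformity, Stouffer, zero-significant, and mean checks together, so the total reaches the 5.0 cap.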
Score thresholds
| Score | Meaning |
|---|---|
| 0 to 1 | Baseline comparisons scatter as randomisation would produce. |
| 2 to 3 | The baseline p-values depart from uniform in one or more ways. |
| 4 to 5 | Strong non-uniformity, an excess of significant differences or of forced non-significance, consistent with non-randomised or fabricated arms. |
Why this matters
The behaviour of baseline comparisons is one of the most established forensic signals for trial integrity. Carlisle showed, in the analysis that helped expose a large body of fabricated anaesthesia trials, that the distributions of baseline summary statistics under genuine randomisation are predictable and that their departure from that expectation quantifies the implausibility of the data [1]. Pooling the individual p-values into a single statistic is Stouffer's method, introduced to combine independent tests into one standard-normal Z, which gives a sensitive directional summary of whether the comparisons lean toward significance or toward forced agreement [2]. Carlisle later applied these ideas to the individual-patient data of submitted trials and found duplicated and distributionally impossible records among the features that distinguished false datasets, which is the setting this indicator targets by testing the raw IPD rather than the reported table [3]. Running the comparisons on the IPD is more powerful than reading a paper's baseline p-values, because it cannot be evaded by selective reporting, and combining the Kolmogorov-Smirnov uniformity test with the Stouffer Z catches both the scatter departures and the systematic directional shifts that the two characteristic modes of fabrication produce. The same baseline-distribution reasoning underlies large-scale screening of the published literature [4] and formal tests for fraudulent, over-balanced baseline data [5], and recent scoping reviews and trustworthiness instruments place baseline-balance checks among the standard screens for problematic trials [6, 7, 8].
Limitations
The test depends on correctly identifying the treatment arms, and when no group column is found the row-index proxy can pair unrelated halves of the data, which is why a confidence penalty is applied and the proxy result should be read cautiously. With only a handful of baseline columns the uniformity tests have low power, so a fabricated trial with few variables can pass. Genuine structure can mimic the signal: stratified or matched designs reduce baseline scatter legitimately, and a trial with strong baseline imbalance for non-fraudulent reasons can raise the significant-proportion check. The Welch t-test compares means only, so fabrication that matches means while distorting variances is not directly caught here. The keyword match for the group column can select the wrong column if a baseline variable shares a treatment-like name. Reported baseline p-values in the paper text are handled by the table-level baseline indicators, so D31 focuses on the reconstructed comparisons across the IPD.
Theoretical background
D31 rests on the distributional consequence of randomisation. If allocation is truly random, then for any baseline variable the two arms are samples from the same distribution, the null hypothesis of the balance test holds exactly, and the test's p-value is by construction Uniform[0,1]. Across many baseline variables the collection of p-values is therefore an approximately uniform sample, and two summaries capture its expected shape: the Kolmogorov-Smirnov statistic measures the largest gap between the observed and uniform cumulative distributions, detecting any departure including clustering, while Stouffer's transform maps each p-value through the inverse normal, sums the results, and divides by the square root of their count, converting a uniform sample into a single standard-normal statistic with mean zero. Fabrication disturbs this in complementary ways. Constructing arms from genuinely different populations drives the p-values toward zero, lowering the Kolmogorov-Smirnov fit, raising the significant proportion, and sending the Stouffer Z strongly negative. Forcing the arms to match, a common over-correction, drives the p-values toward one, again breaking uniformity, suppressing the significant proportion to zero, and sending the Stouffer Z strongly positive. Because the Stouffer Z aggregates the direction and magnitude of every comparison, it is the more powerful detector of these systematic shifts, while the Kolmogorov-Smirnov test guards against subtler non-uniformities that leave the mean near one half; using both, and reading the IPD rather than the reported table, gives the indicator its breadth. The Cramer-von Mises test sharpens this further: by integrating the squared distance across the whole unit interval rather than taking the single largest gap as Kolmogorov-Smirnov does, it is more sensitive to the tail pile-ups the two fabrication modes create, so taking the more significant of the two improves detection without inflating false positives on genuinely scattered p-values [9].
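The three statistics named above can be written out as follows. The notation is introduced here for illustration: k is the number of baseline p-values, p_i the i-th p-value, F_k their empirical cumulative distribution, and Φ the standard normal cumulative distribution. The standard-normal claim for Z assumes the p-values are independent, which correlated baseline variables only approximate.

```latex
% Stouffer's combined Z: negative when p-values pile up near 0,
% positive when they pile up near 1.
Z = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \Phi^{-1}(p_i),
\qquad Z \sim N(0,1) \ \text{if}\ p_1,\dots,p_k \overset{\text{iid}}{\sim} U[0,1]

% Kolmogorov-Smirnov: the single largest gap from the uniform CDF.
D_k = \sup_{0 \le t \le 1} \left| F_k(t) - t \right|

% Cramer-von Mises: the squared gap integrated over the unit interval.
W_k^2 = k \int_0^1 \left( F_k(t) - t \right)^2 \, dt
```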
References
1. Carlisle JB. The analysis of 168 randomised controlled trials to test data integrity. Anaesthesia. 2012;67(5):521-537. DOI: 10.1111/j.1365-2044.2012.07128.x
2. Stouffer SA, Suchman EA, DeVinney LC, Star SA, Williams RM Jr. The American Soldier: Adjustment During Army Life (Volume 1). Princeton, NJ: Princeton University Press; 1949. https://www.worldcat.org/title/american-soldier-adjustment-during-army-life/oclc/317585
3. Carlisle JB. False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia. 2021;76(4):472-479. DOI: 10.1111/anae.15263
4. Carlisle JB. Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia. 2017;72(8):944-952. DOI: 10.1111/anae.13938
5. Proschan MA, Shaw PA. Diagnosing fraudulent baseline data in clinical trials. PLoS ONE. 2020;15(10):e0239121. DOI: 10.1371/journal.pone.0239121
6. Crone G, Green CD. Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology. 2025;35(3):359-380. DOI: 10.1177/09593543241311861
7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology. 2021;136:189-202. DOI: 10.1016/j.jclinepi.2021.05.012
8. Wilkinson J, Heal C, Antoniou GA, et al. A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology. 2024;175:111512. DOI: 10.1016/j.jclinepi.2024.111512
9. Csörgő S, Faraway JJ. The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):221-234. DOI: 10.1111/j.2517-6161.1996.tb02077.x