Carlisle-Stouffer IPD
Applies the Carlisle-Stouffer method of baseline balance testing to the full individual-patient data (IPD), not just to the baseline p-values reported in the paper.
Technical description
Under genuine randomisation, comparing two trial arms on each baseline variable gives p-values distributed Uniform[0,1], because any differences are chance. Fabrication betrays itself in two opposite ways: arms from clearly different populations give too many significant differences, and arms whose means were forced to match give p-values bunched near 1. D31 applies the Carlisle method of baseline-distribution testing with Stouffer's pooling of p-values to the individual-patient data (IPD) rather than the printed baseline table. It locates a group column, runs a Welch t-test on each numeric baseline column between the arms, and tests the collected p-values for uniformity with the Kolmogorov-Smirnov statistic and Stouffer's combined Z. A Cramer-von Mises test against Uniform[0,1], more sensitive than Kolmogorov-Smirnov in the tails, is also applied, and the more significant of the two drives the uniformity score.
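The uniformity check on the collected p-values can be sketched with the Python standard library alone. This is a minimal illustration, not D31's implementation: the function name `ks_uniform`, the example p-values, and the asymptotic p-value formula (Kolmogorov series with Stephens' small-sample correction) are assumptions of the sketch, and the Cramer-von Mises test is left out.

```python
import math

def ks_uniform(p_values):
    """One-sample Kolmogorov-Smirnov test against Uniform[0,1].

    Returns (D, approx_p). The p-value is an asymptotic approximation
    chosen for this sketch, not necessarily what D31 uses.
    """
    xs = sorted(p_values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one p-value")
    # For Uniform[0,1] the theoretical CDF is F(x) = x.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    d = max(d_plus, d_minus)
    # Stephens' correction, then the Kolmogorov tail series.
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.3:
        return d, 1.0  # tail probability is within 1e-5 of 1 here
    tail = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                     for k in range(1, 101))
    return d, min(1.0, max(0.0, tail))

# Hypothetical inputs: evenly spread p-values versus p-values bunched near 1.
even = [0.05 + 0.10 * i for i in range(10)]
bunched = [0.90 + 0.01 * i for i in range(10)]

d_even, p_even = ks_uniform(even)
d_bunched, p_bunched = ks_uniform(bunched)
print(f"even:    D={d_even:.2f}, p={p_even:.3g}")
print(f"bunched: D={d_bunched:.2f}, p={p_bunched:.3g}")
```

The evenly spread set gives a small D and no evidence against uniformity, while the set bunched near 1 gives a D close to 0.9 and a very small p-value, which is the forced-balance signature described above.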
How it works
Layer 2 (contextual): a group column is detected from keywords (group, treatment, arm, allocation), and its first two levels define the arms; if none is found, the rows are halved by index and a one-point proxy penalty applies. A Welch t-test on each numeric baseline column yields one p-value; columns that are constant in both arms are skipped, and at least five p-values are required. The score is built from the following components:
- Uniformity: the more significant of the Kolmogorov-Smirnov and Cramer-von Mises tests against Uniform[0,1] adds 2.5 when its p-value is below 0.01, or 1.5 when it is below 0.05.
- Stouffer's combined Z: the sum of the inverse-normal transforms of the p-values divided by the square root of their count, with p-values clipped away from 0 and 1, adds 1.5 when its absolute value exceeds 3.
- Significant proportion: a proportion of significant p-values above 30% adds 1.0; a proportion below 0.1% across at least ten tests adds 1.5.
- Mean: a mean p-value more than 0.20 from 0.5 adds 0.5.
The score is capped at 5.0 after the proxy penalty is applied. Metadata records the p-value count, the proportions of significant and of high p-values, the mean, the Kolmogorov-Smirnov statistic and p-value, the Stouffer Z, the Cramer-von Mises statistic and p-value, the group column, and whether the proxy split was used.
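The scoring rules above can be sketched as follows, using only the standard library. Several details are assumptions made for illustration and are not confirmed by this description: the names `stouffer_z` and `d31_score`, the clipping bound `EPS`, the 0.05 cut-off for "significant", returning 0.0 when fewer than five p-values are available, and the reading that the proxy penalty subtracts one point. The uniformity p-values `ks_p` and `cvm_p` are taken as inputs, and the values passed in the examples are hypothetical.

```python
import math
from statistics import NormalDist, mean

EPS = 1e-10  # hypothetical bound; the text only says p-values are clipped from 0 and 1

def stouffer_z(p_values, eps=EPS):
    """Summed inverse-normal transforms divided by sqrt(count)."""
    inv = NormalDist().inv_cdf
    zs = [inv(min(max(p, eps), 1.0 - eps)) for p in p_values]
    return sum(zs) / math.sqrt(len(zs))

def d31_score(p_values, ks_p, cvm_p, used_proxy=False, alpha=0.05):
    n = len(p_values)
    if n < 5:
        return 0.0  # assumption: too few comparisons, nothing is scored
    score = 0.0
    # The more significant of the two uniformity tests drives this component.
    uniformity_p = min(ks_p, cvm_p)
    if uniformity_p < 0.01:
        score += 2.5
    elif uniformity_p < 0.05:
        score += 1.5
    if abs(stouffer_z(p_values)) > 3.0:
        score += 1.5
    prop_sig = sum(p < alpha for p in p_values) / n  # alpha is an assumption
    if prop_sig > 0.30:
        score += 1.0
    elif prop_sig < 0.001 and n >= 10:
        score += 1.5
    if abs(mean(p_values) - 0.5) > 0.20:
        score += 0.5
    if used_proxy:
        score -= 1.0  # assumption: the one-point proxy penalty lowers the score
    return min(5.0, max(0.0, score))  # cap applied after the penalty

# Hypothetical p-value sets and hypothetical uniformity-test results.
plausible = [0.03, 0.14, 0.22, 0.35, 0.41, 0.56, 0.63, 0.71, 0.88, 0.97]
bunched = [0.90 + 0.01 * i for i in range(10)]

score_plausible = d31_score(plausible, ks_p=0.9, cvm_p=0.9)
score_bunched = d31_score(bunched, ks_p=1e-7, cvm_p=1e-6)
score_proxy = d31_score(bunched, ks_p=0.03, cvm_p=0.2, used_proxy=True)
print(score_plausible, score_bunched, score_proxy)
```

With these inputs the well-scattered set scores 0.0, the set bunched near 1 triggers every component and is capped at 5.0, and the proxy-split example loses one point before the cap under the assumption stated above.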
Why this matters
Baseline-comparison behaviour is one of the most established forensic signals for trial integrity: Carlisle's distribution testing helped expose a large body of fabricated trials, and applying it to the raw IPD cannot be evaded by selective reporting of baseline p-values. Stouffer's method pools the comparisons into a single standard-normal Z that sensitively detects systematic directional shift, catching both an excess of significant differences (non-randomised or differently drawn arms) and a suspicious absence of them (forced balance).
Score thresholds
- 0-1: Baseline comparisons scatter as randomisation would produce
- 2-3: The baseline p-values depart from uniform in one or more ways
- 4-5: Strong non-uniformity, with an excess of significant differences or of forced non-significance, consistent with non-randomised or fabricated arms
Limitations
The test depends on correctly identifying the arms, and the row-index proxy used when no group column is found can pair unrelated halves, which is why a confidence penalty applies. With few baseline columns the uniformity tests have low power. Stratified or matched designs reduce baseline scatter legitimately, and genuine baseline imbalance can raise the significant-proportion check. The Welch t-test compares means only, so fabrication that matches means while distorting variances is not directly caught. The keyword match can select the wrong column if a baseline variable shares a treatment-like name. Reported baseline p-values in the paper text are handled by the table-level baseline indicators; D31 focuses on the reconstructed comparisons across the IPD.
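The means-only limitation can be illustrated with a small sketch. The helper `welch_t` and the two arms are hypothetical; the arms are constructed so that the means match exactly while the variances differ by a factor of 25.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean, arm a
    vb = variance(b) / len(b)  # squared standard error of the mean, arm b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical arms: identical means, very different spread.
arm_a = [8.0, 9.0, 10.0, 11.0, 12.0]
arm_b = [0.0, 5.0, 10.0, 15.0, 20.0]

t_stat, df = welch_t(arm_a, arm_b)
var_ratio = variance(arm_b) / variance(arm_a)
print(f"t = {t_stat:.3f}, df = {df:.2f}, variance ratio = {var_ratio:.1f}")
```

A t statistic of zero corresponds to a two-sided p-value of 1 whatever the degrees of freedom, so this column looks perfectly balanced to the Welch test even though the spread differs sharply between arms. A variance-comparison test would be needed to see it.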
References
- Carlisle JB. (2012). The analysis of 168 randomised controlled trials to test data integrity. Anaesthesia
- Stouffer SA, Suchman EA, DeVinney LC, Star SA, Williams RM Jr. (1949). The American Soldier: Adjustment During Army Life (Volume 1). Princeton University Press
- Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia
- Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
- Proschan MA, Shaw PA. (2020). Diagnosing fraudulent baseline data in clinical trials. PLoS ONE 15(10):e0239121
- Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
- Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
- Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
- Csörgő S, Faraway JJ. (1996). The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society Series B 58(1):221-234