ResAIKit
Research Integrity Toolkit
D31 | Statistical analysis | Fabrication Extended | Layer 2 (Contextual)

Carlisle-Stouffer IPD

Applies the Carlisle-Stouffer method of baseline-balance testing to the full IPD, not just to the baseline p-values reported in the paper.

Technical description

Under genuine randomisation, comparing two trial arms on each baseline variable gives p-values distributed Uniform[0,1], because any differences arise by chance alone. Fabrication betrays itself in two opposite ways: arms drawn from clearly different populations give too many significant differences, and arms whose means were forced to match give p-values bunched near 1. D31 applies the Carlisle method of baseline-distribution testing, with Stouffer's pooling of p-values, to the individual-patient data (IPD) rather than to the printed baseline table. It locates a group column, runs a Welch t-test between the arms on each numeric baseline column, and tests the collected p-values for uniformity with the Kolmogorov-Smirnov statistic and Stouffer's combined Z. A Cramér-von Mises test against Uniform[0,1], more sensitive than Kolmogorov-Smirnov in the tails, is also applied, and the more significant of the two uniformity tests drives the uniformity score.
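As a rough illustration of the steps above, the following Python sketch collects Welch p-values column by column and tests them against Uniform[0,1]. It assumes NumPy and SciPy are available and that each arm has already been extracted as a numeric array of shape (patients, columns). The function names `baseline_pvalues` and `uniformity_tests`, the dictionary keys, and the simulated data are hypothetical and are not taken from the D31 implementation.

```python
import numpy as np
from scipy import stats


def baseline_pvalues(arm_a, arm_b):
    """Welch t-test p-value per baseline column (arrays are patients x columns)."""
    pvals = []
    for j in range(arm_a.shape[1]):
        a, b = arm_a[:, j], arm_b[:, j]
        if np.ptp(a) == 0 and np.ptp(b) == 0:
            continue  # constant in both arms, so the column is skipped
        # equal_var=False selects Welch's unequal-variance t-test
        pvals.append(stats.ttest_ind(a, b, equal_var=False).pvalue)
    return np.array(pvals)


def uniformity_tests(pvals):
    """Kolmogorov-Smirnov and Cramér-von Mises tests against Uniform[0,1]."""
    ks = stats.kstest(pvals, "uniform")
    cvm = stats.cramervonmises(pvals, "uniform")
    return {
        "ks_stat": float(ks.statistic),
        "ks_p": float(ks.pvalue),
        "cvm_stat": float(cvm.statistic),
        "cvm_p": float(cvm.pvalue),
        # the more significant of the two tests drives the uniformity score
        "uniformity_p": float(min(ks.pvalue, cvm.pvalue)),
    }


# Simulated data (an assumption for illustration): arm B is a copy of arm A,
# an extreme case of forced balance in which every mean matches exactly.
rng = np.random.default_rng(0)
arm_a = rng.normal(loc=50.0, scale=10.0, size=(40, 8))
arm_b = arm_a.copy()
pvals = baseline_pvalues(arm_a, arm_b)  # all p-values sit at 1
result = uniformity_tests(pvals)        # both tests reject uniformity
```

Because every p-value piles up at 1, the empirical distribution is as far from uniform as it can be, which is the forced-balance signature described above.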

How it works

Layer 2 (contextual). A group column is detected from keywords (group, treatment, arm, allocation), and its first two levels define the arms; if no such column is found, the rows are split in half by index and a one-point proxy penalty applies. Welch t-tests on the numeric baseline columns yield the p-values; columns constant in both arms are skipped, and at least five p-values are required. The more significant of the Kolmogorov-Smirnov and Cramér-von Mises tests against Uniform[0,1] adds 2.5 when its p-value is below 0.01, or 1.5 when it is below 0.05. Stouffer's combined Z, the sum of the inverse-normal transforms of the p-values divided by the square root of their count, with p-values clipped away from 0 and 1, adds 1.5 when its absolute value exceeds 3. A proportion of significant p-values above 30% adds 1.0, a proportion below 0.1% across at least ten tests adds 1.5, and a mean p-value more than 0.20 from 0.5 adds 0.5. The score is capped at 5.0 after the proxy penalty. Metadata records the p-value count, the proportions of significant and of high p-values, the mean, the Kolmogorov-Smirnov statistic and p-value, the Stouffer Z, the Cramér-von Mises statistic and p-value, the group column, and whether the proxy split was used.
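The detection and scoring rules above can be sketched in plain Python. Several details here are assumptions rather than documented behaviour: keyword matching is taken to be a case-insensitive substring match with the first matching column winning, the one-point proxy penalty is taken to be subtracted from the score, and the score is floored at zero. The names `find_group_column` and `d31_score` and their parameters are hypothetical, the summary statistics are expected to be computed upstream, and the five-p-value minimum is not enforced here.

```python
GROUP_KEYWORDS = ("group", "treatment", "arm", "allocation")


def find_group_column(column_names):
    """Return the first column name containing a keyword, or None."""
    for name in column_names:
        # ASSUMPTION: case-insensitive substring match, first hit wins
        if any(key in name.lower() for key in GROUP_KEYWORDS):
            return name
    return None  # the caller would fall back to halving the rows by index


def d31_score(uniformity_p, stouffer_z, prop_significant, mean_p,
              n_tests, proxy_split=False):
    """Combine the summary statistics into a score between 0 and 5."""
    score = 0.0
    # more significant of the KS and Cramér-von Mises p-values
    if uniformity_p < 0.01:
        score += 2.5
    elif uniformity_p < 0.05:
        score += 1.5
    if abs(stouffer_z) > 3:
        score += 1.5
    if prop_significant > 0.30:
        score += 1.0  # excess of significant differences
    elif prop_significant < 0.001 and n_tests >= 10:
        score += 1.5  # suspicious absence of significant differences
    if abs(mean_p - 0.5) > 0.20:
        score += 0.5
    if proxy_split:
        score -= 1.0  # ASSUMPTION: the proxy penalty lowers the score
    return min(max(score, 0.0), 5.0)  # ASSUMPTION: floored at 0; capped at 5


group_col = find_group_column(["age", "Treatment_Arm", "bmi"])
clean = d31_score(0.50, 0.2, 0.05, 0.50, n_tests=12)
forced = d31_score(0.001, 4.0, 0.0, 0.90, n_tests=12)
```

In this sketch `clean` triggers no rule and scores 0.0, while `forced` triggers the uniformity, Stouffer, absent-significance and mean rules and is capped at 5.0.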

Why this matters

Baseline-comparison behaviour is one of the most established forensic signals for trial integrity: Carlisle's distribution testing helped expose a large body of fabricated trials, and because D31 applies it to the raw IPD, it cannot be evaded by selective reporting of baseline p-values. Stouffer's method pools the comparisons into a single Z that is standard normal under genuine randomisation and sensitively detects systematic directional shift, catching both an excess of significant differences (non-randomised or differently drawn arms) and a suspicious absence of them (forced balance).
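A minimal sketch of Stouffer's pooling, using only the Python standard library, shows how the sign of Z separates the two failure modes. The clipping width `eps`, the function name `stouffer_z`, and the three p-value lists are illustrative assumptions, not values taken from D31.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_z(pvals, eps=1e-10):
    """Sum of inverse-normal transforms divided by the square root of the count."""
    # ASSUMPTION: eps is an illustrative clipping width that keeps the
    # inverse-normal transform finite when a p-value is exactly 0 or 1
    clipped = [min(max(p, eps), 1.0 - eps) for p in pvals]
    inv = NormalDist().inv_cdf
    return sum(inv(p) for p in clipped) / sqrt(len(clipped))


# Made-up p-values for illustration only.
forced_balance = [0.90, 0.95, 0.85, 0.99, 0.92, 0.97, 0.88, 0.94]
imbalanced = [0.001, 0.02, 0.004, 0.03, 0.01, 0.002, 0.04, 0.008]
symmetric = [0.1, 0.3, 0.5, 0.7, 0.9]

z_forced = stouffer_z(forced_balance)  # well above +3
z_imbalanced = stouffer_z(imbalanced)  # well below -3
z_symmetric = stouffer_z(symmetric)    # close to 0
```

With this sign convention, p-values bunched near 1 push Z strongly positive and an excess of small p-values pushes it strongly negative, so a threshold on the absolute value of Z covers both cases.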

Score thresholds

0-1: Baseline comparisons scatter as randomisation would produce
2-3: The baseline p-values depart from uniform in one or more ways
4-5: Strong non-uniformity (an excess of significant differences or of forced non-significance), consistent with non-randomised or fabricated arms

Limitations

The test depends on correctly identifying the arms, and the row-index proxy used when no group column is found can pair unrelated halves, which is why a confidence penalty applies. With few baseline columns the uniformity tests have low power. Stratified or matched designs reduce baseline scatter legitimately, and genuine baseline imbalance can trigger the significant-proportion check. The Welch t-test compares means only, so fabrication that matches means while distorting variances is not directly caught. The keyword match can select the wrong column if a baseline variable has a treatment-like name. Reported baseline p-values in the paper text are handled by the table-level baseline indicators; D31 focuses on the comparisons reconstructed from the IPD.
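The means-only limitation can be seen in a small constructed case. In this sketch, which assumes NumPy and SciPy and uses made-up values, both arms share a mean of 50 but arm B is four times as spread out as arm A.

```python
import numpy as np
from scipy import stats

# Made-up baseline variable: same mean in both arms, very different spread.
arm_a = np.linspace(45.0, 55.0, 40)
arm_b = np.linspace(30.0, 70.0, 40)

# Welch's t-test sees only the (matching) means, so its p-value is near 1.
welch_p = stats.ttest_ind(arm_a, arm_b, equal_var=False).pvalue

# The sample variances differ sixteen-fold, which the t-test does not register.
variance_ratio = arm_b.var(ddof=1) / arm_a.var(ddof=1)
```

A column like this contributes a p-value close to 1 to the collection, indistinguishable from any other well-matched mean, even though the spread of the two arms is plainly different.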

References

  1. Carlisle JB. (2012). The analysis of 168 randomised controlled trials to test data integrity. Anaesthesia
  2. Stouffer SA, Suchman EA, DeVinney LC, Star SA, Williams RM Jr. (1949). The American Soldier: Adjustment During Army Life (Volume 1). Princeton University Press
  3. Carlisle JB. (2021). False individual patient data and zombie randomised controlled trials submitted to Anaesthesia. Anaesthesia
  4. Carlisle JB. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72(8):944-952
  5. Proschan MA, Shaw PA. (2020). Diagnosing fraudulent baseline data in clinical trials. PLoS ONE 15(10):e0239121
  6. Crone G, Green CD. (2025). Tools of the data detective: A review of statistical methods to detect data and result anomalies in psychology. Theory & Psychology 35(3):359-380
  7. Bordewijk EM, Li W, van Eekelen R, Wang R, Showell M, Mol BW, van Wely M. (2021). Methods to assess research misconduct in health-related research: A scoping review. Journal of Clinical Epidemiology 136:189-202
  8. Wilkinson J, Heal C, Antoniou GA, et al. (2024). A survey of experts to identify methods to detect problematic studies: stage 1 of the INveStigating ProblEmatic Clinical Trials in Systematic Reviews project. Journal of Clinical Epidemiology 175:111512
  9. Csörgő S, Faraway JJ. (1996). The exact and asymptotic distributions of Cramér-von Mises statistics. Journal of the Royal Statistical Society Series B 58(1):221-234