Methodology & Validation

How ResAIKit detects research integrity risks across 145 indicators, and what the scientific literature tells us about detection accuracy.

Four-Layer Detection Architecture

ResAIKit processes each manuscript through four progressively sophisticated layers. Lower layers are deterministic and fast; higher layers use external APIs and large language models for deeper analysis. Each indicator contributes a 0-5 score; scores are weighted and aggregated into a 0-100 composite per module.
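The aggregation step can be illustrated with a minimal sketch. It assumes a plain weighted mean rescaled to 0-100; the function name, the indicator weights, and the normalization are hypothetical and are not taken from ResAIKit itself.

```python
from typing import Mapping

MAX_INDICATOR_SCORE = 5  # each indicator is scored 0-5, as described above


def composite_score(scores: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Weighted mean of 0-5 indicator scores, rescaled to 0-100.

    Assumption: a simple weighted mean; the real aggregation may differ.
    """
    for name, value in scores.items():
        if not 0 <= value <= MAX_INDICATOR_SCORE:
            raise ValueError(f"{name} score {value} is outside 0-5")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(scores[name] * weights[name] for name in scores)
    return 100.0 * weighted / (MAX_INDICATOR_SCORE * total_weight)


# Hypothetical scores and weights for three text indicators.
text_scores = {"C1": 4, "C3": 2, "E1": 0}
text_weights = {"C1": 1.0, "C3": 1.0, "E1": 2.0}
print(composite_score(text_scores, text_weights))  # 30.0
```

Dividing by the maximum attainable weighted score keeps the composite on the same 0-100 scale regardless of how many indicators a module contains.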

Layer 1 — Deterministic

Regex, dictionaries, FFT, GRIM, Benford

Pattern matching, digit distribution analysis, JPEG error level analysis. Well-defined rules produce no false positives when their assumptions hold (GRIM: for integer-valued data, mean*N must be an integer within rounding tolerance). Execution: under 1 second.

Layer 2 — NLP

spaCy POS tagging, dependency parsing, Hunspell

Syntactic variation analysis, voice consistency, spelling/perplexity metrics. Complements Layer 1 with linguistic features. Execution: 2-5 seconds.

Layer 3 — External APIs

CrossRef, PubMed, Semantic Scholar, OpenAlex

Citation verification against academic databases. Detects hallucinated references (DOIs that do not exist, author/journal mismatches). Execution: 10-30 seconds.
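A minimal sketch of the citation check, using only the Crossref REST API (`https://api.crossref.org/works/{doi}`, which returns HTTP 404 for an unregistered DOI). The function names, the shape of the `cited` dictionary, and the exact-match comparison are assumptions for illustration; a production check would need fuzzy matching and journal abbreviations.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import List, Optional

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref_record(doi: str, timeout: float = 10.0) -> Optional[dict]:
    """Return Crossref metadata for a DOI, or None if the DOI is not registered."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)["message"]
    except urllib.error.HTTPError as err:
        if err.code == 404:  # DOI does not exist in Crossref
            return None
        raise


def check_reference(cited: dict, record: Optional[dict]) -> List[str]:
    """List the mismatches between a cited reference and a Crossref record."""
    if record is None:
        return ["doi_not_found"]
    problems = []
    families = {a.get("family", "").lower() for a in record.get("author", [])}
    if cited["first_author"].lower() not in families:
        problems.append("author_mismatch")
    journals = {t.lower() for t in record.get("container-title", [])}
    if cited["journal"].lower() not in journals:
        problems.append("journal_mismatch")
    return problems


# Offline example with a hand-written record in Crossref's field layout.
record = {
    "author": [{"family": "Brown"}, {"family": "Heathers"}],
    "container-title": ["Social Psychological and Personality Science"],
}
cited = {"first_author": "Brown",
         "journal": "Social Psychological and Personality Science"}
print(check_reference(cited, record))  # []
```

Separating the network call from the comparison keeps the matching logic testable without hitting the API.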

Layer 4 — LLM Analysis

Claude, GPT, Gemini, DeepSeek

Semantic coherence, factual consistency, terminology appropriateness. Most sensitive layer; requires human review of flagged passages. Execution: 30-60 seconds.

Module Detection Capabilities

Text Analysis (31 indicators)

The text module combines stylistic fingerprinting with forensic linguistics. Indicators C1-C10 measure stylistic features associated with LLM-generated text: exaggerated generality (C1), artificial discourse structure (C3), weak methodological anchoring (C4), and promotional phrasing (C10). Indicators E1-E7 detect forensic artifacts: Unicode anomalies, encoding traces, and rhythm variation patterns characteristic of specific AI models. The F-series identifies model-specific fingerprints (ChatGPT, Claude, Gemini, Grok, Perplexity) using n-gram preference dictionaries and token distribution analysis.

Published benchmarks for AI-text detection in academic writing report accuracy rates of 70-85% for well-trained classifiers, with higher false-positive rates for non-native English text (Liang et al., 2023). ResAIKit mitigates this bias through multi-indicator corroboration: no single indicator triggers a high-risk flag; patterns must converge across stylistic, forensic, and fingerprint dimensions.
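The corroboration rule can be sketched as follows. The threshold of 60, the requirement of two converging dimensions, and the three output labels are hypothetical values chosen for illustration, not ResAIKit's actual settings.

```python
from typing import Mapping

DIMENSIONS = ("stylistic", "forensic", "fingerprint")


def risk_level(dimension_scores: Mapping[str, float],
               threshold: float = 60.0,
               min_converging: int = 2) -> str:
    """Flag high risk only when several dimensions are elevated together."""
    elevated = [d for d in DIMENSIONS
                if dimension_scores.get(d, 0.0) >= threshold]
    if len(elevated) >= min_converging:
        return "high"
    if elevated:
        return "review"  # a single elevated dimension is not enough on its own
    return "low"


# A strong stylistic signal alone does not produce a high-risk flag.
print(risk_level({"stylistic": 90, "forensic": 10, "fingerprint": 5}))
# Convergence across two dimensions does.
print(risk_level({"stylistic": 90, "forensic": 75, "fingerprint": 5}))
```

Under this rule, stylistic features that are common in non-native English writing cannot raise the flag to high risk unless forensic or fingerprint evidence agrees.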

Image Analysis (47 indicators)

Image indicators span generic forensics (I1-I10: ELA, FFT, noise consistency, clone detection, metadata analysis) and domain-specific checks for Western blots (W1-W4), microscopy (M1-M10), charts (G1-G12), and tables (T1-T14). Error Level Analysis (I1) detects JPEG recompression artifacts from image editing. ORB feature matching (I6) identifies copy-move forgery. GRIM/SPRITE tests (G12) verify that reported means and standard deviations are mathematically possible given sample sizes.

Image manipulation detection methods have been extensively validated in the digital forensics literature. ELA-based methods achieve over 90% detection for spliced images with different compression histories. Western blot duplication detection via SSIM-based band comparison has been validated against known retracted papers. However, AI-generated images present a growing challenge that current indicators address through texture anomaly detection (I2, M5, M9) rather than provenance verification.

Statistical Analysis (65 indicators)

Statistical indicators apply three categories of checks. Consistency checks (S1-S20) verify arithmetic, CI boundaries, GRIM/GRIMMER/SPRITE/DEBIT test consistency, terminal digit preference, and Benford distribution. Methodological checks (R1-R12) assess design-test compatibility, sample size reporting, multiple comparison correction, and pre-registration fidelity. Fabrication indicators (D1-D34) analyze individual participant data for implausible patterns: absent correlations, excessive Gaussianity, inlier clustering, and cross-variable rule violations against 6 domain-specific reference dictionaries.
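As an illustration of the Benford distribution check, the sketch below computes a chi-squared statistic comparing observed first digits with Benford's expected frequencies. The function names are hypothetical, and the choice of a chi-squared test (8 degrees of freedom, 5% critical value about 15.507) is an assumption; other goodness-of-fit measures are also used in practice.

```python
import math
from collections import Counter
from typing import Iterable

# Benford's law: P(first digit = d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value: float) -> int:
    """Leading non-zero digit of a non-zero number."""
    mantissa = f"{abs(float(value)):.15e}"  # e.g. '4.720000000000000e+02'
    return int(mantissa[0])


def benford_chi_square(values: Iterable[float]) -> float:
    """Chi-squared statistic of observed first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())


CRITICAL_5_PERCENT = 15.507  # chi-squared, 8 degrees of freedom

powers_of_two = [2 ** k for k in range(1, 101)]   # known to follow Benford
uniform_digits = list(range(100, 1000))           # every first digit equally common
print(benford_chi_square(powers_of_two) < CRITICAL_5_PERCENT)
print(benford_chi_square(uniform_digits) > CRITICAL_5_PERCENT)
```

Benford's law only applies to data spanning several orders of magnitude, so a check like this is informative for quantities such as counts or concentrations, not for bounded scales.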

GRIM and GRIMMER are deterministic checks for integer-valued data such as Likert responses: mean*N (the sum of the values) and SD^2*(N-1) + N*mean^2 (the sum of the squared values) must both be integers within rounding tolerance. Reported values that fail these checks cannot arise from such data with the stated N. Statcheck accurately recovers p-values from test statistics in approximately 95% of cases for t-tests, F-tests, and chi-squared tests. The composite D10 fabrication score uses weighted aggregation across 4 sub-domains with multiplicative bonuses for critical pattern combinations that co-occur in known fabrication cases.
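A minimal sketch of both checks for integer-valued data, assuming values are reported to a fixed number of decimals. The GRIM part follows Brown and Heathers (2017). The GRIMMER part is a simplified version that relies on the identity SD^2*(N-1) + (sum)^2/N = sum of squared values plus a parity check; the function names are hypothetical and the published tests include further refinements.

```python
import math


def _integers_between(low: float, high: float, eps: float = 1e-9) -> range:
    """All integers in the closed interval [low, high], tolerant of float noise."""
    return range(math.ceil(low - eps), math.floor(high + eps) + 1)


def grim_sums(mean: float, n: int, decimals: int = 2) -> range:
    """Integer sums that would round to the reported mean for sample size n."""
    half_unit = 0.5 * 10 ** -decimals
    return _integers_between((mean - half_unit) * n, (mean + half_unit) * n)


def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM: some integer sum must reproduce the reported mean."""
    return len(grim_sums(mean, n, decimals)) > 0


def grimmer_consistent(mean: float, sd: float, n: int, decimals: int = 2) -> bool:
    """Simplified GRIMMER: some integer sum of squares must reproduce the SD."""
    if n < 2:
        raise ValueError("a sample SD needs n >= 2")
    half_unit = 0.5 * 10 ** -decimals
    sd_low, sd_high = max(sd - half_unit, 0.0), sd + half_unit
    for total in grim_sums(mean, n, decimals):
        base = total * total / n
        candidates = _integers_between(sd_low ** 2 * (n - 1) + base,
                                       sd_high ** 2 * (n - 1) + base)
        # For integers, x^2 and x share parity, so the two sums must as well.
        if any(sum_sq % 2 == total % 2 for sum_sq in candidates):
            return True
    return False


print(grim_consistent(5.19, 28))           # False: no integer sum gives 5.19
print(grim_consistent(5.18, 28))           # True: 145 / 28 rounds to 5.18
print(grimmer_consistent(3.00, 1.58, 5))   # True: e.g. the values 1, 2, 3, 4, 5
print(grimmer_consistent(3.00, 1.60, 5))   # False: no integer sum of squares fits
```

Both functions accept any value inside the rounding interval, so they do not depend on whether the authors rounded half up or half to even.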

Known Limitations

False positives in non-native English text. Stylistic indicators (C-series) may flag legitimate writing by non-native speakers as AI-like. Always corroborate with forensic (E-series) and fingerprint (F-series) indicators before concluding.

Evolving AI models. Fingerprint indicators are trained on known model outputs (ChatGPT 4, Claude 3.5/4, Gemini 2.0). New model versions may produce different patterns. Fingerprint dictionaries require periodic updates.

AI-generated images. Current image indicators detect manipulation of real images. Fully AI-generated figures (DALL-E, Midjourney) require different detection approaches. I2 (FFT analysis) and M9 (resampling) provide partial coverage.

Sample size dependency. Statistical indicators require sufficient reported values. A manuscript reporting only means without SDs, or with fewer than 10 test statistics, provides limited signal for the S-series and D-series indicators.

Not a verdict tool. ResAIKit is designed for triage and documentation, not automated decisions. All flags require expert human review. The tool records evidence for the editorial workflow; it does not replace that workflow.

Key References

Brown, N. J. L., & Heathers, J. A. J. (2017). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. Social Psychological and Personality Science, 8(4), 363-369. DOI: 10.1177/1948550616673876
Nuijten, M. B., et al. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, 48(4), 1205-1226. DOI: 10.3758/s13428-015-0664-2
Benford, F. (1938). The Law of Anomalous Numbers. Proceedings of the American Philosophical Society, 78(4), 551-572.
Farid, H. (2009). Image forgery detection. IEEE Signal Processing Magazine, 26(2), 16-25. DOI: 10.1109/MSP.2008.931079
Bik, E. M., et al. (2016). The Prevalence of Inappropriate Image Duplication in Biomedical Research Publications. mBio, 7(3), e00809-16. DOI: 10.1128/mBio.00809-16
Carlisle, J. B. (2017). Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia, 72(8), 944-952. DOI: 10.1111/anae.13938
Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245-253. DOI: 10.1177/1740774507079441
Liang, W., et al. (2023). GPT detectors are biased against non-native English writers. Patterns, 4(7), 100779. DOI: 10.1016/j.patter.2023.100779
