Coherence Analysis
Uses a large language model to evaluate whether the argument flow is genuinely coherent or just superficially connected, catching the subtle logical gaps that AI-generated text often contains.
Technical description
Sends text to an LLM with a prompt designed to evaluate deep coherence: does each paragraph logically follow from the previous one? Are claims supported by the evidence presented? Do the conclusions follow from the arguments? The LLM identifies specific logical gaps, non-sequiturs, and places where surface-level connectives mask weak logical connections.
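A minimal sketch of how such a prompt might be assembled. The function name `build_coherence_prompt`, the instruction wording, and the text markers are illustrative assumptions; the tool's actual prompt is not shown here.

```python
# Hypothetical prompt builder: the wording and delimiters are assumptions,
# chosen only to mirror the three questions described above.
QUESTIONS = [
    "Does each paragraph logically follow from the previous one?",
    "Are claims supported by the evidence presented?",
    "Do the conclusions follow from the arguments?",
]


def build_coherence_prompt(text: str) -> str:
    """Wrap the text under review in an evaluation prompt."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(QUESTIONS, start=1))
    return (
        "Evaluate the deep coherence of the text between the markers.\n"
        f"{numbered}\n"
        "List each logical gap, non-sequitur, or connective that masks a "
        "weak logical link, quoting the passage concerned.\n"
        "--- BEGIN TEXT ---\n"
        f"{text}\n"
        "--- END TEXT ---"
    )
```

Keeping the text between explicit markers makes it easier for the evaluating model to separate the instructions from the material being judged.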
How it works
Layer 4 (LLM-powered): Sends the text to a language model with a structured rubric that separates coherence into four dimensions, judged independently: local flow (do adjacent sentences connect through shared subjects and clear reference), global structure (does the whole argument build rather than drift or merely restate its opening), contradiction (do statements conflict), and transition validity (do connective words signal a real logical relation). The model returns a sub-score and flagged passages per dimension; these are combined into one score, weighted toward global structure and contradictions, and the per-dimension breakdown is kept alongside it. Runs only when a model is configured.
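The combination step could look roughly like the sketch below. The weight values, the 0 to 5 sub-score scale, and the names `WEIGHTS`, `DimensionResult`, and `combine` are assumptions made for illustration; the description above only states that the combined score leans toward global structure and contradictions.

```python
from dataclasses import dataclass, field

# Hypothetical weights. Only the emphasis on global structure and
# contradiction comes from the description; the exact values are assumed.
WEIGHTS = {
    "local_flow": 0.15,
    "global_structure": 0.35,
    "contradiction": 0.35,
    "transition_validity": 0.15,
}


@dataclass
class DimensionResult:
    score: float  # assumed scale: 0 (coherent) to 5 (incoherent)
    flagged: list[str] = field(default_factory=list)


def combine(results: dict[str, DimensionResult]) -> dict:
    """Weighted average of the sub-scores, with the breakdown kept alongside."""
    missing = WEIGHTS.keys() - results.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[name] * results[name].score for name in WEIGHTS)
    return {
        "score": round(total, 2),
        "breakdown": {name: results[name].score for name in WEIGHTS},
        "flagged": {name: results[name].flagged for name in WEIGHTS},
    }
```

With these assumed weights, a text that reads smoothly sentence to sentence but drifts and contradicts itself still lands high on the scale, which is the case this layer is meant to catch.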
Why this matters
Fluency is the easiest thing for a language model to produce and the hardest to trust. Machine text tends to get the local surface right while drifting at the global level, joining ideas with the right transition words but without the logical relation those words imply. Surface measures catch some of this, but the most convincing machine text defeats them by getting the surface right, so a model's own semantic judgement is needed to catch the well-connected but hollow argument that statistical measures rate as coherent.
Score thresholds
- 0-1: Strong logical flow with well-supported arguments
- 2-3: Generally coherent with some weak transitions
- 4-5: Surface coherence masking logical disconnections
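The bands above can be applied with a small lookup like the following sketch. The function name `interpret` is hypothetical, and the handling of fractional scores is an assumption: the documented ranges cover whole numbers only, so the cut-offs here sit halfway between adjacent bands.

```python
def interpret(score: float) -> str:
    """Map a 0-5 coherence score to its band description."""
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    # Assumption: fractional scores fall into the nearest band, so the
    # cut-offs are placed at 1.5 and 3.5.
    if score < 1.5:
        return "Strong logical flow with well-supported arguments"
    if score < 3.5:
        return "Generally coherent with some weak transitions"
    return "Surface coherence masking logical disconnections"
```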
Limitations
Requires a configured LLM provider. The evaluating model's own limitations affect assessment quality. Very long documents may need to be chunked, potentially missing coherence issues that span chunk boundaries. Domain-specific logical conventions may not be recognized.
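One common way to soften the chunking limitation is to let consecutive chunks share a few paragraphs, so that a break in the argument near a boundary is visible in at least one chunk. The sketch below is an assumption about how that could be done, not a description of this tool's chunking; `chunk_paragraphs` and its parameters are hypothetical, and overlap does not recover problems that span distant parts of a document.

```python
def chunk_paragraphs(
    text: str, max_paragraphs: int = 8, overlap: int = 2
) -> list[str]:
    """Split text into chunks of paragraphs that overlap at the boundaries."""
    if not 0 <= overlap < max_paragraphs:
        raise ValueError("overlap must be >= 0 and smaller than max_paragraphs")
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) <= max_paragraphs:
        return ["\n\n".join(paragraphs)] if paragraphs else []
    step = max_paragraphs - overlap
    chunks = []
    for start in range(0, len(paragraphs), step):
        chunks.append("\n\n".join(paragraphs[start:start + max_paragraphs]))
        if start + max_paragraphs >= len(paragraphs):
            break  # the last paragraph is already covered
    return chunks
```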
References
- Liu W, Strube M. (2025). Joint modeling of entities and discourse relations for coherence assessment. Proceedings of EMNLP 2025
- Schroeder K, Wood-Doughty Z. (2024). Can you trust LLM judgments? Reliability of LLM-as-a-judge. arXiv preprint arXiv:2412.12509
- Kim J, Huang Z, McKeown K. (2024). Threads of subtlety: detecting machine-generated texts through discourse motifs. Proceedings of ACL 2024
- Sheng Z, Zhang T, Jiang C, Kang D. (2024). BBScore: a Brownian bridge based metric for assessing text coherence. Proceedings of AAAI 2024
- Barzilay R, Lapata M. (2008). Modeling local coherence: an entity-based approach. Computational Linguistics