ResAIKit
Research Integrity Toolkit
L4-Coherence | Text analysis | LLM | Layer 4

Coherence Analysis

Uses a language model to judge whether an argument holds together. It scores four dimensions of coherence (local flow, global structure, internal contradiction, and transition validity) and combines them into a weighted score with a per-dimension breakdown.

Technical description

L4-Coherence is the semantic, model-based counterpart to the deterministic local-coherence check: where that check measures coherence through countable surface features, this one asks a language model to read the argument and report where the reasoning breaks down. It runs at Layer 4, only when a model is configured and the text is at least 100 words, and sends the document, truncated to the first 8000 characters, to the model with a structured rubric. The rubric separates coherence into four dimensions, each scored 0 to 5, which are aggregated by a fixed weighting into a single 0 to 5 score with the per-dimension breakdown retained.
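The gating and truncation described above can be sketched as follows. This is a minimal illustration, not the toolkit's actual code: the function name `prepare_text`, the constants, and the use of whitespace splitting to count words are all assumptions.

```python
from typing import Optional

MIN_WORDS = 100   # from the description: shorter texts are not analysed
MAX_CHARS = 8000  # from the description: only the first 8000 characters are sent


def prepare_text(text: str, model_configured: bool) -> Optional[str]:
    """Return the text to send to the model, or None if the check is skipped.

    Hypothetical helper: word counting by whitespace is an assumption,
    the source only says "at least 100 words".
    """
    if not model_configured:
        return None
    if len(text.split()) < MIN_WORDS:
        return None
    # Truncate to the opening of the document; anything later is never judged.
    return text[:MAX_CHARS]
```

Under these assumptions a 99-word text is skipped, and a long document is judged on its first 8000 characters only, which is the truncation limit noted under Limitations.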

How it works

The model is given the text with a rubric that asks it to score four dimensions independently, so it judges each on its own rather than forming a single overall impression.

Local flow asks whether adjacent sentences connect through shared subjects and clear reference, or jump between topics with unclear pronouns. Global structure asks whether the whole argument builds, each part following from the last, or drifts between sections and ends by restating its opening rather than earning a conclusion. Contradiction asks whether any statements in the text conflict. Transition validity asks whether connective words such as however, therefore and moreover signal a real logical relation or paper over content that does not follow.

Aggregation. The four dimension scores sᵢ are combined by a weighted mean with weights wᵢ of 0.20 for local flow, 0.35 for global structure, 0.25 for contradiction and 0.20 for transition validity, summing only over the dimensions the model returns: score = Σ wᵢ sᵢ / Σ wᵢ. Global structure and contradiction carry the most weight, since those are where machine text drifts while keeping its local surface smooth. When the model returns a single overall score instead of the per-dimension rubric, that score is used directly, preserving compatibility with simpler outputs. The reported score is clamped to the 0 to 5 range.
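A minimal sketch of this aggregation, assuming the model's output has already been parsed into a mapping of dimension names to scores. The key names, the function name `aggregate`, and the choice to clamp only the final score (not each sub-score) are assumptions; the weights and the formula come from the description above.

```python
from typing import Mapping, Optional

# Weights as stated in the description; the key names are hypothetical.
WEIGHTS = {
    "local": 0.20,
    "global": 0.35,
    "contradiction": 0.25,
    "transitions": 0.20,
}


def aggregate(dimensions: Mapping[str, Optional[float]],
              overall: Optional[float] = None) -> Optional[float]:
    """Weighted mean over the dimensions actually returned, clamped to 0..5."""
    present = {k: v for k, v in dimensions.items()
               if k in WEIGHTS and v is not None}
    if present:
        total_weight = sum(WEIGHTS[k] for k in present)
        score = sum(WEIGHTS[k] * v for k, v in present.items()) / total_weight
    elif overall is not None:
        # Fallback: the model returned a single overall score only.
        score = overall
    else:
        return None
    return max(0.0, min(5.0, score))
```

For example, sub-scores of 1, 4, 3 and 2 for local, global, contradiction and transitions give 0.20·1 + 0.35·4 + 0.25·3 + 0.20·2 = 2.75. If only global (4) and contradiction (2) are returned, the weights renormalise: (1.4 + 0.5) / 0.6 ≈ 3.17.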

Findings. Each passage the model flags is labelled with the dimension it fails (Local coherence, Global structure, Contradiction, Transition relation). The per-dimension sub-scores and a short summary are returned in the metadata, so a reader can see whether the problem is local choppiness, a hollow overall argument, or transitions that do not hold.
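One way the findings and metadata might be assembled is sketched below. The labels are the ones named above; everything else (the `Finding` class, `build_report`, the dictionary keys, and the shape of the model's flags) is hypothetical and shown only to make the structure concrete.

```python
from dataclasses import dataclass
from typing import Dict, List, Mapping

# Labels as listed in the description; the dimension keys are assumed.
LABELS = {
    "local": "Local coherence",
    "global": "Global structure",
    "contradiction": "Contradiction",
    "transitions": "Transition relation",
}


@dataclass
class Finding:
    label: str    # which dimension the passage fails
    passage: str  # the flagged text


def build_report(flags: List[Mapping[str, str]],
                 sub_scores: Mapping[str, float],
                 summary: str) -> Dict[str, object]:
    """Attach a dimension label to each flagged passage and keep sub-scores."""
    findings = [
        Finding(LABELS[f["dimension"]], f["passage"])
        for f in flags
        if f.get("dimension") in LABELS  # ignore flags with unknown dimensions
    ]
    return {
        "findings": findings,
        "metadata": {"sub_scores": dict(sub_scores), "summary": summary},
    }
```

Keeping the sub-scores in the metadata rather than folding them into the findings mirrors the description: the findings point at passages, while the breakdown shows which kind of failure dominates.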

Score thresholds

Score    Meaning
0 to 1   The argument develops cleanly, each step supported by the last.
2 to 3   Some thematic jumps, weakly supported transitions, or a local rough patch.
4 to 5   Repeated gaps, contradictions, circular reasoning, or sections that connect only superficially.

Why this matters

Fluency is the easiest thing for a language model to produce and the hardest signal to trust. Coherence works at two levels: a local level, where adjacent sentences connect through shared subjects and reference, and a global level, where the whole argument has a real structure. Machine text tends to get the local surface right while drifting at the global level, joining ideas with the right transition words but without the underlying relation those words imply. Surface measures catch some of this, but the most convincing machine text defeats them by getting the surface right, which is why recent detection work has moved toward discourse-level and global coherence features that read the structure rather than the words. L4-Coherence brings a model's own semantic judgement to bear on that gap, separating the dimensions so the kind of failure is visible.

Limitations

L4-Coherence depends on a language model, so it is slower and costlier than the Layer 1 checks and its judgement carries the model's own biases. Research on using models as judges is clear that the same fluency that fools a reader can satisfy the judging model, and that judges miss drops in long-form coherence in particular; separating the rubric into distinct dimensions reduces that risk without removing it. Its findings are assessments rather than proofs and should inform a human reading. Results vary between runs and between models. The text is truncated to 8000 characters, so a very long document is judged on its opening alone, and the score reflects a single rubric pass rather than a consensus across models.

Theoretical background

L4-Coherence builds on the neural and discourse-based coherence literature. The local-versus-global split follows entity-based and discourse-relation models of coherence, in which local coherence is captured by entity transitions between adjacent sentences and global coherence by the rhetorical structure of the whole text; recent work that jointly models entities and discourse relations reports gains from combining the two, which motivates scoring them as separate dimensions. The choice to weight global structure and contradiction most heavily reflects the finding that discourse-level features separate machine from human text more reliably than token-level ones. The use of a model as the judge sits within the LLM-as-judge literature. Its results on rubric design, in particular that criterion separation reduces halo effects but does not eliminate verbosity and position biases, shape the rubric and the abstention-leaning interpretation of its output.

References

  1. Liu W, Strube M. Joint modeling of entities and discourse relations for coherence assessment. Proceedings of EMNLP 2025. 2025. https://arxiv.org/abs/2509.04182
  2. Schroeder K, Wood-Doughty Z. Can you trust LLM judgments? Reliability of LLM-as-a-judge. arXiv preprint arXiv:2412.12509. 2024. https://arxiv.org/abs/2412.12509
  3. Kim J, Huang Z, McKeown K. Threads of subtlety: detecting machine-generated texts through discourse motifs. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). 2024. https://aclanthology.org/2024.acl-long.300/
  4. Sheng Z, Zhang T, Jiang C, Kang D. BBScore: a Brownian bridge based metric for assessing text coherence. Proceedings of the AAAI Conference on Artificial Intelligence. 2024. https://ojs.aaai.org/index.php/AAAI/article/view/29879
  5. Barzilay R, Lapata M. Modeling local coherence: an entity-based approach. Computational Linguistics. 2008;34(1):1-34. DOI: 10.1162/coli.2008.34.1.1