ResAIKit
Research Integrity Toolkit
Back to the encyclopedia
E5Text analysisForensicLayer 1 (Deterministic)

Copy Artifacts

Detects residual formatting and conversational markers left behind when text is copied from large language model (LLM) interfaces. Includes Markdown markup, LLM conversation prefixes, Perplexity-style citations, LaTeX remnants, and Claude Extensible Markup Language (XML) thinking blocks.

Technical description

E5 scans for ten categories of copy-paste artifacts that directly indicate LLM origin. These are not statistical signals, they are physical evidence of text having passed through a chat interface, code notebook, or structured LLM output format. The indicator runs at Layer 1 using only pattern matching matching.

How it works

1. Residual Markdown. Seven sub-patterns: fenced code blocks (``), bold markers (**text**), heading markers (# Heading), bullet lists (- item / * item), Markdown hyperlinks ([text](url)), and inline code (code`). Code blocks inside fenced regions are excluded from other Markdown checks to avoid double-flagging. Each pattern contributes independently: code blocks +1.5, bold +1.5, headings +1.0, bullets +0.75, links +0.75, inline code +0.5.

2. LLM conversation prefixes. Anchored at the start of text. English patterns: "Sure", "Certainly", "Of course", "I'd be happy", "Here's/Here is", "Let me", "Great question", "Absolutely". Romanian patterns: "Desigur", "Cu placere", "Sigur", "Iata", "Permiteti-mi", "Buna intrebare", "Bineinteles", "Fara indoiala". Each match contributes +2.0. These are the strongest single artifact signal, no human author begins an academic paper with "Certainly! Here's a comprehensive analysis..."

3. Perplexity-style citations. Numeric citations ([1], [2]) without a corresponding References/Bibliography section heading. Common in Perplexity AI output where inline citations are generated but no full reference list follows. Contributes +1.0.

4. Residual LaTeX. LaTeX command patterns (\textbf{}, \cite{}, \ref{}, \begin{}, \end{}, \section{}, \subsection{}) and inline math delimiters ($...$). Contributes +0.75 for any LaTeX found.

5. Residual XML / Claude artifacts. Three sub-patterns: <thinking>...</thinking> blocks (+2.5, Claude's internal reasoning), <artifact>...</artifact> blocks (+2.0, Claude's structured output), and generic XML and HyperText Markup Language (HTML) tags (+1.0). Previously flagged thinking and artifact blocks are excluded from generic tag detection to avoid double-counting.

6. Numbered Markdown headings. The pattern ## 1. Title (numbered headings with Markdown markers) is characteristic of LLM-generated structured outlines. Contributes +0.5.

7. LLM closing phrases. Detects conversational wrap-up phrases characteristic of chatbot output: "I hope this helps", "Let me know if you need", "Feel free to", "Please don't hesitate", "Happy to help/clarify/elaborate", "If you have any questions". Each match contributes +1.5. These phrases are definitive chatbot artifacts, no human author ends an academic paper with "I hope this helps!"

8. LLM self-reference. Detects the text referring to itself as an AI: "As an AI", "As a language model", "I'm an AI", "As an artificial intelligence", "I was trained/designed/created to". Each match contributes +2.0. This is the strongest possible single artifact signal.

9. Overly helpful presentational phrases. Detects chatbot scaffolding language used mid-text: "Here is a...", "Here are the...", "Below is a...", "The following is a...", "Let me break/explain/clarify/elaborate/summarize/outline/walk you through...". Each match contributes +0.5, capped at +2.0.

Score thresholds

Score Meaning
0 to 1 Clean text with no detectable LLM interface artifacts.
2 to 3 One or two minor artifacts (Markdown formatting, lone LaTeX commands). Could be copy-paste from a code notebook or word processor.
4 to 5 Multiple or critical artifacts present: conversation prefixes, thinking blocks, or heavy Markdown formatting. Definitive evidence of LLM interface origin.

Limitations

The Markdown detection uses pattern matching patterns that may produce false positives on text that legitimately uses Markdown-like formatting (e.g., software documentation, README files, technical writing with code examples). The indicator is designed for academic prose and may over-flag technical documents.

The LLM prefix check is anchored at the start of text only. A prefix that appears after a title block or abstract will not be detected. The prefix list covers the most common English and Romanian patterns but may miss less common variants or other languages.

LaTeX detection targets explicit command patterns and inline math. Text that uses LaTeX as its native formatting (e.g., arXiv papers submitted in LaTeX source form) will be heavily flagged. This is intentional for the academic use case: E5 is designed to detect LaTeX in otherwise plain-text submissions, which indicates LLM copy-paste. Documents that are legitimately LaTeX-formatted should be submitted in a different format or have this indicator disabled.

References

  1. Aiersilan A. Detecting verbatim LLM copy-paste in homework (SteganoPrompt). arXiv preprint arXiv:2605.16336. 2026. https://arxiv.org/abs/2605.16336
  2. Gloaguen T, Jovanovic N, Staab R, Vechev M. Discovering spoofing attempts on language model watermarks. ICML. 2025.