Authors Guild test: Pangram and Grammarly spot human writing
The Authors Guild ran ten articles from 2020–2022 through AI detectors; Pangram and Grammarly labeled every text human, while Sidekicker flagged all ten as mostly AI-generated.
TL;DR
- The Authors Guild ran ten articles from 2020–2022 through AI detectors; Pangram and Grammarly labeled every text human, while Sidekicker flagged all ten as mostly AI-generated.
- Pangram and Grammarly correctly labeled all ten Guild articles as human-written, while Sidekicker marked every item as mostly AI, and ZeroGPT produced inconsistent, sometimes high AI scores.
- Originality.ai "also performed well." The sample consisted of ten articles published between 2020 and 2022, before generative AI went mainstream.
The Authors Guild ran ten human-written articles published between 2020 and 2022 through five AI-detection tools and found stark differences: Pangram and Grammarly identified every article as human, while Sidekicker flagged every article as mostly AI-generated, two of them at 100 percent.
What did the Authors Guild test find?
Pangram and Grammarly correctly labeled all ten Guild articles as human-written, while Sidekicker marked every item as mostly AI, and ZeroGPT produced inconsistent, sometimes high AI scores. Originality.ai "also performed well." The sample consisted of ten articles published between 2020 and 2022, before generative AI went mainstream.
The test table shows wide variance across individual pieces. For example, Sidekicker scored "Antitrust Litigation & Publications" at 100.0 percent AI while ZeroGPT scored that same article at 5.3 percent and Originality.ai at 0.0 percent. In other rows, Sidekicker returned 85.0 percent for "Obscenity Petitions Dismissed," while Pangram and Grammarly returned 0.0 percent for that piece.
How did each detector behave and why do they differ?
Pangram and Grammarly produced no false positives in this sample: both returned 0.0 percent AI for most of the ten articles and identified every text as human. Originality.ai often returned 0.0 percent as well, and is described as having "performed well." ZeroGPT returned nonzero percentages for all ten articles (for example, 66.0 percent on the "Obituary: Joan Didion" piece and 64.5 percent on "Banned Books Club"), including high AI scores for clearly human texts. Sidekicker flagged every article as mostly AI-generated, with values like 96.0 percent for "Copyright Claims Board" and 100.0 percent for both "Antitrust Litigation & Publications" and "Erdrich Pulitzer Prize."
Pangram CEO Max Spero described his detector as a black box but argued that models betray themselves through uniformity: "Language models do give themselves away through uniformity, though, especially in how they build arguments. Humans write with far more variety." Pangram, he indicated, aims to minimize false positives. The Authors Guild cautioned that many professionally written texts share statistical patterns with model output because language models were trained on that kind of writing, so detection is not straightforward.
Why it matters
Detectors can produce false positives that have real consequences: "False positives can cost authors their contracts and their reputations," the Guild warns. Publishers and institutions that act on detector output risk unfairly penalizing skilled writers whose concise, polished prose resembles model output. The Guild recommends disclosing methods and giving authors a chance to defend themselves, since the tools change constantly and accuracy cannot be assumed.
The test also shows a trade-off between minimizing false positives and catching AI-written material. Tools that return zero AI for human texts may still miss AI-generated content, while tools that aggressively flag content risk false accusations. That split matters for publishers, editors, and authors negotiating contracts or enforcing disclosure rules.
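The false-positive side of that trade-off can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the scores the article attributes to ZeroGPT and Sidekicker (the first and third score columns of the results table, matched by the values quoted in the text), and it assumes that a text counts as "flagged" when its score exceeds a cutoff. The 25, 50, and 75 percent cutoffs and the function names are assumptions made for illustration, not thresholds published by the Guild or by any vendor.

```python
# Scores (% AI) for the ten human-written articles, in table order.
# Only the two detectors whose columns the article text identifies are included.
SCORES = {
    "ZeroGPT": [14.3, 5.3, 40.7, 28.1, 64.5, 26.5, 66.0, 76.3, 50.6, 18.1],
    "Sidekicker": [85.0, 100.0, 79.0, 96.0, 71.0, 71.0, 82.0, 100.0, 92.0, 96.0],
}


def false_positive_rate(scores, threshold=50.0):
    """Share of human-written texts scored above `threshold` percent AI.

    The threshold is a hypothetical cutoff, not a vendor setting.
    """
    flagged = sum(1 for score in scores if score > threshold)
    return flagged / len(scores)


def rates_by_threshold(scores, thresholds=(25.0, 50.0, 75.0)):
    """False-positive rate at several hypothetical cutoffs."""
    return {t: false_positive_rate(scores, t) for t in thresholds}


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(name, rates_by_threshold(scores))
```

With a 50 percent cutoff, ZeroGPT would flag four of the ten human-written articles and Sidekicker all ten. Because the sample contains no AI-generated texts, the same calculation cannot say anything about how often each tool catches real AI output.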
What to watch
Watch for follow-up tests that include confirmed AI-generated samples to measure true positive rates; the Authors Guild’s current sample measured only how detectors classified human-written texts. Also watch how vendors change thresholds and whether publishers adopt requirements to disclose detector methods and offer appeals, as the Guild recommends.
The test provides a concrete baseline: ten articles written before generative AI went mainstream, two detectors that labeled all of them human, one that flagged all of them as mostly AI, and others producing mixed results. Those outcomes should guide decisions about whether and how to use these tools in editorial and contractual contexts.
| Item | ZeroGPT | | Sidekicker | | |
|---|---|---|---|---|---|
| Obscenity Petitions Dismissed | 14.3% | 0.0% | 85.0% | 0.0% | 0.0% |
| Antitrust Litigation & Publications | 5.3% | 0.0% | 100.0% | 0.0% | 0.0% |
| Warhol Fair Use Letter | 40.7% | 0.0% | 79.0% | 0.0% | 0.0% |
| Copyright Claims Board | 28.1% | 0.0% | 96.0% | 0.0% | 0.0% |
| Banned Books Club | 64.5% | 1.0% | 71.0% | 0.0% | 0.0% |
| Kiss Library Piracy Lawsuit | 26.5% | 1.0% | 71.0% | 7.0% | 0.0% |
| Obituary: Joan Didion | 66.0% | 0.0% | 82.0% | 9.0% | 0.0% |
| Erdrich Pulitzer Prize | 76.3% | 0.0% | 100.0% | 0.0% | 0.0% |
| Support Authors & Literary Arts | 50.6% | 0.0% | 92.0% | 0.0% | 0.0% |
| The Roundup 12/2020 | 18.1% | 0.0% | 96.0% | 0.0% | 0.0% |
Written by The Brieftide · Source: The Decoder