Authors Guild test: Pangram and Grammarly spot human writing

The Authors Guild ran ten 2020–2022 articles through detectors; Pangram and Grammarly labeled every text human while Sidekicker flagged all ten as mostly AI-generated.

The Brieftide

TL;DR

  • The Authors Guild ran ten 2020–2022 articles through detectors; Pangram and Grammarly labeled every text human while Sidekicker flagged all ten as mostly AI-generated.
  • Pangram and Grammarly correctly labeled all ten Guild articles as human-written, while Sidekicker marked every item as mostly AI, and ZeroGPT produced inconsistent, sometimes high AI scores.
  • Originality.ai "also performed well." The sample consisted of ten articles published between 2020 and 2022, before generative AI went mainstream.

The Authors Guild ran ten human-written articles published between 2020 and 2022 through five AI-detection tools and found stark differences: Pangram and Grammarly identified every article as human, while Sidekicker flagged every article as mostly AI-generated, two of them at 100 percent.

What did the Authors Guild test find?

Pangram and Grammarly correctly labeled all ten Guild articles as human-written, while Sidekicker marked every item as mostly AI, and ZeroGPT produced inconsistent, sometimes high AI scores. Originality.ai "also performed well." The sample consisted of ten articles published between 2020 and 2022, before generative AI went mainstream.

The test table shows wide variance across individual pieces. For example, Sidekicker scored "Antitrust Litigation & Publications" at 100.0 percent AI while ZeroGPT scored that same article at 5.3 percent and Originality.ai at 0.0 percent. In other rows, Sidekicker returned 85.0 percent for "Obscenity Petitions Dismissed," while Pangram and Grammarly returned 0.0 percent for that piece.

How did each detector behave and why do they differ?

Pangram and Grammarly produced no false positives in this sample: both returned 0.0 percent AI for most of the ten articles and identified every text as human. Originality.ai often returned 0.0 percent as well, and is described as having "performed well." ZeroGPT reported nonzero percentages for all ten articles, from 5.3 percent on "Antitrust Litigation & Publications" to 76.3 percent on "Erdrich Pulitzer Prize" (with 66.0 percent on "Obituary: Joan Didion" and 64.5 percent on "Banned Books Club"), which are unreliably high AI percentages for clearly human texts. Sidekicker flagged every article as mostly AI-generated, with values like 96.0 percent for "Copyright Claims Board" and 100.0 percent for both "Antitrust Litigation & Publications" and "Erdrich Pulitzer Prize."
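The contrast between ZeroGPT and Sidekicker can be checked with a few lines of arithmetic. The sketch below is illustrative only: the scores are copied from the article's table, the assignment of columns to ZeroGPT and Sidekicker is inferred from the figures quoted in the text, and the 50 percent cutoff for "mostly AI" is an assumption rather than a threshold published by either vendor. The names `summarize` and `MOSTLY_AI` are hypothetical.

```python
from statistics import mean

# Percent-AI scores for the ten human-written articles, in table order.
# Column identity is inferred from the values quoted in the article text.
ZEROGPT = [14.3, 5.3, 40.7, 28.1, 64.5, 26.5, 66.0, 76.3, 50.6, 18.1]
SIDEKICKER = [85.0, 100.0, 79.0, 96.0, 71.0, 71.0, 82.0, 100.0, 92.0, 96.0]

# Assumption: "mostly AI" means a score above 50 percent.
MOSTLY_AI = 50.0


def summarize(scores, threshold=MOSTLY_AI):
    """Return min, max, mean, and the number of scores above the threshold.

    Every article in the sample is human-written, so each score above
    the threshold counts as a false positive.
    """
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": round(mean(scores), 2),
        "false_positives": sum(1 for s in scores if s > threshold),
    }


if __name__ == "__main__":
    for name, scores in (("ZeroGPT", ZEROGPT), ("Sidekicker", SIDEKICKER)):
        print(name, summarize(scores))
```

Under that assumed cutoff, Sidekicker misclassifies all ten articles and ZeroGPT misclassifies four, which matches the article's description of one detector flagging everything and another producing mixed results.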

Pangram CEO Max Spero described his detector as a black box but argued that models betray themselves through uniformity: "Language models do give themselves away through uniformity, though, especially in how they build arguments. Humans write with far more variety." Pangram aims to minimize false positives. The Authors Guild cautioned that many professionally written texts share statistical patterns with model output because language models were trained on that kind of writing, so detection is not straightforward.

Why it matters

Detectors can produce false positives that have real consequences: "False positives can cost authors their contracts and their reputations," the Guild warns. Publishers and institutions that act on detector output risk unfairly penalizing skilled writers whose concise, polished prose resembles model output. The Guild recommends disclosing methods and giving authors a chance to defend themselves, since the tools change constantly and accuracy cannot be assumed.

The test also shows a trade-off between minimizing false positives and catching AI-written material. Tools that return zero AI for human texts may still miss AI-generated content, while tools that aggressively flag content risk false accusations. That split matters for publishers, editors, and authors negotiating contracts or enforcing disclosure rules.

What to watch

Watch for follow-up tests that include confirmed AI-generated samples to measure true positive rates; the Authors Guild’s current sample measured only how detectors classified human-written texts. Also watch how vendors change thresholds and whether publishers adopt requirements to disclose detector methods and offer appeals, as the Guild recommends.
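To see why threshold changes matter, consider how the false-positive rate on this all-human sample moves as the cutoff shifts. The sketch below uses the ZeroGPT scores from the article's table (column identity inferred from the figures quoted in the text). The candidate thresholds are hypothetical and are not values any vendor is known to use. Because the sample contains no AI-generated texts, this says nothing about how many AI texts each threshold would miss.

```python
# ZeroGPT percent-AI scores for the ten human-written articles, in table order.
ZEROGPT_SCORES = [14.3, 5.3, 40.7, 28.1, 64.5, 26.5, 66.0, 76.3, 50.6, 18.1]


def false_positive_rate(scores, threshold):
    """Fraction of human-written texts scored above the threshold."""
    flagged = sum(1 for s in scores if s > threshold)
    return flagged / len(scores)


if __name__ == "__main__":
    # Hypothetical cutoffs chosen only to illustrate the trend.
    for threshold in (25, 50, 75, 90):
        rate = false_positive_rate(ZEROGPT_SCORES, threshold)
        print(f"threshold {threshold}%: false-positive rate {rate:.0%}")
```

On these ten scores, raising the cutoff from 25 to 90 percent drops the false-positive rate from 70 percent to zero. A follow-up test with confirmed AI samples would be needed to see what that stricter cutoff costs in missed detections.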

The test provides a concrete baseline: ten pre-AI-era articles, two detectors that labeled all of them human, one that flagged all of them as mostly AI, and others producing mixed results. Those outcomes should guide decisions about whether and how to use these tools in editorial and contractual contexts.

Authors Guild test: detector scores for ten human-written articles (percent AI)

| Item | ZeroGPT | (unlabeled) | Sidekicker | (unlabeled) | (unlabeled) |
| Obscenity Petitions Dismissed | 14.3% | 0.0% | 85.0% | 0.0% | 0.0% |
| Antitrust Litigation & Publications | 5.3% | 0.0% | 100.0% | 0.0% | 0.0% |
| Warhol Fair Use Letter | 40.7% | 0.0% | 79.0% | 0.0% | 0.0% |
| Copyright Claims Board | 28.1% | 0.0% | 96.0% | 0.0% | 0.0% |
| Banned Books Club | 64.5% | 1.0% | 71.0% | 0.0% | 0.0% |
| Kiss Library Piracy Lawsuit | 26.5% | 1.0% | 71.0% | 7.0% | 0.0% |
| Obituary: Joan Didion | 66.0% | 0.0% | 82.0% | 9.0% | 0.0% |
| Erdrich Pulitzer Prize | 76.3% | 0.0% | 100.0% | 0.0% | 0.0% |
| Support Authors & Literary Arts | 50.6% | 0.0% | 92.0% | 0.0% | 0.0% |
| The Roundup 12/2020 | 18.1% | 0.0% | 96.0% | 0.0% | 0.0% |

Written by The Brieftide · Source: The Decoder
