Skip to content

Benchmark and results

slopscore is evaluated on a committed, taxonomy-graded benchmark and on a held-out real-world slice. Reproduce with python scripts/eval/report.py; full detail is in eval/RESULTS.md and the labeling rubric is in eval/RUBRIC.md.

Headline numbers (rule scorer, the default)

Set n AUROC PR-AUC TPR@1%FPR ECE
benchmark (overt slop, in-sample) 128 0.89 0.91 0.65 0.18
Wikipedia AI-Cleanup (held-out, real wild slop) 40 0.69 0.65 0.00 0.39

slopscore separates overt formulaic slop from clean prose well. On real Wikipedia cases it is only moderately better than chance and catches almost none at a strict 1%-false-positive operating point. That gap is a real limitation, and it is why the accuracy claims stay modest.

Fairness keeps the rule scorer the default

Per-subgroup false-positive rate on the benchmark:

Subgroup n rules FPR ml FPR
general 100 0.00 0.06
simple_english 14 0.00 0.71
non_native 14 0.00 0.33

The learned model (--scorer ml) edges the rule scorer on raw metrics but over-flags plain and non-native English. The replace-if-wins gate keeps the transparent rule scorer as the default; the learned scorer stays opt-in. See Limitations & authorship.