Benchmark and results¶

slopscore is evaluated on a committed, taxonomy-graded benchmark and on a held-out real-world slice. Reproduce with python scripts/eval/report.py; full detail is in eval/RESULTS.md and the labeling rubric is in eval/RUBRIC.md.

Headline numbers (rule scorer, the default)¶

Set	n	AUROC	PR-AUC	TPR@1%FPR	ECE
benchmark (overt slop, in-sample)	128	0.89	0.91	0.65	0.18
Wikipedia AI-Cleanup (held-out, real wild slop)	40	0.69	0.65	0.00	0.39

slopscore separates overt formulaic slop from clean prose well. On real Wikipedia cases it is only moderately better than chance and catches almost none at a strict 1%-false-positive operating point. That gap is a real limitation, and it is why the accuracy claims stay modest.

Fairness keeps the rule scorer the default¶

Per-subgroup false-positive rate on the benchmark:

Subgroup	n	ml FPR
general	100	0.06
simple_english	14	0.71
non_native	14	0.33

The learned model (--scorer ml) edges the rule scorer on raw metrics but over-flags plain and non-native English. The replace-if-wins gate keeps the transparent rule scorer as the default; the learned scorer stays opt-in. See Limitations & authorship.