Posted to ruleqa@spamassassin.apache.org by Henrik K <he...@hege.li> on 2021/04/28 14:35:05 UTC

corpus quality

Are these still a thing?

https://ruleqa.spamassassin.org/20210428-r1889258-n/HK_NAME_DRUGS/detail

Can ena-corpus be taken seriously if 40% of it is probably identical
messages to different recipients?  Or some spamtrap that has subscribed to
all possible "enhancement" sites?

I guess it works for that rule, little harm scoring it high..  but because
of the corpus skew, will other good rules that seem to hit percentually
less, score less?  Just thinking aloud..
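
To make the skew concern concrete, here is a minimal sketch of how duplicate messages can inflate one rule's apparent hit rate relative to others. Everything in it is an assumption for illustration: the tiny made-up corpus, the OTHER_RULE name, and the idea of fingerprinting a message by hashing its whitespace-normalized body. It is not how ruleqa or mass-check actually count hits.

```python
import hashlib
import re
from collections import defaultdict


def body_fingerprint(body: str) -> str:
    """Hash a body after lowercasing and collapsing whitespace (assumed heuristic)."""
    normalized = re.sub(r"\s+", " ", body.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(messages):
    """Keep only the first message seen for each body fingerprint."""
    seen = set()
    unique = []
    for body, rules in messages:
        fp = body_fingerprint(body)
        if fp not in seen:
            seen.add(fp)
            unique.append((body, rules))
    return unique


def hit_rates(messages):
    """Return {rule name: fraction of messages the rule hit}."""
    counts = defaultdict(int)
    for _, rules in messages:
        for rule in rules:
            counts[rule] += 1
    total = len(messages)
    return {rule: n / total for rule, n in counts.items()}


# Hypothetical corpus: 4 copies of one drug spam (40% of 10 messages),
# plus 6 distinct messages, 3 of which hit a made-up OTHER_RULE.
corpus = (
    [("Buy  cheap pills NOW", {"HK_NAME_DRUGS"})] * 4
    + [(f"other spam {i}", {"OTHER_RULE"} if i < 3 else set()) for i in range(6)]
)

raw_rates = hit_rates(corpus)
deduped_rates = hit_rates(dedupe(corpus))

print("raw:", raw_rates)
print("deduped:", deduped_rates)
```

In this toy data the duplicated message makes HK_NAME_DRUGS look like the stronger rule on the raw corpus, while OTHER_RULE hits more of the distinct messages once duplicates are collapsed. Whether the real ena corpus behaves this way would need checking against the actual data.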

-hk


Re: corpus quality

Posted by Henrik K <he...@hege.li>.
On Thu, Apr 29, 2021 at 10:53:00PM -0400, Kevin A. McGrail wrote:
>
> ENA has a lot of customers in the education space as I remember so it might be
> a different corpora set than others.

I don't think that explains how 40% of the corpus is very specific
drug-spam, unless we start making demographic-related jokes.. :-)


Re: corpus quality

Posted by "Kevin A. McGrail" <km...@apache.org>.
ENA has a lot of customers in the education space as I remember so it might
be a different corpora set than others.
--
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


On Wed, Apr 28, 2021 at 10:35 AM Henrik K <he...@hege.li> wrote:

>
> Are these still a thing?
>
> https://ruleqa.spamassassin.org/20210428-r1889258-n/HK_NAME_DRUGS/detail
>
> Can ena-corpus be taken seriously if 40% of it is probably identical
> messages to different recipients?  Or some spamtrap that has subscribed to
> all possible "enhancement" sites?
>
> I guess it works for that rule, little harm scoring it high..  but because
> of the corpus skew, will other good rules that seem to hit percentually
> less, score less?  Just thinking aloud..
>
> -hk
>
>