Posted to dev@spamassassin.apache.org by Daniel Quinlan <qu...@pathname.com> on 2004/07/05 22:34:37 UTC

corpus policy

So, before we make the pre2 release and start mass-checks, there's one
thing I want to nail down in the corpus policy: should we just remove
any spam list that has tons of false positives?

Removing the SpamAssassin ones is just common sense, but I looked at my
false positives and 59 out of 102 of my false positives are from another
anti-spam mailing list that frequently includes snippets of spam, URLs,
etc.  My other FPs are pretty well spread out.  A few come from other
spam-related lists, like the spf-discuss list, but nothing close to
59/102.
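
For anyone who wants to run the same check on their own false positives, here is a rough sketch in Python of the per-list breakdown described above. It assumes each false positive is available as raw message text and that the lists in question set a List-Id header (not every list does); the names `list_of` and `fp_share_by_list` are made up for illustration and are not part of SpamAssassin.

```python
from collections import Counter
from email import message_from_string


def list_of(raw_message: str) -> str:
    """Return the List-Id header of a message, or '(no list)' if absent."""
    msg = message_from_string(raw_message)
    return (msg.get("List-Id") or "(no list)").strip()


def fp_share_by_list(raw_messages):
    """Return [(list_id, count, fraction_of_all_fps)], largest first."""
    counts = Counter(list_of(raw) for raw in raw_messages)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]
```

A single list holding more than half of the total, as with 59 out of 102 here, would appear as the first entry with a fraction above 0.5.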

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: corpus policy

Posted by Daniel Quinlan <qu...@pathname.com>.
Daniel Quinlan wrote:

>> So, before we make the pre2 release and start mass-checks, there's one
>> thing I want to nail down in the corpus policy: should we just remove
>> any spam list that has tons of false positives?
 
Theo Van Dinter <fe...@kluge.net> writes:

> It would depend what the FPs are from I'd say.

Well, I'd rather just have a hard-and-fast rule such as "remove
anti-spam mailing lists if spam snippets or domain names are frequently
posted".  If we're going to remove any FPs (by rule or message), then
there's really no point in including the other messages from those
lists because they won't affect the perceptron results.
 
>> Removing the SpamAssassin ones is just common sense, but I looked at my
>> false positives and 59 out of 102 of my false positives are from another
>> anti-spam mailing list that frequently includes snippets of spam, URLs,

> Ah.  IMO, any spam-related mails have no place in a ham corpus.
> They're not going to be considered "standard" for most people, and as
> you've said, they have a large tendency to include spam snippets/etc
> that cause filters to go all gonzo.

Well, since you ask... out of the non-net FPs:

35      BIZ_TLD
19      DRUGS_ERECTILE
18      FORGED_RCVD_HELO
15      DRUGS_ANXIETY
12      DRUGS_ANXIETY_EREC
11      DRUGS_PAIN
10      INFO_TLD
9       DRUGS_ERECTILE_OBFU
9       DRUGS_ANXIETY_OBFU
8       MAILTO_TO_SPAM_ADDR
6       NORMAL_HTTP_TO_IP
6       DRUGS_PAIN_EREC
6       DOMAIN_4U2
5       HTTP_EXCESSIVE_ESCAPES
5       FROM_NO_LOWER
5       DRUGS_MANYKINDS
5       DRUGS_DIET_EREC
5       DRUGS_DIET
...

and with network tests, there are amazingly only 5 false positives from
that list instead of 59 because we rely on the body a lot less:

2       BIZ_TLD
2       DOMAIN_4U2
2       DOMAIN_RATIO
2       FROM_NO_LOWER
2       INFO_TLD
2       URI_OFFERS

Clearly, domain names are the main issue here.  Snippets of drug spam
are popular too.
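
A tally like the two above can be produced with a few lines of Python. This is only a sketch under assumptions: it treats each log line as whitespace-separated fields (result flag, score, message path, comma-separated rule names, then optional key=value metadata), which is loosely modeled on mass-check output but not checked against it, and it assumes every line has a non-empty rules field. The function names and the path-substring filter are hypothetical.

```python
from collections import Counter


def tally_rules(log_lines, path_substring):
    """Count rule hits over log lines whose path contains path_substring."""
    hits = Counter()
    for line in log_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not enough fields under the assumed layout
        path, rules = fields[2], fields[3]
        if path_substring in path:
            hits.update(r for r in rules.split(",") if r)
    return hits


def format_tally(hits):
    """Render counts as 'count   RULE' lines, most frequent first."""
    return "\n".join(f"{n:<8}{rule}" for rule, n in hits.most_common())
```

Passing only the false-positive lines and a substring that identifies the list's folder gives per-rule counts for that list alone.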

I think removing them *all* would be better and would better match
actual practice (while I do tag them with SA headers, I don't filter
this list or the SpamAssassin lists).

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: corpus policy

Posted by Theo Van Dinter <fe...@kluge.net>.
On Mon, Jul 05, 2004 at 01:34:37PM -0700, Dan Quinlan wrote:
> So, before we make the pre2 release and start mass-checks, there's one
> thing I want to nail down in the corpus policy: should we just remove
> any spam list that has tons of false positives?

It would depend what the FPs are from I'd say.

> Removing the SpamAssassin ones is just common sense, but I looked at my
> false positives and 59 out of 102 of my false positives are from another
> anti-spam mailing list that frequently includes snippets of spam, URLs,

Ah.  IMO, any spam-related mails have no place in a ham corpus.
They're not going to be considered "standard" for most people, and as
you've said, they have a large tendency to include spam snippets/etc
that cause filters to go all gonzo.

-- 
Randomly Generated Tagline:
"To love another person is to see the face of God."   - Les Miserables