Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/06/23 22:52:00 UTC

Re: Re[2]: proposed changes to CORPUS_POLICY

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Daniel Quinlan writes:
> Bob Menschel <Ro...@Menschel.net> writes:
> 
> > I can see the reason for most of Daniel's suggestions, and while I
> > think 12 months is too short a period for ham (I'd favor 18 or 24
> > months), I could live with that.
> 
> I might be able to live with 18, but I think we should stick with 12
> because of the network tests (which are on for 2 of the 3 mass-check
> runs if I recall correctly).  The problem is that you get more and more
> mail that is no longer representative of the current sender
> configuration: SPF negative, host no longer exists, IP address has
> changed, etc.

Yes, I agree -- this is the problem with older ham.  (esp. the SPF
problem.  SPF is very brittle on this point.)

How's about putting stricter limits on the net check corpora?

I would suggest though that Malte's point is also valid -- some "special
case" reported FP mails should be kept in the ham corpus, if they really
are special cases that the submitter is worried about.

> > Ham bounces (valid bounces of ham sent from our systems) are ham, and
> > should be in the ham corpus.  Spam bounces (blind bounces of spam sent
> > back to forged or faked from addresses) are spam, often containing the
> > content of the spam as well as the notification.
> 
> I agree those are spam, but since those can be addressed with techniques
> like envelope rewriting that are 100% reliable and non-probabilistic, I
> think we should just remove them.

And the ham?  I'm +1 on keeping ham bounces.

Spam bounces, however, I don't think should be used in the corpus at
all.

> >>> 5. no mailing list moderation administrative messages since these also
> >>>    contain spam
> > 
> > They also contain ham. If a system administrator can differentiate
> > between them, why shouldn't the spam messages be in a spam corpus, and
> > the ham messages in a ham corpus?
> 
> Moderators can't ignore either type of moderation message for a large
> proportion of mailing list software (especially mailman).  If anything,
> they should all be ham and I don't think we want to do that.  I think
> it's better to just remove them.

OK, I've come around to that view BTW.  +1

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFA2e1wQTcbUG5Y7woRApGIAJ96HbTdMromHvsVa/gH1BOev1FtvgCgtbDM
dngT9ZZmVyR1VUa1MKwgT9U=
=WjwV
-----END PGP SIGNATURE-----


Re: Re[2]: proposed changes to CORPUS_POLICY

Posted by Daniel Quinlan <qu...@pathname.com>.
jm@jmason.org (Justin Mason) writes:

> Yes, I agree -- this is the problem with older ham.  (esp. the SPF
> problem.  SPF is very brittle on this point.)
> 
> How's about putting stricter limits on the net check corpora?

Well, do we really want to use an extra 6 months on only one of the
runs?  I think it would be better to use more or less the same data.
 
> I would suggest though that Malte's point is also valid -- some "special
> case" reported FP mails should be kept in the ham corpus, if they really
> are special cases that the submitter is worried about.

Yes, I *am* keeping my non-SpamAssassin-list spam-related mail in the
corpus.  The main reason to remove the SpamAssassin list mail is that
we'll totally bias the corpus; I'm sure we'll have more than enough FPs
for iffy rules by virtue of our everyday mail.
 
> And the ham?  I'm +1 on keeping ham bounces.

Agreed, I am keeping ham bounces.
 
Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/