Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/06/23 22:52:00 UTC
Re: Re[2]: proposed changes to CORPUS_POLICY
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Daniel Quinlan writes:
> Bob Menschel <Ro...@Menschel.net> writes:
>
> > I can see the reason for most of Daniel's suggestions, and while I
> > think 12 months is too short a period for ham (I'd favor 18 or 24
> > months), I could live with that.
>
> I might be able to live with 18, but I think we should stick with 12
> because of the network tests (which are on for 2 of the 3 mass-check
> runs if I recall correctly). The problem is that you get more and more
> mail that is no longer representative of the current sender
> configuration: SPF negative, host no longer exists, IP address has
> changed, etc.
Yes, I agree -- this is the problem with older ham. (esp. the SPF
problem. SPF is very brittle on this point.)

How's about putting stricter limits on the net check corpora?

I would suggest though that Malte's point is also valid -- some "special
case" reported FP mails should be kept in the ham corpus, if they really
are special cases that the submitter is worried about.
> > Ham bounces (valid bounces of ham sent from our systems) are ham, and
> > should be in the ham corpus. Spam bounces (blind bounces of spam sent
> > back to forged or faked from addresses) are spam, often containing the
> > content of the spam as well as the notification.
>
> I agree those are spam, but since those can be addressed with techniques
> like envelope rewriting that are 100% reliable and non-probabilistic, I
> think we should just remove them.
And the ham? I'm +1 on keeping ham bounces.
Spam bounces, however, I don't think should be used in the corpus at
all.
> >>> 5. no mailing list moderation administrative messages since these also
> >>> contain spam
> >
> > They also contain ham. If a system administrator can differentiate
> > between them, why shouldn't the spam messages be in a spam corpus, and
> > the ham messages in a ham corpus?
>
> Moderators can't ignore either type of moderation message for a large
> proportion of mailing list software (especially mailman). If anything,
> they should all be ham and I don't think we want to do that. I think
> it's better to just remove them.
OK, I've come around to that view BTW. +1
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFA2e1wQTcbUG5Y7woRApGIAJ96HbTdMromHvsVa/gH1BOev1FtvgCgtbDM
dngT9ZZmVyR1VUa1MKwgT9U=
=WjwV
-----END PGP SIGNATURE-----
Re: Re[2]: proposed changes to CORPUS_POLICY
Posted by Daniel Quinlan <qu...@pathname.com>.
jm@jmason.org (Justin Mason) writes:
> Yes, I agree -- this is the problem with older ham. (esp. the SPF
> problem. SPF is very brittle on this point.)
>
> How's about putting stricter limits on the net check corpora?
Well, do we really want to use an extra 6 months on only one of the
runs? I think it would be better to use more or less the same data.
> I would suggest though that Malte's point is also valid -- some "special
> case" reported FP mails should be kept in the ham corpus, if they really
> are special cases that the submitter is worried about.
Yes, I *am* keeping my non-SpamAssassin-list spam-related mail in the
corpus. The main reason to remove the SpamAssassin list mail is that
we'll totally bias the corpus; I'm sure we'll have more than enough FPs
for iffy rules by virtue of our everyday mail.
> And the ham? I'm +1 on keeping ham bounces.
Agreed, I am keeping ham bounces.
Daniel
--
Daniel Quinlan
http://www.pathname.com/~quinlan/
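
To make the ham age cutoff debated in this thread concrete, here is a minimal sketch. It is not SpamAssassin's actual mass-check code. It assumes the message's own Date header is a good enough proxy for age (in practice that header can be missing or wrong), and the names is_recent_ham and max_age_days are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def is_recent_ham(raw_message, now, max_age_days=365):
    """Return True if the Date header is no older than max_age_days before `now`.

    Hypothetical helper: `now` must be a timezone-aware datetime.
    Messages with a missing or unparseable Date header are rejected.
    """
    msg = message_from_string(raw_message)
    date_header = msg.get("Date")
    if date_header is None:
        return False
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return False
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent <= timedelta(days=max_age_days)


if __name__ == "__main__":
    now = datetime(2004, 6, 23, tzinfo=timezone.utc)
    sample = "Date: Sat, 01 Mar 2003 10:00:00 +0000\nSubject: hi\n\nbody\n"
    print(is_recent_ham(sample, now))                    # 12-month limit
    print(is_recent_ham(sample, now, max_age_days=548))  # roughly 18 months
```

Passing max_age_days=548 approximates the 18-month limit Bob suggested, so the same check can be used to compare how much ham each proposal would keep.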