Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/12/01 19:21:14 UTC

Re: 3.0.5 rescoring

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Michael Monnerie writes:
> On Tuesday, 22 November 2005 06:44 Theo Van Dinter wrote:
> > So basically Justin is 34%, I'm 31%, and everyone else combined is
> > 35%.
> 
> I could send you my hand sorted SPAM, if you like. It's only ~3000 
> SPAMs, but maybe worth it - more and more german language SPAM coming 
> into my honeypots.
> 
> For a more differentiated SPAM score, more people would have to commit 
> their SPAM. If that process would be well documented, and people 
> encouraged to do so, and the process is as easy as calling a script, I 
> believe you could have a lot of reporters.

Actually, the problem that Theo is highlighting is not that we lack
contributors running rescoring mass-checks on smaller corpora; we have
them (and more are definitely welcome!).

The problem is that these small corpora become "background noise" compared
to the two big, 700k-message corpora -- mine (34%) and Theo's (31%).
What we need to do to fix this problem is come up with ways to avoid
letting big corpora "drown out" the little ones.

I think if we limit each corpus to a certain max percentage of the total,
we could do this -- e.g. if a corpus makes up more than (100 /
num_contributors)% of the total, then any excess above that percentage is
dropped, favouring recent mails over older ones.  (This post-processing
step is doable with mass-check logs, btw; we can write a script to do it.)
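
As a rough illustration of the capping step described above, here is a
minimal Python sketch. It is not an existing SpamAssassin script: the
function name cap_corpora and the (contributor, timestamp, log_line) tuple
layout are assumptions made for the example, and the real mass-check log
format would need its own parsing. The cap is computed once against the
original total, which is only one possible reading of the proposal.

```python
from collections import defaultdict


def cap_corpora(entries, num_contributors=None):
    """Cap each contributor at an equal share of the original total.

    entries: iterable of (contributor, timestamp, log_line) tuples.
    This layout is hypothetical; it is not the mass-check log format.
    Returns a dict mapping contributor -> list of (timestamp, log_line),
    newest first.
    """
    by_contributor = defaultdict(list)
    for contributor, timestamp, line in entries:
        by_contributor[contributor].append((timestamp, line))
    if not by_contributor:
        return {}

    total = sum(len(items) for items in by_contributor.values())
    n = num_contributors or len(by_contributor)
    # (100 / n)% of the original total, expressed as a message count.
    cap = total // n

    kept = {}
    for contributor, items in by_contributor.items():
        # Sort newest first so the excess that is dropped is the oldest mail.
        items.sort(key=lambda item: item[0], reverse=True)
        kept[contributor] = items[:cap]
    return kept
```

Corpora already below the cap pass through untouched, so only the large
ones shrink. Because the cap is taken from the original total, a large
corpus can still exceed (100 / n)% of the reduced set; repeating the pass
or picking a lower cap would tighten that.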

The downside would be that we would then have "only" a 700,000-message
corpus (or so) instead of a 2,000,000-message one.  Henry, is that OK?

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDjz8aMJF5cimLx9ARAnzkAKCn6114fMkEqYby6QuDyd0V2x46gACdH8FN
p2axrU0h3iTd9evP8aUhFS4=
=iKC1
-----END PGP SIGNATURE-----


Re: 3.0.5 rescoring

Posted by Henry Stern <he...@stern.ca>.
I'd expect that the 700k message corpus will be more prone to errors
than the 2M message corpus.  It still might be good enough.

I'm not convinced that rescoring (as opposed to putting in new rules)
will do much for 3.0.5's accuracy.  If people really want to go to the
trouble of running the mass-checks, I won't say no to generating the
scores.  However, I can't promise that they will be any good.

Cheers,
Henry

Re: 3.0.5 rescoring

Posted by Michael Monnerie <m....@zmi.at>.
On Thursday, 1 December 2005 19:21 Justin Mason wrote:
> I think if we limit each corpus to a certain max percentage of the
> total, we could do this -- e.g. if a corpus makes up more than (100 /
> num_contributors)%, then any excess above that percentage is dropped,

You're not telling me that your 700k messages are hand-sorted? How old
are you ;-)

Anyway, more contributors would help with the problem. Imagine you get
100 contributors, each with just 2000 messages. And I believe there are a
lot of people out there who already have a bigger corpus. Making it
easier to contribute (and encouraging people to report) could help.

If your two corpora are so big, I guess setting a time limit to take only
the last 180 days or so of all SPAM could reduce your dominance in the
percentages.
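
The time-limit idea could be sketched in Python roughly as follows. This
is only an illustration under stated assumptions: the function name
within_window and the (contributor, received, log_line) tuple layout are
made up for the example, with received being a datetime taken from
wherever the message date is recorded.

```python
from datetime import datetime, timedelta


def within_window(entries, now, days=180):
    """Keep only entries received within the last `days` days.

    entries: iterable of (contributor, received, log_line) tuples, where
    received is a datetime. This layout is hypothetical.
    """
    cutoff = now - timedelta(days=days)
    # Messages received exactly at the cutoff are kept.
    return [entry for entry in entries if entry[1] >= cutoff]


# Usage: drop everything older than 180 days before 1 December 2005.
sample = [
    ("big", datetime(2005, 11, 1), "recent message"),
    ("big", datetime(2005, 1, 1), "old message"),
]
recent = within_window(sample, now=datetime(2005, 12, 1))
```

A filter like this shrinks any corpus that is mostly older mail, whatever
its size, so it could be combined with a per-contributor cap rather than
replace it.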

Regards, zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at           Tel: 0660/4156531          Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net                 Key-ID: 0x70545879