Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/10/30 03:37:37 UTC

Re: corpora, again

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


"Daryl C. W. O'Shea" writes:
> Justin Mason wrote:
> > again:
> > Can anyone provide corpora for the preflight mass-checker?
> > 
> > --j.
> 
> I could provide about 100 messages/day, in an mbox file, scored over 15 
> without checking them first.  If that's useful let me know where to 
> rsync them and how far back to go.

Unfortunately, it wouldn't be a good plan to use that.

For rule QA purposes, it really needs to be a corpus of all spam,
including the low-scoring stuff and FNs, otherwise the reported
hitrates on rules will be skewed...
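To make the skew concrete, here is a small sketch. The scores and the rule names RULE_A and RULE_B are invented for illustration and do not come from any real corpus or ruleset. It compares a rule's hit rate measured over all spam with the hit rate measured only over spam scoring above 15.

```python
def hit_rate(corpus, rule):
    """Fraction of messages in corpus on which the given rule fired."""
    return sum(rule in rules for _, rules in corpus) / len(corpus)


# (score, rules hit) pairs; every value here is a made-up toy example.
all_spam = [
    (22.0, {"RULE_A", "RULE_B"}),
    (18.5, {"RULE_A"}),
    (16.1, {"RULE_A", "RULE_B"}),
    (7.2, {"RULE_B"}),   # low-scoring spam
    (5.4, {"RULE_B"}),   # low-scoring spam
    (3.1, set()),        # spam that no rule caught (a false negative)
]

# The subset that a "scored over 15" corpus would contain.
high_only = [msg for msg in all_spam if msg[0] > 15]

for rule in ("RULE_A", "RULE_B"):
    print(rule, hit_rate(all_spam, rule), hit_rate(high_only, rule))
```

In this toy data RULE_A fires on half of all spam but on every message in the high-scoring subset, so a corpus restricted to high scorers would overstate how much spam it catches.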

Doc, Michael -- thanks!  I'll get some rsync thing set up ASAP.

(It'll be a new "directory" on the rsync server, and the idea will be to
rsync whatever mails you want mass-checked, one-file-per-mail, up to that
directory as frequently as you like.)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDZDHxMJF5cimLx9ARAnW8AJ9vh1RJLJZCPzVwYfYKQa0Wc7xgswCgsu3d
uIeJuQkxcgvvUhSAC51qOu8=
=51zf
-----END PGP SIGNATURE-----


Re: corpora, again

Posted by Doc Schneider <ma...@maddoc.net>.
Justin Mason wrote:

> "Daryl C. W. O'Shea" writes:
> 
>>Justin Mason wrote:
>>
>>>again:
>>>Can anyone provide corpora for the preflight mass-checker?
>>>
>>>--j.
>>
>>I could provide about 100 messages/day, in an mbox file, scored over 15 
>>without checking them first.  If that's useful let me know where to 
>>rsync them and how far back to go.
> 
> 
> Unfortunately, it wouldn't be a good plan to use that.
> 
> For rule QA purposes, it really needs to be a corpus of all spam,
> including the low-scoring stuff and FNs, otherwise the reported
> hitrates on rules will be skewed...
> 
> Doc, Michael -- thanks!  I'll get some rsync thing set up ASAP.
> 
> (It'll be a new "directory" on the rsync server, and the idea will be to
> rsync whatever mails you want mass-checked, one-file-per-mail, up to that
> directory as frequently as you like.)
> 
> - --j.

So you want it in Maildir and not mbox format? I've got several thousand 
recent spams, but they're all in mbox (I check them through IMAP).

Suppose I can split them, though.
One note: You're going to create different directories for each of us, 
right? That will prevent collisions.
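One possible way to do the split, sketched with Python's standard mailbox module. This is not the project's tooling. The function name split_mbox, the per-contributor subdirectory, and the content-hash file names are all assumptions made for this example.

```python
import hashlib
import mailbox
from pathlib import Path


def split_mbox(mbox_path, out_root, contributor):
    """Write each message in an mbox file to its own file.

    Files land in out_root/contributor/ so that uploads from different
    people cannot overwrite each other. Returns the paths written.
    """
    out_dir = Path(out_root) / contributor
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            # Raw message bytes, without the mbox "From " separator line.
            raw = box.get_bytes(key)
            # Name the file after a hash of its content, so re-running
            # the split on the same mbox produces the same file names.
            path = out_dir / hashlib.sha1(raw).hexdigest()
            path.write_bytes(raw)
            written.append(path)
    finally:
        box.close()
    return written
```

Calling split_mbox("spam.mbox", "upload", "doc") would fill upload/doc/ with one file per message. Because the names are stable across runs, a later rsync of that directory should only need to transfer mail that was not there before.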

-Doc

Re: corpora, again

Posted by Michael Monnerie <m....@zmi.at>.
On Sunday, 30 October 2005 03:37 Justin Mason wrote:
> (It'll be a new "directory" on the rsync server, and the idea will be
> to rsync whatever mails you want mass-checked, one-file-per-mail, up
> to that directory as frequently as you like.)

Could it be mbox format instead? That would be easier to sync, as it's 
only one file then. Otherwise, I have to extract all mails before the 
sync (to a temp dir, and delete afterwards). I guess it's better to 
just sync one file, and you split it on your side.

I currently store my spam in imapd folder format, and convert it to mbox 
for the nightly mass-checks.

Regards, zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at           Tel: 0660/4156531          Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net                 Key-ID: 0x70545879