Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/10/30 03:37:37 UTC
Re: corpora, again
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"Daryl C. W. O'Shea" writes:
> Justin Mason wrote:
> > again:
> > Can anyone provide corpora for the preflight mass-checker?
> >
> > --j.
>
> I could provide about 100 messages/day, in an mbox file, scored over 15
> without checking them first. If that's useful let me know where to
> rsync them and how far back to go.
Unfortunately, it wouldn't be a good plan to use that.
For rule QA purposes, it really needs to be a corpus of all spam,
including the low-scoring stuff and FNs, otherwise the reported
hitrates on rules will be skewed...
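To illustrate the skew, here is a toy sketch with entirely made-up numbers. The rule name, the scores, and the six-message "corpus" are all hypothetical; only the over-15 cutoff comes from the offer quoted above. It compares a rule's hit rate over all spam with its hit rate over the high-scoring subset only.

```python
# Toy illustration with invented data: a corpus filtered by score can
# inflate a rule's apparent hit rate. RULE_A / RULE_B are hypothetical names.

def hit_rate(corpus, rule):
    """Fraction of messages in `corpus` on which `rule` fired."""
    if not corpus:
        return 0.0
    return sum(1 for msg in corpus if rule in msg["rules"]) / len(corpus)

# Each entry: total score and the set of rules that fired (all made up).
all_spam = [
    {"score": 22.0, "rules": {"RULE_A", "RULE_B"}},
    {"score": 18.5, "rules": {"RULE_A"}},
    {"score": 16.0, "rules": {"RULE_A", "RULE_B"}},
    {"score": 7.2, "rules": {"RULE_B"}},
    {"score": 5.1, "rules": set()},
    {"score": 3.0, "rules": {"RULE_B"}},  # low-scoring spam, i.e. an FN
]

# Keep only the messages that scored over 15, as in the offered corpus.
high_only = [msg for msg in all_spam if msg["score"] > 15]

full_rate = hit_rate(all_spam, "RULE_A")     # measured over all spam
skewed_rate = hit_rate(high_only, "RULE_A")  # measured over the filtered subset

print(f"RULE_A over all spam:      {full_rate:.2f}")
print(f"RULE_A over score>15 only: {skewed_rate:.2f}")
```

In this invented sample the rule fires on half of all spam but on every message in the filtered subset, which is the kind of distortion a score-selected corpus would feed into the reported hit rates.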
Doc, Michael -- thanks! I'll get some rsync thing set up ASAP.
(It'll be a new "directory" on the rsync server, and the idea will be to
rsync whatever mails you want mass-checked, one-file-per-mail, up to that
directory as frequently as you like.)
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFDZDHxMJF5cimLx9ARAnW8AJ9vh1RJLJZCPzVwYfYKQa0Wc7xgswCgsu3d
uIeJuQkxcgvvUhSAC51qOu8=
=51zf
-----END PGP SIGNATURE-----
Re: corpora, again
Posted by Doc Schneider <ma...@maddoc.net>.
Justin Mason wrote:
> "Daryl C. W. O'Shea" writes:
>
>>Justin Mason wrote:
>>
>>>again:
>>>Can anyone provide corpora for the preflight mass-checker?
>>>
>>>--j.
>>
>>I could provide about 100 messages/day, in an mbox file, scored over 15
>>without checking them first. If that's useful let me know where to
>>rsync them and how far back to go.
>
>
> Unfortunately, it wouldn't be a good plan to use that.
>
> For rule QA purposes, it really needs to be a corpus of all spam,
> including the low-scoring stuff and FNs, otherwise the reported
> hitrates on rules will be skewed...
>
> Doc, Michael -- thanks! I'll get some rsync thing set up ASAP.
>
> (It'll be a new "directory" on the rsync server, and the idea will be to
> rsync whatever mails you want mass-checked, one-file-per-mail, up to that
> directory as frequently as you like.)
>
> - --j.
So you want it in Maildir and not mbox format? I've got several thousand
recent spams, but they're all in mbox (I check them through IMAP).
I suppose I can split them, though.
One note: You're going to create different directories for each of us,
right? That will prevent collisions.
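For anyone else in the same position, a minimal sketch of the split, using Python's standard mailbox module. The function name, the per-contributor directory layout, and the hash-based file names are my own assumptions for illustration, not anything agreed for the rsync server.

```python
import hashlib
import mailbox
from pathlib import Path


def split_mbox(mbox_path, out_root, contributor):
    """Write each message of an mbox to its own file under out_root/contributor/.

    File names are the SHA-1 of the message bytes (an assumption made here so
    that repeated runs produce stable names and rsync only sends new mail).
    Returns the list of paths written, in mbox order.
    """
    out_dir = Path(out_root) / contributor
    out_dir.mkdir(parents=True, exist_ok=True)

    # create=False: fail loudly instead of silently creating an empty mbox.
    box = mailbox.mbox(mbox_path, create=False)
    written = []
    try:
        for key in box.iterkeys():
            data = box.get_bytes(key)  # message without the mbox "From " line
            target = out_dir / (hashlib.sha1(data).hexdigest() + ".eml")
            target.write_bytes(data)
            written.append(target)
    finally:
        box.close()
    return written
```

The resulting directory holds one file per mail and could then be pushed with rsync to whatever destination gets set up. Keeping a separate subdirectory per contributor is what would avoid name collisions between submitters.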
-Doc
Re: corpora, again
Posted by Michael Monnerie <m....@zmi.at>.
On Sunday, 30 October 2005 03:37 Justin Mason wrote:
> (It'll be a new "directory" on the rsync server, and the idea will be
> to rsync whatever mails you want mass-checked, one-file-per-mail, up
> to that directory as frequently as you like.)
Could it be mbox format instead? That would be easier to sync, as it's
only one file then. Otherwise, I have to extract all mails before the
sync (to a temp dir, and delete afterwards). I guess it's better to
just sync one file, and you split it on your side.
I currently store my spam in imapd folder format, and convert it to mbox
for the nightly mass-checks.
Regards, zmi
--
// Michael Monnerie, Ing.BSc --- it-management Michael Monnerie
// http://zmi.at Tel: 0660/4156531 Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net Key-ID: 0x70545879