Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/07/14 09:06:14 UTC

Re: SA Public Corpus

Darryl Bleau writes:
> The SA Public Corpus at spamassassin.org/publiccorpus has been a great 
> help to myself and others who like to use a standard corpus of mail to 
> evaluate new anti-spam ideas and current techniques.
> 
> However, it's now quite dated, with the newest collection being 2003/02/28.
> 
> My question is, is there a newer collection in another location that I'm 
> missing out on, or if not, are there any plans to have an updated public 
> corpus?

Well, it's pretty labour-intensive to put together -- but I suppose the
ham hasn't changed much since 2003/02, so I could just upload some newer
spam.

sound useful?

--j.


Re: SA Public Corpus

Posted by Darryl Bleau <da...@submersion.com>.
Justin Mason wrote:

> Darryl Bleau writes:
>
> >The SA Public Corpus at spamassassin.org/publiccorpus has been a great
> >help to myself and others who like to use a standard corpus of mail to
> >evaluate new anti-spam ideas and current techniques.
>
> >However, it's now quite dated, with the newest collection being 2003/02/28.
>
> >My question is, is there a newer collection in another location that I'm
> >missing out on, or if not, are there any plans to have an updated public
> >corpus?
>
>
> Well, it's pretty labour-intensive to put together -- but I suppose the
> ham hasn't changed much since 2003/02, so I could just upload some newer
> spam.
>
> sound useful?

Yes, quite. :)

The issue really isn't gathering spam... while it is a pain to manually 
verify the messages, anyone with enough time can do it. What's nice about 
the SA public corpus is that it's a common, open set of mail from a 
trusted source, which makes it quite useful when comparing results with 
others.

The only suggestion I would have for the ham would be to remove the 
SA-list (or Spam-topic ham) related messages, for the same reason that 
you don't include these types of messages in the mass checks.

On a related note, there was talk some time back (I'm not sure if it was 
on this list or not) about setting up a publicly-updated corpus using 
some sort of trust/verification mechanism. If there is interest (besides 
my own) in this sort of thing, I could look into how to get it going.


Re: SA Public Corpus

Posted by Vivek Khera <vi...@khera.org>.
On Jul 14, 2004, at 3:06 AM, Justin Mason wrote:

> Well, it's pretty labour-intensive to put together -- but I suppose the
> ham hasn't changed much since 2003/02, so I could just upload some newer
> spam.

Well, unless you say that the amount of HTML mail hasn't changed... it 
would throw off statistics on how high to score things.

From where I sit, the amount of HTML mail has gone up.
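
One rough way to check the HTML point against an actual corpus is to count
how many messages carry a text/html MIME part. The sketch below is only an
illustration, not anything SpamAssassin ships: it assumes the corpus is a
directory tree with one raw message per file, and the function names
(has_html_part, html_fraction) are made up for this example.

```python
import email
from email import policy
from pathlib import Path


def has_html_part(raw_bytes):
    """Return True if any MIME part of the message is text/html."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    return any(part.get_content_type() == "text/html" for part in msg.walk())


def html_fraction(corpus_dir):
    """Fraction of files under corpus_dir whose message has a text/html part.

    Assumes one raw message per file (an assumption about the corpus layout).
    """
    files = [p for p in Path(corpus_dir).rglob("*") if p.is_file()]
    if not files:
        return 0.0
    html_count = sum(has_html_part(p.read_bytes()) for p in files)
    return html_count / len(files)
```

Running html_fraction() on an older and a newer ham directory (whatever
paths you keep them in) would give two numbers to compare, which is a more
concrete basis for deciding whether the 2003/02 ham is still representative.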