Posted to dev@spamassassin.apache.org by Vaishnavi Sannidhanam <va...@cs.washington.edu> on 2004/11/26 23:06:05 UTC

Spam assassin corpus

Hi

I am a student at the University of Washington and I am doing a project on
classifying spam. I was wondering where I could find the SpamAssassin
corpus of ham and spam mails, and where I could also find some tools to
process these mails.

Please let me know about it.

Thank you very much,
Vaishnavi


Re: Spam assassin corpus

Posted by Henry Stern <he...@stern.ca>.
Hi Vaishnavi,

I wrote a parser for the 12,000-message SpamAssassin public corpus
(http://spamassassin.apache.org/publiccorpus) based on SpamAssassin's
Bayes code.  If you would like to use it, you can download both the
parser and a pre-tokenized corpus from
http://stern.cs.dal.ca/publiccorpus-tokenized.tar.bz2.
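
For readers who would rather process the raw messages themselves, here is a
minimal Python sketch. It is not Henry's parser and not SpamAssassin's Bayes
tokenizer; it only uses the standard library. It assumes one message per file
and separate directories for ham and spam. The function names, the token
pattern, and the directory layout are all hypothetical.

```python
import email
import re
from email import policy
from pathlib import Path

# Assumed token pattern: runs of 3+ letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']{3,}")


def tokenize(raw: bytes) -> list[str]:
    """Return lowercase tokens from the subject and plain-text body."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    parts = [str(msg.get("Subject", ""))]
    body = msg.get_body(preferencelist=("plain",))
    if body is not None:
        try:
            parts.append(body.get_content())
        except LookupError:
            # Unknown charset in a messy real-world message: skip the body.
            pass
    return [t.lower() for t in TOKEN_RE.findall(" ".join(parts))]


def load_corpus(labelled_dirs):
    """Yield (label, tokens) for each file under every (label, directory) pair.

    Passing several pairs is one simple way to combine smaller corpora.
    """
    for label, directory in labelled_dirs:
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                yield label, tokenize(path.read_bytes())
```

A call such as `list(load_corpus([("ham", "corpus/ham"), ("spam", "corpus/spam")]))`
would then give one labelled token list per message, where the two paths are
placeholders for wherever the messages were unpacked.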

Henry

P.S.  Who is your advisor at UWash?

Vaishnavi Sannidhanam wrote:

>Hi Theo,
>
>I got a SpamAssassin corpus that had ~3500 ham and spam messages in it. I
>was wondering if I could get a larger corpus, or a number of
>smaller corpora that I can put together to get a bigger corpus. Please let
>me know if I can get it from somewhere.
>
>Thank you very much; I really appreciate all your help,
>Vaishnavi
>
>-----Original Message-----
>From: Theo Van Dinter [mailto:felicity@kluge.net]
>Sent: Friday, November 26, 2004 3:50 PM
>To: dev@spamassassin.apache.org
>Subject: Re: Spam assassin corpus
>
>
>On Fri, Nov 26, 2004 at 02:06:05PM -0800, Vaishnavi Sannidhanam wrote:
>
>
>>I am a student at the University of Washington and I am doing a project on
>>classifying spam. I was wondering where I could find the SpamAssassin
>>corpus of ham and spam mails, and where I could also find some tools to
>>process these mails.
>>
>>
>
>Hi.
>
>Unfortunately there is no single "SpamAssassin corpus".  All of the people
>involved in development (including the folks who help out with score
>generation and testing) each have their own private corpus of messages. The
>tools (specifically mass-check) under the "masses" directory (see the
>tarball) are used to generate logs from the corpus specifying the messages
>processed and the results from the processing (namely what rules hit).
>
>That information is then used to generate the scores, determine which rules
>are worth keeping during development, etc.
>
>There is some more information available at:
>
>http://wiki.apache.org/spamassassin/DevelopmentStuff
>
>
>

RE: Spam assassin corpus

Posted by Vaishnavi Sannidhanam <va...@cs.washington.edu>.
Hi Theo,

I got a SpamAssassin corpus that had ~3500 ham and spam messages in it. I
was wondering if I could get a larger corpus, or a number of
smaller corpora that I can put together to get a bigger corpus. Please let
me know if I can get it from somewhere.

Thank you very much; I really appreciate all your help,
Vaishnavi

-----Original Message-----
From: Theo Van Dinter [mailto:felicity@kluge.net] 
Sent: Friday, November 26, 2004 3:50 PM
To: dev@spamassassin.apache.org
Subject: Re: Spam assassin corpus


On Fri, Nov 26, 2004 at 02:06:05PM -0800, Vaishnavi Sannidhanam wrote:
> I am a student at the University of Washington and I am doing a project on
> classifying spam. I was wondering where I could find the SpamAssassin
> corpus of ham and spam mails, and where I could also find some tools to
> process these mails.

Hi.

Unfortunately there is no single "SpamAssassin corpus".  All of the people
involved in development (including the folks who help out with score
generation and testing) each have their own private corpus of messages. The
tools (specifically mass-check) under the "masses" directory (see the
tarball) are used to generate logs from the corpus specifying the messages
processed and the results from the processing (namely what rules hit).

That information is then used to generate the scores, determine which rules
are worth keeping during development, etc.

There is some more information available at:

http://wiki.apache.org/spamassassin/DevelopmentStuff

-- 
Randomly Generated Tagline:
Two-hundred-thirty-nine pounds?!  I'm a blimp!  Why are all the good  things
so tasty?
 
 		-- Homer Simpson
 		   Brush With Greatness


Re: Spam assassin corpus

Posted by Theo Van Dinter <fe...@kluge.net>.
On Fri, Nov 26, 2004 at 02:06:05PM -0800, Vaishnavi Sannidhanam wrote:
> I am a student at the University of Washington and I am doing a project on
> classifying spam. I was wondering where I could find the SpamAssassin
> corpus of ham and spam mails, and where I could also find some tools to
> process these mails.

Hi.

Unfortunately there is no single "SpamAssassin corpus".  All of the people
involved in development (including the folks who help out with score
generation and testing) each have their own private corpus of messages.
The tools (specifically mass-check) under the "masses" directory (see
the tarball) are used to generate logs from the corpus specifying the
messages processed and the results from the processing (namely what rules
hit).

That information is then used to generate the scores, determine which rules
are worth keeping during development, etc.
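
As a rough illustration of how per-message rule hits can inform which rules
are worth keeping, the Python sketch below tallies how often each rule fires
on spam versus ham. It does not read real mass-check logs and is not how
SpamAssassin generates scores. The `(is_spam, rules)` record shape, the rule
names, and the function name are assumptions made for this example.

```python
from collections import Counter


def rule_hit_rates(records):
    """Map each rule to (fraction of spam it hit, fraction of ham it hit).

    `records` is an iterable of (is_spam, rules) pairs, a hypothetical
    in-memory stand-in for one log entry per processed message.
    """
    spam_total = ham_total = 0
    spam_hits, ham_hits = Counter(), Counter()
    for is_spam, rules in records:
        unique_rules = set(rules)  # count a rule at most once per message
        if is_spam:
            spam_total += 1
            spam_hits.update(unique_rules)
        else:
            ham_total += 1
            ham_hits.update(unique_rules)

    rates = {}
    for rule in sorted(set(spam_hits) | set(ham_hits)):
        rates[rule] = (
            spam_hits[rule] / spam_total if spam_total else 0.0,
            ham_hits[rule] / ham_total if ham_total else 0.0,
        )
    return rates
```

A rule that hits a large share of spam and almost no ham looks useful under
this tally, while one that hits both about equally tells you little.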

There is some more information available at:

http://wiki.apache.org/spamassassin/DevelopmentStuff

-- 
Randomly Generated Tagline:
Two-hundred-thirty-nine pounds?!  I'm a blimp!  Why are all the good
 things so tasty?
 
 		-- Homer Simpson
 		   Brush With Greatness