Posted to users@spamassassin.apache.org by Andy Firman <an...@firman.us> on 2004/11/29 22:01:02 UTC

sa-learn on a 15,000 email mbox file?

I just started using SpamAssassin 3.0 and am very
impressed with it.  Recently, on an old server that I
just started to manage, I found a spam-infested
mbox spool file with 15,000 spams in it (52MB).
Nobody had checked the mailbox in about 10 months.

Is it a good idea to run sa-learn on this giant spam
mbox file on other servers that I get SA 3.0 installed on?

Or no?


Re: sa-learn on a 15,000 email mbox file?

Posted by Nix <ni...@esperi.org.uk>.
On Mon, 29 Nov 2004, snowjack@fastmail.fm moaned:
> Unless the address has never been used by a real person, you should
> manually check each message to see whether it's spam. Personally, I
> never have the endurance to check more than about 500 messages at a
> shot. So I'd just cut it into files of a size I could manually verify
> without bleeding from the eyes, delete any hammy-looking stuff I find in
> each file as I go through it, and then save the verified files and use
> those for bayes training.

I've always validated things like that by mass-checking the mailbox and
manually checking the stuff close to the spam/ham boundary line, on the
basis that SA is pretty much *never* wrong for very-high-scoring things
being spam --- well, maybe it is for particularly atrocious newsletters
or something, but my users don't get any such abominations.

It still means a good few manual checks, but checking a hundred-odd mails
is a hell of a lot easier than checking tens of thousands.
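
The boundary check described above can be sketched as follows. This is only an illustration, and it assumes the messages have already been run through SpamAssassin so that each carries an X-Spam-Status header of the form "Yes, score=12.3 required=5.0 ...". The function name needs_manual_check and the margin of 3 points are made up for the example and are not part of SpamAssassin.

```python
import re
from email.message import Message

# Matches "score=12.3 required=5.0"; older releases wrote "hits=" instead of "score=".
STATUS_RE = re.compile(
    r"(?:score|hits)=(-?\d+(?:\.\d+)?)\s+required=(-?\d+(?:\.\d+)?)"
)


def needs_manual_check(message: Message, margin: float = 3.0) -> bool:
    """Return True if the message should be looked at by a human.

    A message is flagged when its score lies within `margin` points of the
    required threshold, or when no usable X-Spam-Status header is present.
    `margin` is an arbitrary illustrative value.
    """
    status = message.get("X-Spam-Status")
    if status is None:
        return True  # never scanned, so nothing can be assumed about it
    match = STATUS_RE.search(str(status))
    if match is None:
        return True  # header present but not in the assumed format
    score = float(match.group(1))
    required = float(match.group(2))
    return abs(score - required) <= margin
```

When the whole mailbox is going to be learned as spam, it may also be worth reviewing anything that scores well below the threshold, since that is where stray ham would most likely sit.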

-- 
`The sword we forged has turned upon us
 Only now, at the end of all things do we see
 The lamp-bearer dies; only the lamp burns on.'

Re: sa-learn on a 15,000 email mbox file?

Posted by sn...@fastmail.fm.
On Mon, 29 Nov 2004 12:01:02 -0900, "Andy Firman" <an...@firman.us> said:
> I just started using SpamAssassin 3.0 and am very
> impressed with it.  Recently, on an old server that I
> just started to manage, I found a spam-infested
> mbox spool file with 15,000 spams in it (52MB).
> Nobody had checked the mailbox in about 10 months.
> 
> Is it a good idea to run sa-learn on this giant spam
> mbox file on other servers that I get SA 3.0 installed on?
> 
> Or no?
> 

Unless the address has never been used by a real person, you should
manually check each message to see whether it's spam. Personally, I
never have the endurance to check more than about 500 messages at a
shot. So I'd just cut it into files of a size I could manually verify
without bleeding from the eyes, delete any hammy-looking stuff I find in
each file as I go through it, and then save the verified files and use
those for bayes training.
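
One possible way to do the cutting step, sketched with Python's standard mailbox module. The function name split_mbox, the chunk file naming, and the default of 500 messages per file are illustrative assumptions. The script only copies messages into smaller mbox files; the manual review still has to happen afterwards.

```python
import mailbox
from pathlib import Path


def split_mbox(src, out_dir, chunk_size=500):
    """Copy messages from the mbox at `src` into numbered mbox files.

    Each output file holds at most `chunk_size` messages. The source
    mailbox is only read, never modified. Returns the list of paths
    that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    source = mailbox.mbox(str(src), create=False)
    paths = []
    chunk = None
    try:
        for index, message in enumerate(source):
            if index % chunk_size == 0:
                # Start a new chunk file every `chunk_size` messages.
                if chunk is not None:
                    chunk.close()
                path = out_dir / f"chunk-{len(paths) + 1:04d}.mbox"
                chunk = mailbox.mbox(str(path))
                paths.append(path)
            chunk.add(message)
    finally:
        if chunk is not None:
            chunk.close()
        source.close()
    return paths
```

For example, split_mbox("spool.mbox", "chunks") would write chunks/chunk-0001.mbox, chunks/chunk-0002.mbox and so on, each small enough to page through in a mail client.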

It would be safe to do what you propose if the account is one that you
are certain will never receive legit mail, but old mail accounts *will*
still get the occasional legit message. "Hey Bob, why haven't I heard
from you in the past eight months? Here's all our new customer info..."

For ongoing Bayes training, I have two IMAP folders that I copy messages
into, one for ham and one for spam. Any spams scoring less than 10 get
manually copied into the spam folder (the rest of the spam is rejected
at the mail gateway). Periodically I run through a bunch of recent ham
and copy it into the ham folder. A nightly script cleans out those IMAP
folders, runs sa-learn on the messages, and copies them into ham/spam
folders on the server, so I can use those if I need a corpus of manually
verified messages.
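
The sa-learn step of such a nightly job might look roughly like the sketch below. It is not the actual script described above: it assumes the verified messages have already been exported to mbox files, and the helper names sa_learn_command and learn are invented for the example. The --ham, --spam and --mbox options are standard sa-learn flags.

```python
import subprocess


def sa_learn_command(kind, mbox_path):
    """Build the sa-learn command line for one verified mbox file.

    `kind` must be "ham" or "spam"; anything else is rejected so that a
    typo cannot silently train the wrong way.
    """
    if kind not in ("ham", "spam"):
        raise ValueError(f"kind must be 'ham' or 'spam', got {kind!r}")
    return ["sa-learn", f"--{kind}", "--mbox", str(mbox_path)]


def learn(kind, mbox_path):
    """Run sa-learn on the file and raise if it exits with an error."""
    subprocess.run(sa_learn_command(kind, mbox_path), check=True)
```

A cron job could then call learn("spam", ...) and learn("ham", ...) on the two exported files before archiving them into the verified corpus.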
--
  
  snowjack(a)fastmail.fm


Re: sa-learn on a 15,000 email mbox file?

Posted by Jim Maul <jm...@elih.org>.
Andy Firman wrote:
> I just started using SpamAssassin 3.0 and am very
> impressed with it.  Recently, on an old server that I
> just started to manage, I found a spam-infested
> mbox spool file with 15,000 spams in it (52MB).
> Nobody had checked the mailbox in about 10 months.
> 
> Is it a good idea to run sa-learn on this giant spam
> mbox file on other servers that I get SA 3.0 installed on?
> 
> Or no?
> 
> 
> 


I would think it would be a good set of spam to learn from.  However,
with 15k messages in the box, the chances of having at least *some* ham
in there are pretty good, so either you sift through the 15k messages
and make sure it is indeed all spam, or just learn it all and take the
chance of causing some bayes weirdness in the future (false positives).

-Jim