Posted to users@spamassassin.apache.org by Mike Samba <sa...@astroshapes.com> on 2004/02/02 22:35:36 UTC

Bayes & Ham

This might sound more than a little stupid, but...

I am looking into implementing Bayes filtering and have stockpiled a TON 
of Spam to train with.  Where are you getting an equal amount of Ham to 
train with?  I administer an email domain, but only have access to my 
own mail (ethically).  What are your suggestions on rounding up 1000 or 
so Ham messages from my users so that it is not too intrusive or 
annoying for the user?

Any suggestions would be great!!!

Mike

Re: Bayes & Ham

Posted by sp...@sasknow.com.
Matt Kettler wrote to Mike Samba and spamassassin-users@incubator.apache.org:

> FWIW I use a combination of two sources for HAM training:
>
> 1) some selected chunks of my own email (i.e., mailing lists not
> involving SA, personal email, etc.)
>
> 2) I set up a "nonspamtrap" account, and I've subscribed this to a few
> of the newsletters my users commonly subscribe to.

Good sources. We provide "spam" and "nonspam" accounts for our more
proactive clients to forward spam and ham, particularly messages that
were incorrectly classified. As long as they're instructed to forward
such messages as attachments, the original messages (the attachments)
come through unaltered.
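
A minimal sketch of why the "as attachments" part matters, assuming the
intake is scripted with Python's standard email package: a message forwarded
as an attachment travels as a message/rfc822 part, so the original can be
pulled back out with its own headers. The names extract_forwarded and
build_sample_forward and the example.com addresses are hypothetical, used
only for illustration; this is not a description of any setup in this thread.

```python
from email import message_from_bytes
from email.message import EmailMessage


def extract_forwarded(raw_bytes):
    """Return the original messages attached (as message/rfc822) to a forward."""
    outer = message_from_bytes(raw_bytes)
    originals = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # For message/rfc822 parts the payload is a one-element list
            # holding the attached message as its own Message object.
            payload = part.get_payload()
            if isinstance(payload, list) and payload:
                originals.append(payload[0])
    return originals


def build_sample_forward():
    """Build a made-up forward with one attached original, for demonstration."""
    original = EmailMessage()
    original["From"] = "sender@example.com"
    original["To"] = "user@example.com"
    original["Subject"] = "Lunch on Friday?"
    original.set_content("Are you free at noon?")

    forward = EmailMessage()
    forward["From"] = "user@example.com"
    forward["To"] = "nonspam@example.com"
    forward["Subject"] = "Fwd: misclassified ham"
    forward.set_content("This one was tagged as spam by mistake.")
    forward.add_attachment(original)  # attached as message/rfc822
    return forward.as_bytes()


if __name__ == "__main__":
    for msg in extract_forwarded(build_sample_forward()):
        print(msg["From"], "|", msg["Subject"])
```

Each extracted message could then be written out to a ham or spam folder
for training. An inline forward, by contrast, leaves only the forwarder's
headers on the message, which is presumably why the attachment instruction
is there.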

I'm fortunate enough to personally own a domain that is now very close
in spelling (same name, different TLD) to a domain used by a large ISP
in our region. After seeing the postmaster logs on our email server, I
set up an account to catch all of the incoming email on my domain. There
are enough mistyped addresses that I get several hundred messages per day
for different recipients, including ham, spam, and viruses. It's the closest
thing to broadly varied user email that we can get without violating our
own privacy policy.

I have a staff member (otherwise known as our Resident SpamQueen) go
through that, as well as our shared email boxes (sales, support, etc),
and train the filter. She has no problem finding 1000+ SPAM and HAM
weekly. It's done wonders for our filtering.

If we didn't have such a good source of email, I guess I'd ask a small
percentage of our customers to *voluntarily* allow us to use their
accounts to train the filter... at which point we could just have the
server FCC all of their messages to another shared mailbox on our system
for our bodacious SpamQueen to traverse. That's trivial to implement on
most systems.

Yes, filtering can be configured on a per-user basis, but we chose to
make it as simple for our clients (and as simple for us) as possible,
and go site-wide. So, the filtering may not be quite as precise, but at
least *we* control the QoS, and we err on the side of caution.

It's worked remarkably well. We've been sustaining about 95% correctly
filtered, with no false positives. Server-wide, our HAM:SPAM ratio is
about 1.5:1. With many personal accounts, though, it's more like 1:15
(90-95% SPAM), after viruses are taken out of the equation (but that's
another tangent). We'd be sunk without SpamAssassin.
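
To make the ratio arithmetic explicit, here is a small Python sketch that
converts a HAM:SPAM ratio into the fraction of mail that is spam. The
function name spam_share is made up for illustration; the two ratios are
the ones quoted above.

```python
def spam_share(ham, spam):
    """Fraction of all mail that is spam, given the two parts of a HAM:SPAM ratio."""
    return spam / (ham + spam)


if __name__ == "__main__":
    # Server-wide ratio of 1.5:1 (HAM:SPAM)
    print(f"server-wide: {spam_share(1.5, 1):.1%} spam")
    # Personal-account ratio of 1:15 (HAM:SPAM)
    print(f"personal:    {spam_share(1, 15):.1%} spam")
```

A 1:15 ratio works out to 15/16 of the mail being spam, which sits inside
the 90-95% range given above, while 1.5:1 means 1/2.5 of the mail is spam.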

- Ryan

-- 
  Ryan Thompson <ry...@sasknow.com>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America


Re: Bayes & Ham

Posted by Matt Kettler <mk...@evi-inc.com>.
FWIW I use a combination of two sources for HAM training:

1) some selected chunks of my own email (i.e., mailing lists not involving 
SA, personal email, etc.)

2) I set up a "nonspamtrap" account, and I've subscribed this to a few of 
the newsletters my users commonly subscribe to.

Note that an equal amount of spam and ham isn't required, and it isn't 
optimal either, so don't kill yourself trying to make the numbers match 
exactly. Just avoid a huge imbalance. The optimal training set would have 
the same spam/ham ratio that your server sees in reality.
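
A rough sketch of that idea in Python, assuming the training messages are
already collected in two lists and that you have an estimate of the ham
count per spam message your server sees. The function balance_corpus, its
parameters, and the numbers in the demo are all hypothetical.

```python
import random


def balance_corpus(spam, ham, ham_per_spam, seed=0):
    """Downsample the over-represented side so that
    len(ham) / len(spam) is roughly ham_per_spam (which must be > 0)."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    target_spam = int(len(ham) / ham_per_spam)
    if target_spam <= len(spam):
        # Plenty of spam on hand: keep all ham, take a random subset of spam.
        return rng.sample(spam, target_spam), list(ham)
    # Otherwise spam is the scarce side: keep all spam, trim the ham.
    target_ham = int(len(spam) * ham_per_spam)
    return list(spam), rng.sample(ham, target_ham)


if __name__ == "__main__":
    # Made-up corpus: a large spam stockpile and about 1000 ham messages.
    spam = [f"spam-{i}" for i in range(5000)]
    ham = [f"ham-{i}" for i in range(1000)]
    train_spam, train_ham = balance_corpus(spam, ham, ham_per_spam=1.5)
    print(len(train_spam), len(train_ham))
```

This only picks which messages to train on; the selected sets would still
need to be fed to the Bayes learner separately. Since the ratio doesn't
have to match exactly, an approximate estimate of ham_per_spam should be
good enough.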


At 04:35 PM 2/2/2004, Mike Samba wrote:
>This might sound more than a little stupid, but...
>
>I am looking into implementing Bayes filtering and have stockpiled a TON 
>of Spam to train with.  Where are you getting an equal amount of Ham to 
>train with?  I administer an email domain, but only have access to my own 
>mail (ethically).  What are your suggestions on rounding up 1000 or so Ham 
>messages from my users so that it is not too intrusive or annoying for the 
>user?
>
>Any suggestions would be great!!!
>
>Mike