Posted to users@spamassassin.apache.org by Jason Wellman <ja...@gmail.com> on 2006/11/03 04:42:01 UTC

sa-learn training question(s)

Hey all,

Recently my domain came under a 'Spam attack', as my users are calling it: we
have been flooded with hundreds of Spam messages. :(  So over the last week
I have been setting up SA (3.1.3) along with Amavis, ClamAV, postfix and
dovecot.  Just out of the box we have noticed a huge drop in Spam but I do
have a couple of questions that I have not been able to find good answers to
yet.

First, I am using all the default SA settings, including those for
autolearning.  I have all incoming mail that is tagged as Spam delivered to
a "CaughtSpam" IMAP box for each user.  I also have a pretty nice little
script I tossed together to sa-learn from a "IsSpam" folder that the users
put Spam that is missed into.  It also learns ham from a folder called
"IsNotSpam" for when a message is marked as Spam but is not.

Should I also have sa-learn from the "CaughtSpam" folder?  I have read some
places that say yes, and some that say no.

Second question.  It is easy to tell a user (and some of mine are non-tech
folks) to put Spam in the "IsSpam" folder, but there isn't a way to really
tell them that they need to put HAM in a certain folder, they just don't
understand it.  So my second question is how are people feeding sa-learn
good HAM?  I was toying with the idea of feeding in people's "Sent" folders
along with all messages from their "INBOX" and "Trash" that were marked as
read (I can pull these out using mboxgrep).  This would also give me a
larger sample of HAM than Spam, which I understand is a good thing.  Can
anyone poke holes in my logic on this, or point out a better source for me
to scrape HAM to feed sa-learn?

Many thanks in advance for any help. :)

- J

Re: sa-learn training question(s)

Posted by Matt Kettler <mk...@verizon.net>.
Jason Wellman wrote:
> Hey all,
>
> Recently my domain came under a 'Spam attack', as my users are calling
> it: we have been flooded with hundreds of Spam messages. :(  So over
> the last week I have been setting up SA (3.1.3) along with Amavis,
> ClamAV, postfix and dovecot.  Just out of the box we have noticed a
> huge drop in Spam but I do have a couple of questions that I have not
> been able to find good answers to yet.
>
> First, I am using all the default SA settings, including those for
> autolearning.  I have all incoming mail that is tagged as Spam
> delivered to a "CaughtSpam" IMAP box for each user.  I also have a
> pretty nice little script I tossed together to sa-learn from a
> "IsSpam" folder that the users put Spam that is missed into.  It also
> learns ham from a folder called "IsNotSpam" for when a message is
> marked as Spam but is not.
>
> Should I also have sa-learn from the "CaughtSpam" folder?  I have read
> some places that say yes, and some that say no.
YES. Those that say no clearly do not know what they're talking about.

Let's face it: if there were no point in learning tagged spam, why would
the autolearner kick in only on high-scoring spam?

That said, sa-learn will only learn the "caught" spam that wasn't already
autolearned, but this is actually quite valuable, as that mail will
generally include more of the "borderline" spam, which is important for
bayes to know about.
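
As a rough illustration of the folder setup discussed above, here is a
minimal Python sketch that builds one sa-learn command per training folder.
It is only a sketch under assumptions: the folders are taken to be mbox
files under a per-user directory, and the function name, the mail root and
the user name are made up for the example. It only builds the argument
lists; nothing is run until you pass them to something like subprocess.

```python
from pathlib import PurePosixPath
from typing import List

# Folder names come from the thread; the class given to each one follows
# the advice above (missed spam and caught spam are both learned as spam).
FOLDER_FLAGS = [
    ("IsSpam", "--spam"),      # spam the filter missed, moved by the user
    ("CaughtSpam", "--spam"),  # spam the filter tagged on delivery
    ("IsNotSpam", "--ham"),    # false positives rescued by the user
]


def sa_learn_commands(user: str, mail_root: str) -> List[List[str]]:
    """Build sa-learn argument lists for one user's training folders.

    Assumes (hypothetically) that each folder is an mbox file at
    <mail_root>/<user>/<folder>; adjust for your own mail layout.
    """
    commands = []
    for folder, flag in FOLDER_FLAGS:
        path = PurePosixPath(mail_root) / user / folder
        # --no-sync defers the database sync until every folder is learned.
        commands.append(["sa-learn", "--no-sync", flag, "--mbox", str(path)])
    # One sync at the end instead of one per folder.
    commands.append(["sa-learn", "--sync"])
    return commands
```

Each list could then be handed to subprocess.run(cmd, check=True) from a
cron job. Messages that were already learned (by autolearn or by an earlier
run) are skipped by sa-learn, so re-feeding "CaughtSpam" should be harmless.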
>
> Second question.  It is easy to tell a user (and some of mine are
> non-tech folks) to put Spam in the "IsSpam" folder, but there isn't a
> way to really tell them that they need to put HAM in a certain folder,
> they just don't understand it.  So my second question is how are
> people feeding sa-learn good HAM?
That depends a lot on the user. Some are good, some not so good. Most
will generally do this only when they're getting FPs, but that's still
handy.
> I was toying with the idea of feeding in people's "Sent" folders along
> with all messages from their "INBOX" and "Trash" that were marked as
> read (I can pull these out using mboxgrep).  This would also give me a
> larger sample of HAM than Spam, which I understand is a good thing.
> Can anyone poke holes in my logic on this, or point out a better
> source for me to scrape HAM to feed sa-learn?
Well, if you do inbox and trash, you'll learn as ham any false negatives
that your users happened to read and did not move to "IsSpam". If you
don't trust them to force-feed good ham, this might not be a good idea.

Sent would appear to be fine, unless your users are really dumb and
frequently reply to spam.
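
If replies to spam are a worry, one possible guard is to drop any sent
message whose In-Reply-To header points at a message already sitting in
"IsSpam" or "CaughtSpam". The Python sketch below shows the idea; it is an
assumption-laden illustration, not something SpamAssassin does for you. The
function names are made up, the folders are assumed to be mbox files, and
it only catches replies whose mail client set In-Reply-To.

```python
import email.message
import mailbox
from typing import Iterable, List, Set


def read_mbox(path: str) -> List[email.message.Message]:
    """Load every message from an existing mbox file (path is up to you)."""
    return list(mailbox.mbox(path, create=False))


def spam_message_ids(spam: Iterable[email.message.Message]) -> Set[str]:
    """Collect the Message-ID header of every known spam message."""
    return {m["Message-ID"].strip() for m in spam if m["Message-ID"]}


def sent_ham_candidates(
    sent: Iterable[email.message.Message], known_spam_ids: Set[str]
) -> List[email.message.Message]:
    """Keep sent messages that are not direct replies to known spam."""
    keep = []
    for msg in sent:
        parent = (msg["In-Reply-To"] or "").strip()
        if parent and parent in known_spam_ids:
            continue  # the user answered a spam message; do not learn as ham
        keep.append(msg)
    return keep
```

The surviving messages could be written to a temporary mbox and fed to
sa-learn --ham --mbox, leaving the user's real "Sent" folder untouched.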
>  
> Many thanks in advance for any help. :)
>
> - J