Posted to users@spamassassin.apache.org by Christopher Scott <ch...@chrisjscott.net> on 2013/06/24 04:41:58 UTC

trying to get Spamassassin to really sing for me

I've got an IMAP account with a small ISP that uses CPanel (and, through it, SpamAssassin 3.3.1). I've been struggling with spam for a while now, partially hampered by my inability to access a command shell in order to manipulate SA. As a result, my solution was to set my required_score = 0 (and I auto delete anything > 5) and then whitelist all of the email addresses I commonly receive messages from.

This somewhat brute force method doesn't work so well anyhow, so I figured I'd finally try to wrap my head around using sa-learn (after figuring out a way to execute it via cronjob, since I don't have command line access).

I've spent the past 4-6 weeks filing away any messages that have made it to my inbox that should've been spam; I now have about 1000. The problem is that I've read that, in order for sa-learn to work effectively, I should be running it over equal collections of spam AND ham. I've got auto-learn turned on, so I'm presuming that running sa-learn for ham on my inbox won't do any good… am I going to screw everything up if I run sa-learn for spam in my collected folder and that's it?

Any suggestions for a smart course of action here?

Follow-up question: what would be a smart way to wean myself off of the whitelisted entries and, instead, get the Bayes filters to recognize them?

-Chris

Re: trying to get Spamassassin to really sing for me

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
On 23.06.13 22:41, Christopher Scott wrote:

> [deleted] my solution was to set my required_score = 0
> (and I auto delete anything > 5) and then whitelist all of the email
> addresses I commonly receive messages from.

>This somewhat brute force method doesn't work so well anyhow,

... it could not even work. The SA rules are tuned so that a score of 5 points
separates spam from ham. Slightly tuning required_score can help (at my former
job I set it to 4.8, and on my personal machine with a trained BAYES to 3.5),
but you must be very careful about that.

>I've spent the past 4-6 weeks filing away any messages that have made it to
> my inbox that should've been spam; I now have about 1000.  The problem is
> that I've read that, in order for sa-learn to work effectively, I should
> be running it over equal collections of spam AND ham.

Yes, you need to train on both spam and ham. The BAYES filter must know how
they differ in order to decide which group a message most likely belongs to.
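
To make the point concrete, here is a toy sketch of a per-token spam estimate.
It is NOT SpamAssassin's actual Bayes implementation; the function names and
the sample messages are made up for illustration only. It shows that with one
class untrained there is nothing to compare a token against.

```python
from collections import Counter


def train(messages):
    """Count, per token, how many messages contain it (toy tokenizer)."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Simplified estimate of P(spam | token); None if a class is untrained."""
    if n_spam == 0 or n_ham == 0:
        return None  # only one kind of mail seen: no basis for comparison
    spam_rate = spam_counts[token] / n_spam
    ham_rate = ham_counts[token] / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen in either class: neutral
    return spam_rate / (spam_rate + ham_rate)


# Made-up sample data, two messages per class.
spam = ["cheap pills online", "cheap watches online"]
ham = ["meeting notes attached", "cheap flights for the meeting"]
spam_counts, ham_counts = train(spam), train(ham)
```

With this sample data "pills" looks strongly spammy and "meeting" strongly
hammy, while "cheap" lands in between because it appears in both groups.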

>  I've got auto-learn
> turned on, so I'm presuming that running sa-learn for ham on my inbox
> won't do any good… am I going to screw everything up if I run sa-learn for
> spam in my collected folder and that's it?

Auto-learn will only train on messages that are spam or ham with high
probability, and even so, from time to time we get reports from users about
a mistrained BAYES database.
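
As a rough illustration of "high probability", the sketch below mimics the
auto-learn decision with the documented default thresholds
(bayes_auto_learn_threshold_nonspam 0.1 and bayes_auto_learn_threshold_spam
12.0). It is a simplification: the real check applies further conditions and
uses an adjusted score, and the function name here is hypothetical.

```python
def autolearn_decision(score, ham_threshold=0.1, spam_threshold=12.0):
    """Return "ham", "spam", or None (do not learn) for a message score.

    Simplified: only the two score thresholds are modeled.
    """
    if score < ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return None  # the wide middle range is never auto-learned
```

A message scoring 5.0 is tagged as spam with the default required_score, yet
it falls in the middle range here and would not be auto-learned at all.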

However, if your inbox contains only messages you consider ham, you may
safely train the whole inbox as ham.

Likewise, if your spam folder contains only spam messages, train it as spam.
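
Since there is no shell access, a small script started from cron could do
both runs. This is only a sketch: the two folder paths are hypothetical
placeholders and must be replaced with the real mail directories on the
account, and it assumes sa-learn is on the PATH of the cron job. It prints the
commands by default and only executes them with dry_run=False.

```python
import subprocess

# Hypothetical Maildir locations; adjust to the account's real layout.
HAM_DIR = "/home/user/mail/cur"
SPAM_DIR = "/home/user/mail/.Spam/cur"


def build_commands(ham_dir, spam_dir):
    """Build one sa-learn call for the ham folder and one for the spam folder."""
    return [
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--spam", spam_dir],
    ]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if sa-learn fails


if __name__ == "__main__":
    run(build_commands(HAM_DIR, SPAM_DIR))
```

Note that the Bayes rules are not used for scoring until enough of both kinds
of mail have been learned (200 spam and 200 ham by default), which can be
checked afterwards with "sa-learn --dump magic".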


You can re-train the same message if you (or autolearn) misclassified it, so
there's no issue with training repeatedly: the message content will be
remembered only once, the way you trained it the last time.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Quantum mechanics: The dreams stuff is made of.