Posted to users@spamassassin.apache.org by Marios Titas <re...@gmail.com> on 2011/10/24 00:35:02 UTC

Getting started with Bayesian filtering

Hi all,

I was recently given a list of 10,000 posts from an internet forum.
Out of those, 9,000 had been approved by the site's moderators and the
remaining were rejected. I was wondering if I could use this data set
to play with Bayesian filtering in spamassassin. I tried the
following: I converted all posts to emails and then I used sa-learn
with --ham and --spam to train spamassassin. This seems to have worked
fine (it produced some files in ~/.spamassassin, having a total size
of 1MB). Now I am trying to test some new posts to see if spamassassin
thinks they should be approved or not. I have written the following
Perl code:

    my $spamassassin=Mail::SpamAssassin->new({
        require_rules      => 1,
        local_tests_only   => 1,
        userprefs_filename => "$ENV{HOME}/.spamassassin/user_prefs",
        userstate_dir      => "$ENV{HOME}/.spamassassin",
        rules_filename     => "$ENV{HOME}/.spamassassin/user_prefs",
    });
    my $status = $spamassassin->check($post);
    print $status->get_score,"\n";

When I run this, it always returns zero. Here's what my
~/.spamassassin/user_prefs looks like:

    required_score    5
    use_learner       1
    use_bayes         1
    use_bayes_rules   1
    bayes_auto_learn  0
    allow_user_rules  1
    score BAYES_05    9

Could someone give me any pointers on how to make this work? All I
want is to be able to use Bayesian filtering and Bayesian filtering
alone, without any other rules. Is there any document that describes
how to do something like that? All I could find are documents
describing how to make spamassassin work with other programs like
procmail/qmail etc.
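
For readers wondering what "converted all posts to emails" might involve, here is a minimal sketch in Perl. It is not the code used in this thread. The function name post_to_email, the header set, and the example.invalid addresses are all placeholders invented for illustration. The only real requirement is a header block, a blank line, and then the body.

```perl
use strict;
use warnings;

# Wrap one forum post in a minimal email-style message so that tools
# expecting mail (such as sa-learn) can read it. The header values are
# placeholders invented for this sketch, not real addresses.
sub post_to_email {
    my ($id, $subject, $body) = @_;
    $subject =~ s/[\r\n]+/ /g;          # a header value must stay on one line
    $body    =~ s/\r?\n/\n/g;           # normalize line endings
    $body   .= "\n" unless $body =~ /\n\z/;
    return join("\n",
        "From: forum-post\@example.invalid",
        "To: moderators\@example.invalid",
        "Subject: $subject",
        "Message-ID: <post-$id\@example.invalid>",
        "",                             # blank line separates headers from body
        $body);
}

# Hypothetical post, for illustration only.
print post_to_email(42, "Hello\nworld", "First line\r\nSecond line");
```

Each generated message could then be written to its own file, with approved posts in one directory and rejected posts in another, so that sa-learn --ham and sa-learn --spam can each be pointed at the matching directory.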

Re: Getting started with Bayesian filtering

Posted by da...@chaosreigns.com.

On 10/23, Marios Titas wrote:
>     my $spamassassin=Mail::SpamAssassin->new({
>         require_rules      => 1,

I have no experience using SA this way.  I'd start with trying to get it to
work with the default configuration, from the command line, not through
this API.

>         rules_filename     => "$ENV{HOME}/.spamassassin/user_prefs",

First thing I'd try is just commenting out that line.  I bet it'll work.

>     score BAYES_05    9

It looks like you're not loading the full default ruleset, which is
probably good.  But it looks like then you're not defining the BAYES
rules, which you need - just setting that one score for an undefined rule.
I bet if you ran spamassassin --lint with those same settings
(rules_filename, etc.), it would complain about you setting a score for an
undefined rule.

Grep the default rules for BAYES:

body BAYES_00           eval:check_bayes('0.00', '0.01')
body BAYES_05           eval:check_bayes('0.01', '0.05')
body BAYES_20           eval:check_bayes('0.05', '0.20')
body BAYES_40           eval:check_bayes('0.20', '0.40')
body BAYES_50           eval:check_bayes('0.40', '0.60')
body BAYES_60           eval:check_bayes('0.60', '0.80')
body BAYES_80           eval:check_bayes('0.80', '0.95')
body BAYES_95           eval:check_bayes('0.95', '0.99')
body BAYES_99           eval:check_bayes('0.99', '1.00')
tflags BAYES_00         nice learn
tflags BAYES_05         nice learn
tflags BAYES_20         nice learn
tflags BAYES_40         nice learn
tflags BAYES_50         learn
tflags BAYES_60         learn
tflags BAYES_80         learn
tflags BAYES_95         learn
tflags BAYES_99         learn
score BAYES_00  0  0 -1.5   -1.9
score BAYES_05  0  0 -0.3   -0.5
score BAYES_20  0  0 -0.001 -0.001
score BAYES_40  0  0 -0.001 -0.001
score BAYES_50  0  0  2.0    0.8
score BAYES_60  0  0  2.5    1.5
score BAYES_80  0  0  2.7    2.0
score BAYES_95  0  0  3.2    3.0
score BAYES_99  0  0  3.8    3.5
priority BAYES_99               -400

I wonder what "priority" does.  You probably don't want the "learn" flags.

Oh, hah, none of the BAYES scores are high enough to go above your
required_score of 5.  So you'll probably want to set your required_score to
something within the range of scores defined there, or change the scores.  
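
To make the point about required_score concrete, here is a small Perl sketch that copies the ranges and scores from the listing above and checks whether any single bucket can reach a given threshold. It assumes the third column of each score line is the relevant one (Bayes enabled, network tests disabled, which seems to match local_tests_only => 1). Treating each range as (min, max] is also an assumption of this sketch, not something confirmed in this thread.

```perl
use strict;
use warnings;

# Bayes buckets copied from the rule listing: [name, min, max, score].
# The score is the third column of each "score" line, assumed here to be
# the one used when Bayes is on and network tests are off.
my @buckets = (
    [ 'BAYES_00', 0.00, 0.01, -1.5   ],
    [ 'BAYES_05', 0.01, 0.05, -0.3   ],
    [ 'BAYES_20', 0.05, 0.20, -0.001 ],
    [ 'BAYES_40', 0.20, 0.40, -0.001 ],
    [ 'BAYES_50', 0.40, 0.60,  2.0   ],
    [ 'BAYES_60', 0.60, 0.80,  2.5   ],
    [ 'BAYES_80', 0.80, 0.95,  2.7   ],
    [ 'BAYES_95', 0.95, 0.99,  3.2   ],
    [ 'BAYES_99', 0.99, 1.00,  3.8   ],
);

# Return the bucket for a probability. Ranges are treated as (min, max],
# with 0 itself falling into the first bucket (an assumption).
sub bucket_for {
    my ($p) = @_;
    for my $bucket (@buckets) {
        my (undef, $min, $max) = @$bucket;
        return $bucket if ($min == 0 || $p > $min) && $p <= $max;
    }
    return;
}

my $required_score = 5;
my ($max_score) = sort { $b <=> $a } map { $_->[3] } @buckets;

printf "highest Bayes score: %.1f, required_score: %d\n",
    $max_score, $required_score;
print "no single bucket reaches required_score\n"
    if $max_score < $required_score;

my $hit = bucket_for(0.97);
printf "p=0.97 falls in %s (score %.1f)\n", $hit->[0], $hit->[3];
```

With Bayes as the only rule family, a message's total is the score of whichever bucket fires, so the threshold has to sit at or below the score of the lowest bucket that should count as spam.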

> Could someone give me any pointers on how to make this work? All I
> want is to be able to use Bayesian filtering and Bayesian filtering

I'd probably try a dedicated Bayesian filter, maybe spamprobe.  Although
there might not be a reason not to use SA.  Especially if you like that
API.

If you get this working, I'd appreciate it if you documented it on the SA
wiki.

-- 
"When we remember we are all mad, the mysteries of life disappear and
life stands explained." - Mark Twain
http://www.ChaosReigns.com

Re: Getting started with Bayesian filtering

Posted by Marios Titas <re...@gmail.com>.

Thanks for the suggestions. I was able after all to get spamassassin
to work by loading the relevant rules. I actually loaded the default
rule set and then removed all the NN_XXX.cf files except
10_default_prefs and 23_bayes. I also added the following line:

    add_header all Bayes "_BAYES_"

so that I can extract the probability of spam according to the Bayes algorithm.
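
As a rough illustration of that extraction step, the following Perl sketch reads the value back out of a processed message. It assumes the add_header line above yields a header named X-Spam-Bayes holding a plain decimal number. Both the header name and the value format are assumptions here, and the function name bayes_probability and the sample message are invented.

```perl
use strict;
use warnings;

# Pull the Bayes probability out of a message's header block.
# Assumes a header named "X-Spam-Bayes" carrying a plain decimal number.
sub bayes_probability {
    my ($message) = @_;
    # Headers end at the first blank line; ignore anything in the body.
    my ($headers) = split /\r?\n\r?\n/, $message, 2;
    return unless defined $headers;
    return $1 if $headers =~ /^X-Spam-Bayes:\s*([0-9]*\.?[0-9]+)\s*$/mi;
    return;
}

# Hypothetical sample message, invented for illustration.
my $sample = "Subject: test\nX-Spam-Bayes: 0.6123\n\nbody text\n";
my $p = bayes_probability($sample);
print defined $p ? "Bayes probability: $p\n" : "no Bayes header found\n";
```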

I also tried dspam and did a comparison of the two. I used the
following methodology: I divided my data set into two parts; the first
one, consisting of 90% of the posts, was used for training and the
remaining 10% was used for testing. The following is what the two
programs reported for the 10% of the posts compared to what the forum
administrators did for these posts (for both programs I assumed that a
post is classified as spam if the reported probability is 55% or
higher.):

For the posts rejected by forum moderators:
spamassassin classified 19.0% of those as spam
dspam classified 15.0% of those as spam

For the posts approved by forum moderators:
spamassassin classified 99.3% of those as ham
dspam classified 99.7% of those as ham

As you can see, the performance of both was unsatisfactory as they
both failed to correctly identify 80% to 85% of the rejected posts. It
is also interesting to note that the posts that they actually did
correctly identify as rejected were mostly posts with formatting
issues (such as posts being written in ALL CAPS which is forbidden by
forum rules) and not posts with issues related to the actual content
(such as off-topic posts). This of course is not surprising at all.

So the conclusion is that Bayesian spam filtering cannot be used for
this particular case.
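
The two percentages reported for each program can be computed from the held-out 10% along the following lines. This is only a sketch of the methodology described above (55% threshold, moderator decision as ground truth), not the script actually used. The function name evaluate and the probabilities in @test_set are made up for illustration and do not come from the real data set.

```perl
use strict;
use warnings;

# Score a held-out test set. Each entry is [moderator_label, probability].
# A post counts as spam when its probability is at or above the threshold.
sub evaluate {
    my ($threshold, @results) = @_;
    my %count = (rejected => 0, approved => 0, caught => 0, passed => 0);
    for my $r (@results) {
        my ($label, $p) = @$r;
        my $is_spam = $p >= $threshold;
        if ($label eq 'rejected') {
            $count{rejected}++;
            $count{caught}++ if $is_spam;       # rejected post flagged as spam
        } else {
            $count{approved}++;
            $count{passed}++ unless $is_spam;   # approved post left as ham
        }
    }
    return (
        rejected_as_spam => $count{rejected}
            ? 100 * $count{caught} / $count{rejected} : 0,
        approved_as_ham  => $count{approved}
            ? 100 * $count{passed} / $count{approved} : 0,
    );
}

# Hypothetical probabilities, made up for illustration only.
my @test_set = (
    [ rejected => 0.91 ], [ rejected => 0.30 ],
    [ rejected => 0.12 ], [ rejected => 0.48 ],
    [ approved => 0.02 ], [ approved => 0.10 ],
    [ approved => 0.60 ], [ approved => 0.05 ],
);

my %rate = evaluate(0.55, @test_set);
printf "rejected posts classified as spam: %.1f%%\n", $rate{rejected_as_spam};
printf "approved posts classified as ham:  %.1f%%\n", $rate{approved_as_ham};
```

Reporting the two rates separately matters here because the classes are unbalanced: with roughly nine approved posts for every rejected one, a filter that calls everything ham would still look accurate overall.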

On Mon, Oct 24, 2011 at 00:24, Henrik K <he...@hege.li> wrote:
> On Sun, Oct 23, 2011 at 06:35:02PM -0400, Marios Titas wrote:
>> Hi all,
>>
>> I was recently given a list of 10,000 posts from an internet forum.
>> Out of those, 9,000 had been approved by the site's moderators and the
>> remaining were rejected. I was wondering if I could use this data set
>> to play with Bayesian filtering in spamassassin.
>
> Why don't you just try something like dspam and its "DataSource document"
> option?  It should process non-email data just like that and probably work
> much more efficiently anyway.  SA Bayes is heavily tuned for email messages and
> their quirks.
>
> Of course it would be interesting if someone put up a comparison.
>
>

Re: Getting started with Bayesian filtering

Posted by Henrik K <he...@hege.li>.

On Sun, Oct 23, 2011 at 06:35:02PM -0400, Marios Titas wrote:
> Hi all,
> 
> I was recently given a list of 10,000 posts from an internet forum.
> Out of those, 9,000 had been approved by the site's moderators and the
> remaining were rejected. I was wondering if I could use this data set
> to play with Bayesian filtering in spamassassin.

Why don't you just try something like dspam and its "DataSource document"
option?  It should process non-email data just like that and probably work
much more efficiently anyway.  SA Bayes is heavily tuned for email messages and
their quirks.

Of course it would be interesting if someone put up a comparison.