Posted to users@spamassassin.apache.org by Harry Putnam <re...@newsguy.com> on 2013/09/16 04:53:06 UTC

Really getting discouraged... when does the learning happen?

I've been trying to `teach' SA to tell spam from ham in my mail system.

I've made it thru two main learning sessions where I ran around 450
msgs (each time) thru sa-learn spam/ham, and yet SA still gets it
right no more than about 40% of the time, maybe less.  I'm not sure
how to measure that exactly.

My incoming mail is probably no more than 10-12% ham.. maybe not even
that. So major spam is coming in.

Now, after the above-mentioned amount of training, I've run 1100
messages thru my sandbox setup... it's the last 11 messages that have
come in.

I'm using only 2 rules in procmailrc... spam and ham following the
call to SA.

Look at the (mbox style) files that resulted:
-rw-------[...] 10045521 Sep 15 22:26 ham
-rw-------[...]  6372824 Sep 15 22:26 spam

That is about:
9.6 MB ham
6.1 MB spam

So truly massive amounts of spam are STILL being seen as ham by SA.

That should be something like:
 2.0 MB ham 
13.5 MB spam
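
The expected split above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes mbox file size is a fair proxy for message count, and it takes 12% (the upper end of the estimate given earlier) as the true ham share. The function name `expected_split` is made up for illustration.

```python
# Sizes in bytes, copied from the ls listing above.
HAM_BYTES = 10045521
SPAM_BYTES = 6372824
MIB = 1024 * 1024


def expected_split(total_bytes, ham_fraction):
    """Return (ham, spam) byte counts if ham_fraction of the mail were ham."""
    ham = total_bytes * ham_fraction
    return ham, total_bytes - ham


total = HAM_BYTES + SPAM_BYTES
observed_ham_share = HAM_BYTES / total

# Assumption: 12% ham, the upper end of the "10-12%" estimate.
ham, spam = expected_split(total, 0.12)

print(f"observed: {HAM_BYTES / MIB:.1f} MiB ham, "
      f"{SPAM_BYTES / MIB:.1f} MiB spam ({observed_ham_share:.0%} ham)")
print(f"expected: {ham / MIB:.1f} MiB ham, {spam / MIB:.1f} MiB spam")
```

Message sizes vary a lot between spam and ham, so counting messages in each mbox would give a more reliable figure than comparing bytes.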

Even more aggravating is that many many of the spam msgs are just like
the messages that SA was 'trained' on.

Does this seem unreasonable enough that it must mean I'm doing this
all wrong? 

Can anyone post some figures of what to expect with default SA 3.3.2
and what to expect after some specific amount of training?

Re: Really getting discouraged... when does the learning happen?

Posted by John Hardin <jh...@impsec.org>.

On Sat, 28 Sep 2013, Bart Schaefer wrote:

> On Mon, Sep 16, 2013 at 1:38 PM, Harry Putnam <re...@newsguy.com> wrote:
>>
>> Yes, here is an example of a message rated as spam:
>>
>> X-Spam-Report: *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
>>         *      [score: 0.9999]
>
> OK, so you've got a BAYES_99 on that message, which is a pretty good
> indication that the training has worked.  However, SA's confidence in
> the Bayes algorithm is only worth about one point out of a necessary
> five, so the rest of the rules have to contribute the other (just a
> bit more than) four points, and they do not:

You're misreading that. The bayes evaluation is 0.9999 (99.99%) probability 
spam, which falls in the BAYES_99 range, and BAYES_99 adds 3.5 points.

>>         *  0.4 STOX_REPLY_TYPE STOX_REPLY_TYPE
>>         *  1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
>>         HELO
>>         *  1.8 STOX_REPLY_TYPE_WITHOUT_QUOTES STOX_REPLY_TYPE_WITHOUT_QUOTES

3.5 + .4 + 1.2 + 1.8 = 6.9

According to that, if the threshold hasn't been changed this message 
should have been considered spam.
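
The sum above can be checked mechanically. The following is a minimal sketch, not anything SpamAssassin itself provides: it pulls the per-rule points out of the quoted X-Spam-Report text and compares the total with a threshold of 5.0, which is assumed here to be unchanged from the default. The regular expression assumes rule names are upper-case letters, digits and underscores, as in this report.

```python
import re

# Report text copied from the quoted message.
REPORT = """\
X-Spam-Report: *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
        *      [score: 0.9999]
        *  0.4 STOX_REPLY_TYPE STOX_REPLY_TYPE
        *  1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
        HELO
        *  1.8 STOX_REPLY_TYPE_WITHOUT_QUOTES STOX_REPLY_TYPE_WITHOUT_QUOTES
"""

# A rule line looks like "*  3.5 RULE_NAME description". Continuation lines
# such as "*      [score: 0.9999]" have no number right after the asterisk.
RULE_LINE = re.compile(r"\*\s+(-?\d+(?:\.\d+)?)\s+([A-Z0-9_]+)")

REQUIRED = 5.0  # assumption: threshold left at the default


def rule_scores(report):
    """Return a list of (rule_name, points) found in a report string."""
    return [(name, float(points)) for points, name in RULE_LINE.findall(report)]


scores = rule_scores(REPORT)
total = round(sum(points for _, points in scores), 1)
is_spam = total >= REQUIRED

for name, points in scores:
    print(f"{points:5.1f}  {name}")
print(f"total {total} vs required {REQUIRED}: {'spam' if is_spam else 'ham'}")
```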

Agreed that hitting BAYES_99 is a good indicator that training is working.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Gun Control is nothing more than an attempt to return to feudalism,
   where the peasants are helpless and must humbly petition their lord
   and master to protect them from bandits and thieves (when they can
   get around to it), and where the lords and masters can abuse the
   peasants whenever they like without fear of effective resistance.
-----------------------------------------------------------------------
  5 days until the 9th anniversary of SpaceshipOne winning the X-prize

Re: Really getting discouraged... when does the learning happen?

Posted by Bart Schaefer <ba...@gmail.com>.

On Mon, Sep 16, 2013 at 1:38 PM, Harry Putnam <re...@newsguy.com> wrote:
>
> Yes, here is an example of a message rated as spam:
>
> X-Spam-Report: *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
>         *      [score: 0.9999]

OK, so you've got a BAYES_99 on that message, which is a pretty good
indication that the training has worked.  However, SA's confidence in
the Bayes algorithm is only worth about one point out of a necessary
five, so the rest of the rules have to contribute the other (just a
bit more than) four points, and they do not:

>         *  0.4 STOX_REPLY_TYPE STOX_REPLY_TYPE
>         *  1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
>         HELO
>         *  1.8 STOX_REPLY_TYPE_WITHOUT_QUOTES STOX_REPLY_TYPE_WITHOUT_QUOTES

This could be because the scores are tuned on the assumption that
network tests also run, and those can't be applied to your archive, or
some such.  In any case it's not the training that is failing you here.

You have a couple of choices.  You can assign your own higher score to
the BAYES_99 rule in your local spamassassin config, or you can modify
your procmail recipe to look for BAYES_99 in the filtered message and
treat messages that have it as spam even if they do not score above
the five point threshold.  Anything that's falsely BAYES_99 is
probably something you want to re-learn as ham anyway.
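
The second option (treat anything that hits BAYES_99 as spam regardless of total score) would normally be written as a procmail condition. As a language-neutral illustration of the same decision logic, here is a small Python sketch. The sample message and its X-Spam-Status line are hypothetical, and the function name `is_spam` is made up for this example.

```python
import re
from email import message_from_string


def is_spam(raw_message, override_rule="BAYES_99"):
    """Spam if SA said Yes, or if the override rule fired at any score."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    if status.startswith("Yes"):
        return True
    # Look for the rule name in either header SA may have added.
    haystack = status + "\n" + msg.get("X-Spam-Report", "")
    # \b keeps BAYES_99 from matching a longer name such as BAYES_999.
    return re.search(r"\b" + re.escape(override_rule) + r"\b", haystack) is not None


# Hypothetical message that scored below the threshold but hit BAYES_99.
SAMPLE = (
    "X-Spam-Status: No, score=3.9 required=5.0 tests=BAYES_99,STOX_REPLY_TYPE\n"
    "Subject: hypothetical example\n"
    "\n"
    "body text\n"
)

print(is_spam(SAMPLE))
```

The trade-off is that a single rule now decides the outcome, so any ham that Bayes has wrongly learned as spam will be misfiled until it is re-learned as ham.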

Re: Really getting discouraged... when does the learning happen?

Posted by Harry Putnam <re...@newsguy.com>.

Bart Schaefer <ba...@gmail.com> writes:

> On Sun, Sep 15, 2013 at 7:53 PM, Harry Putnam <re...@newsguy.com> wrote:
>> I've been trying to `teach' SA to tell spam from ham in my mail system.
>>
>> I've made it thru two main learning sessions where I ran around 450
>> msgs (each time) thru sa-learn spam/ham and yet SA is still incapable
>> of getting it right more than about 40 % or maybe less.
>
> You say you've run 1100 messages through -- have at least 200 of those
> been ham?  Bayes won't kick in until 200 *each* of spam and ham are
> trained.  You can run "sa-learn --dump magic" to see how many of each
> it believes it has seen.

Yes

> If you've sa-learned enough of both types, is it possible you haven't
> enabled bayes scoring?  Are the BAYES_* rules showing up at all in the
> score details for newly arrived messages fed through spamc?

Yes, here is an example of a message rated as spam:

X-Spam-Report: *  3.5 BAYES_99 BODY: Bayes spam probability is 99 to 100%
        *      [score: 0.9999]
        *  0.4 STOX_REPLY_TYPE STOX_REPLY_TYPE
        *  1.2 RCVD_NUMERIC_HELO Received: contains an IP address used for
        HELO
        *  1.8 STOX_REPLY_TYPE_WITHOUT_QUOTES STOX_REPLY_TYPE_WITHOUT_QUOTES

-------        ---------       ---=---       ---------      -------- 

This message is a bit disorganized but I'm experimenting all thru
this. 

Below are the message counts and the 'magic' output produced by my 2
learning sessions:

  675 msgs thru sa-learn --mbox --spam spam
  228 msgs thru sa-learn --mbox --ham  ham

Resulting in this magic output:
reader > sa-learn --dump magic 
0.000     0          3     0  non-token data: bayes db version
0.000     0        675     0  non-token data: nspam
0.000     0        214     0  non-token data: nham
0.000     0     117579     0  non-token data: ntokens
0.000     0 1369611901     0  non-token data: oldest atime
0.000     0 1374276652     0  non-token data: newest atime
0.000     0          0     0  non-token data: last journal sync atime
0.000     0          0     0  non-token data: last expiry atime
0.000     0          0     0  non-token data: last expire atime delta
0.000     0          0     0  non-token data: last expire reduction count
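
The nspam and nham lines are the ones that matter for the "200 of each" requirement mentioned earlier in the thread. A minimal sketch of checking them automatically follows; it parses a pasted copy of the output above rather than calling sa-learn, and the helper names `magic_counts` and `bayes_ready` are made up for illustration.

```python
# First lines of the "sa-learn --dump magic" output shown above.
DUMP = """\
0.000     0          3     0  non-token data: bayes db version
0.000     0        675     0  non-token data: nspam
0.000     0        214     0  non-token data: nham
0.000     0     117579     0  non-token data: ntokens
"""

MIN_TRAINED = 200  # the per-class minimum quoted in the thread
PREFIX = "non-token data: "


def magic_counts(dump):
    """Map each 'non-token data' label to its integer value (third column)."""
    counts = {}
    for line in dump.splitlines():
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith(PREFIX):
            counts[fields[4][len(PREFIX):]] = int(fields[2])
    return counts


def bayes_ready(counts, minimum=MIN_TRAINED):
    """True once both spam and ham counts have reached the minimum."""
    return counts.get("nspam", 0) >= minimum and counts.get("nham", 0) >= minimum


counts = magic_counts(DUMP)
print(counts["nspam"], counts["nham"], bayes_ready(counts))
```

With 675 spam and 214 ham learned, both counts are over the minimum, which is consistent with BAYES_* rules showing up in the report.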


Now I'm running several thousand mixed spam/ham messages thru
procmail/SA with the magic as above.
 -------        ---------       ---=---       ---------      -------- 

.procmailrc consists of:

# -*- shell-script -*-
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
SHELL=/bin/sh
MAILDIR=/home/reader/projects/reader/proc/spool
LOGFILE=/home/reader/projects/reader/proc/log/log
ORGMAIL=/home/reader/projects/reader/proc/spool/$LOGNAME
DEFAULT=$ORGMAIL
VERBOSE=YES 
LOG="Processing <$FILENO>
"
TRAP='formail -XMessage-Id: && date +"%b %d %T%nSTOP"'

PSCRIPTS="/home/reader/projects/perl"
SCRIPTS="/home/reader/scripts/"
MAILARC="/home/reader/proc/spool"

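# Filter every message through SpamAssassin (f = filter, w = wait for exit code)
:0fw
| /usr/bin/spamc

# Messages SA tagged as spam go to the "spam" mbox (trailing ":" = use a lockfile)
:0:
* ^X-Spam-Status: Yes
spam

# Everything else is treated as ham; lock this mbox as well
:0:
ham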

-------        ---------       ---=---       ---------      -------- 

Local.cf looks like: 

ok_locales en
report_safe 0

## Trusted network
trusted_networks 192.168.1.

use_bayes 1

bayes_auto_learn 0

-------        ---------       ---=---       ---------      -------- 

The file sizes below show what happens with no learning
sessions.

-rw------- 1 reader nfsu 16878376 Sep 16 10:45 ham
-rw------- 1 reader nfsu  4406449 Sep 16 10:45 spam

There is way more ham than spam, yet my actual ham is probably
something like 10-12% of mail... probably less. So roughly 4 times
MORE ham than spam is registered, when ham should be only a small
fraction of the total.  But that is with no learning.

-------        ---------       ---=---       ---------      -------- 

The listing below shows the relative sizes of ham and spam at the
3825 mark in message count.  Ham is still way, way over what it should
be, even though this is after the learning sessions that produced the
'magic' posted above.

So I guess that is a significant improvement, although it seems like
it should be a good bit better.  Here ham:spam is closer to 3:1, and
above it is closer to 4:1.

reader > lsp
total 119741
-rw------- 1 reader nfsu 92106819 Sep 16 16:36 ham
-rw------- 1 reader nfsu 30382534 Sep 16 16:36 spam
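
For reference, the two ham:spam ratios can be computed directly from the byte counts in the listings. This is a rough sketch only: it assumes file size tracks message count closely enough to compare runs, and the labels in `LISTINGS` are descriptive names chosen here.

```python
# (ham_bytes, spam_bytes) copied from the two ls listings in this message.
LISTINGS = {
    "no learning (Sep 16 10:45)": (16878376, 4406449),
    "after learning (Sep 16 16:36)": (92106819, 30382534),
}


def ham_to_spam(ham_bytes, spam_bytes):
    """Ratio of ham bytes to spam bytes."""
    return ham_bytes / spam_bytes


def ham_share(ham_bytes, spam_bytes):
    """Fraction of all delivered bytes that landed in the ham mbox."""
    return ham_bytes / (ham_bytes + spam_bytes)


for label, (ham, spam) in LISTINGS.items():
    print(f"{label}: ham:spam = {ham_to_spam(ham, spam):.1f}:1, "
          f"ham share = {ham_share(ham, spam):.0%}")
```

If ham really is 10-12% of incoming mail, the ham share computed here should end up near that figure once filtering works well, so both runs are still far from it.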

-------        ---------       ---=---       ---------      -------- 

Do you think the ratio shown above is about normal for the amount of
learning done?

Re: Really getting discouraged... when does the learning happen?

Posted by Bart Schaefer <ba...@gmail.com>.

On Sun, Sep 15, 2013 at 7:53 PM, Harry Putnam <re...@newsguy.com> wrote:
> I've been trying to `teach' SA to tell spam from ham in my mail system.
>
> I've made it thru two main learning sessions where I ran around 450
> msgs (each time) thru sa-learn spam/ham and yet SA is still incapable
> of getting it right more than about 40 % or maybe less.

You say you've run 1100 messages through -- have at least 200 of those
been ham?  Bayes won't kick in until 200 *each* of spam and ham are
trained.  You can run "sa-learn --dump magic" to see how many of each
it believes it has seen.

If you've sa-learned enough of both types, is it possible you haven't
enabled bayes scoring?  Are the BAYES_* rules showing up at all in the
score details for newly arrived messages fed through spamc?