Posted to users@spamassassin.apache.org by Bernard <be...@rosset.me> on 2017/03/20 10:12:00 UTC

SpamAssassin score

Hello,

Using SpamAssassin, I am trying to make it learn 'bad' messages.

Experimenting with the learning process, I do not seem to be able to
reach a successful outcome:
$ spamc --username=debian-spamd --socket=/run/spamd/spamd.sock \
    --learntype=spam < spamassassin/junktestmail
Message was already un/learned
$ spamc --username=debian-spamd --socket=/run/spamd/spamd.sock -c \
    < spamassassin/junktestmail && hideme
3.7/5.0
$ spamc --username=debian-spamd --socket=/run/spamd/spamd.sock \
    --learntype=ham < spamassassin/junktestmail
Message successfully un/learned
$ spamc --username=debian-spamd --socket=/run/spamd/spamd.sock \
    --learntype=ham < spamassassin/junktestmail
Message was already un/learned
$ hideprev
$ spamc --username=debian-spamd --socket=/run/spamd/spamd.sock -c \
    < spamassassin/junktestmail && hideme
3.7/5.0
$ spamc --username=debian-spamd --socket=/run/spamd/spamd.sock \
    --learntype=spam < spamassassin/junktestmail
Message successfully un/learned
$ spamc --username=debian-spamd --socket=/run/spamd/spamd.sock \
    --learntype=spam < spamassassin/junktestmail
Message was already un/learned

 1. How come the same message returns the same score whether it has been
    learnt as spam or as ham? I would expect a message learnt as 'spam' to
    get a score at least equal to the spam score threshold.
 2. Even though the message was correctly learnt as spam before and
    after the test, this email message is still not tagged as spam when
    it is received:

    X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on ***
    X-Spam-Level: **
    X-Spam-Status: No, score=2.1 required=5.0 tests=MISSING_HEADERS,SPF_FAIL,
    	SPF_HELO_FAIL autolearn=no autolearn_force=no version=3.4.0

Am I missing something?
---
Bernard

Re: SpamAssassin score

Posted by Martin Gregorie <ma...@gregorie.org>.

On Mon, 2017-03-20 at 11:12 +0100, Bernard wrote:

....

> Am I missing something?
> 
I think so. A single message cannot change the Bayes spamminess score,
since the results would be very unstable if that were possible.
There is also a strong clue that this is designed behavior when you
consider that Bayes has no effect on spam scoring until it has learnt
200 ham AND 200 spam messages.

If you want an immediate change in spamminess scoring, you can:

- whitelist or blacklist the sender if the message source is a
  reliable indicator, e.g. blacklist a domain that is employed
  by retailers to send targeted mail to previous customers.
  Use the authorised blacklist-* and whitelist-* statements to
  do this, not the plain 'whitelist' and 'blacklist' ones.

- write a rule that explicitly specifies the recognition features 
  in messages, e.g. there may be subtle misspellings of common
  business phrases in messages sent by spammers or botnets.

  For instance, if you write a meta rule that only fires if two other
  rules both fire (one detecting selling phrases and the other looking
  for product names) then, if carefully done, this will be quite
  specific for sales spam and, once both sub rules have a reasonable
  number of alternate targets, it will reliably detect combinations
  that you haven't previously seen.
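
As a rough sketch of such a meta rule: the snippet below writes two
sub rules and the meta rule that combines them into a rules file. The
rule names, phrases, product names, score and file name are all made up
for illustration; only the body/meta/describe/score directives are
standard SpamAssassin configuration.

```shell
# Hypothetical output file; in practice the rules would live in the site
# configuration directory (for example next to local.cf).
rules_file="${RULES_FILE:-./local-sales.cf}"

cat > "$rules_file" <<'EOF'
# Sub rules whose names start with "__" carry no score of their own.
body     __LOCAL_SELL_PHRASE  /\b(?:buy now|special offer|limited time)\b/i
body     __LOCAL_PRODUCT      /\b(?:widget|gadget)s?\b/i

# The meta rule fires only when both sub rules fire.
meta     LOCAL_SALES_SPAM     (__LOCAL_SELL_PHRASE && __LOCAL_PRODUCT)
describe LOCAL_SALES_SPAM     Selling phrase combined with a product name
score    LOCAL_SALES_SPAM     3.0
EOF

# Once the file is in place, "spamassassin --lint" checks the syntax.
```

Adding more alternatives inside each sub rule's regular expression is
what widens the set of combinations the meta rule can catch.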

Martin

Re: SpamAssassin score

Posted by Bernard <be...@rosset.me>.

Thanks Reindl, David, Martin & Joe for replying!

Reindl:

> 100 each at minimum - you only trained 23 spam samples but 1729 ham,
> which is a bad balance, and you would not want bayes to kick in with
> such a bad database - how do you imagine a statistical analysis based
> on 23 samples with an order of magnitude more non-spam tokens?

It seems to actually require even more.

David:

> If you don't see any BAYES_* rule hits make sure the plugin is enabled:
>
> v320.pre:loadplugin Mail::SpamAssassin::Plugin::Bayes
>
> Run a debug lint and check for bayes output:
>
> spamassassin -D --lint 2>&1 | grep -i bayes
>
> You should see a BAYES_ in the test= line near the end.

Got it:
dbg: plugin: loading Mail::SpamAssassin::Plugin::Bayes from @INC

> Another common problem is the Bayes training is done as one user
> while spamassassin is being called by a different user.  This depends on
> how/what is launching SA -- amavis-new, spamd, MailScanner, etc.

That is normally taken care of properly.
My setup is :

  * postfix -> spamd (through spamc) -> dovecot on reception
  * dovecot's antispam plugin -> spamd (through spamc) on mail directory
    change
  * sa-learn for training

All components are invoked as the same debian-spamd user (which owns
/var/lib/spamassassin and its sub-directories and files).

Martin & Joe:

> There is also a strong clue that this is designed behavior when you
> consider that Bayes has no effect on spam scoring until it has learnt
> 200 ham AND 200 spam messages.
>
> You need to train more than 23 messages as spam first. Read the
> documentation in the SA manpages and on the wiki to make sure you meet
> every criterion for running bayes.
>
Bingo!
The spamassassin -D invocation, filtered as before, also turned up
something related:
dbg: bayes: not available for scanning, only 23 spam(s) in bayes DB < 200
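
For anyone wanting to check this ahead of time, here is a minimal
sketch. It assumes the counters come from `sa-learn --dump magic` (the
nspam/nham lines quoted elsewhere in this thread have that format) and
that the minimum is the 200 shown in the debug line. The function name
bayes_ready is made up for the example.

```shell
# Reads `sa-learn --dump magic` output on stdin and succeeds only when
# both the spam and the ham counters reach the minimum (200 unless a
# different value is passed as the first argument).
bayes_ready() {
    awk -v min="${1:-200}" '
        $NF == "nspam" { nspam = $3 }   # third column holds the count
        $NF == "nham"  { nham  = $3 }
        END {
            printf "nspam=%d nham=%d required=%d\n", nspam, nham, min
            exit !(nspam >= min && nham >= min)
        }'
}

# Typical use, run as the user that owns the Bayes database:
#   sa-learn --dump magic | bayes_ready && echo "Bayes can score"
```

With the counters from this thread (23 spam, 1729 ham) it reports
failure, because the spam count is below the minimum.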

I have no one to blacklist; I was merely testing a custom-made 'Spam
test' message, which seems to be useless (and maybe harmful in the end?).
I'll wait until I am an advanced SA user before attempting to
black/whitelist senders or write rules, unless events push me into doing
it, of course.


So far, all received messages have SpamAssassin headers, meaning the
delivery works and a small debug session on the antispam plugin seems to
show it reacts properly and sends commands to spamc correctly (hoping
the SA client + daemon handle/receive everything correctly).

All in all, I need more spam to trigger the Bayesian filter. Only
then, it seems, will I be able to tell whether it is running properly.
At least it is loaded.
I thought the database (updated daily if it works) would give it a
kick-start. I was probably mixing up separate components.

Thus I sit hanging tight, hoping for the best... Thanks for your help.
---
Bernard

Re: SpamAssassin score

Posted by David Jones <dj...@ena.com>.

>From: Reindl Harald <h....@thelounge.net>
>Sent: Monday, March 20, 2017 6:08 AM
>To: David Jones; SpamAssassin Users ML
>Subject: Re: SpamAssassin score

>Am 20.03.2017 um 11:52 schrieb David Jones:
>>> From: Bernard <be...@rosset.me>
>>> Sent: Monday, March 20, 2017 5:37 AM
>>> To: SpamAssassin Users ML
>>> Subject: Re: SpamAssassin score
>>
>>> Thanks for that information.
>>> After ~1750 messages having been digested, still no improvement:
>>> 0.000          0          3          0  non-token data: bayes db version
>>> 0.000          0         23          0  non-token data: nspam
>>> 0.000          0       1729          0  non-token data: nham
>>> 0.000          0     123471          0  non-token data: ntokens
>>> 0.000          0 1358530476          0  non-token data: oldest atime
>>> 0.000          0 1489938564          0  non-token data: newest atime
>>> 0.000          0          0          0  non-token data: last journal sync atime
>>> 0.000          0          0          0  non-token data: last expiry atime
>>> 0.000          0          0          0  non-token data: last expire atime delta
>>> 0.000          0          0          0  non-token data: last expire reduction count

>why don't you read what you quote before making assumptions?
>what does the "23" tell you?

> >> 0.000          0         23          0  non-token data: nspam
> >> 0.000          0       1729          0  non-token data: nham

>to me it says: too few sample messages

Sorry.  Honest mistake.  I was looking at that on a small laptop screen.

Even after the OP trains 200 spam, there could still be a problem, and
my suggestions below could help the OP or others with it.  Don't be so
critical.  Just let some of this stuff go without responding.  Others
did and gave good, positive advice to check the SA wiki.

> If you don't see any BAYES_* rule hits make sure the plugin is enabled:
>
> v320.pre:loadplugin Mail::SpamAssassin::Plugin::Bayes
>
> Run a debug lint and check for bayes output:
>
> spamassassin -D --lint 2>&1 | grep -i bayes
>
> You should see a BAYES_ in the test= line near the end.
>
> Another common problem is the Bayes training is done as one user
> while spamassassin is being called by a different user.  This depends on
> how/what is launching SA -- amavis-new, spamd, MailScanner, etc.
>
> You can force the bayes_path in the local.cf to make sure all users
> use the same Bayes DB if you are using a global (not individual) Bayes
> DB.

Re: SpamAssassin score

Posted by David Jones <dj...@ena.com>.

>From: Bernard <be...@rosset.me>
>Sent: Monday, March 20, 2017 5:37 AM
>To: SpamAssassin Users ML
>Subject: Re: SpamAssassin score
  
>Thanks for that information.
>After ~1750 messages having been digested, still no improvement:
>0.000          0          3          0  non-token data: bayes db version
>0.000          0         23          0  non-token data: nspam
>0.000          0       1729          0  non-token data: nham
>0.000          0     123471          0  non-token data: ntokens
>0.000          0 1358530476          0  non-token data: oldest atime
>0.000          0 1489938564          0  non-token data: newest atime
>0.000          0          0          0  non-token data: last journal sync atime
>0.000          0          0          0  non-token data: last expiry atime
>0.000          0          0          0  non-token data: last expire atime delta
>0.000          0          0          0  non-token data: last expire reduction count
> Have you got an idea of the required order of magnitude of the input volume for the bayesian filter to kick in?

>On 20/03/2017 11:15, Reindl Harald wrote: 

>Am 20.03.2017 um 11:12 schrieb Bernard: 
> 1. How come the same message being classified either as spam/ham
>    returns the same score? I would expect a message learnt as 'spam' to 
>    get a score at least equal to the spam score threshold 
> 2. Even though the message was correctly learnt as spam before and 
>    after the test, receiving this email message is still not tagged as 
>    spam: 

>    X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on *** 
>    X-Spam-Level: ** 
>    X-Spam-Status: No, score=2.1 required=5.0 tests=MISSING_HEADERS,SPF_FAIL, 
>        SPF_HELO_FAIL autolearn=no autolearn_force=no version=3.4.0 

>Am I missing something? 

>yes, train your bayes properly with enough spam *and* ham samples and
>train the bayes which is really in use - do you see any BAYES_ tag
>above? no! so bayes was not used at all

If you don't see any BAYES_* rule hits make sure the plugin is enabled:

v320.pre:loadplugin Mail::SpamAssassin::Plugin::Bayes

Run a debug lint and check for bayes output:

spamassassin -D --lint 2>&1 | grep -i bayes

You should see a BAYES_ in the test= line near the end.

Another common problem is the Bayes training is done as one user
while spamassassin is being called by a different user.  This depends on
how/what is launching SA -- amavis-new, spamd, MailScanner, etc.

You can force the bayes_path in the local.cf to make sure all users
use the same Bayes DB if you are using a global (not individual) Bayes
DB.
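
A sketch of what forcing the path might look like. The directory, the
output file name and the 0770 mode are assumptions to adapt to the
local setup. bayes_path and bayes_file_mode are standard SpamAssassin
options, and bayes_path is a file name prefix, not a directory.

```shell
# Hypothetical locations; the fragment would normally be merged into
# local.cf rather than kept in a separate file.
conf_file="${CONF_FILE:-./local-bayes.cf}"
bayes_dir="${BAYES_DIR:-/var/lib/spamassassin/bayes}"

cat > "$conf_file" <<EOF
# Point every user at the same Bayes database files. SpamAssassin
# appends suffixes such as _toks and _seen to this prefix.
bayes_path      $bayes_dir/bayes
# Let the owning user and group read and write the shared files.
bayes_file_mode 0770
EOF
```

The user that trains and the user that scans both need write access to
that directory for this to work.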

Dave

Re: SpamAssassin score

Posted by Joe Quinn <he...@gmail.com>.

On 3/20/2017 6:37 AM, Bernard wrote:
>
> Thanks for that information.
>
> After ~1750 messages having been digested, still no improvement:
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0         23          0  non-token data: nspam
> 0.000          0       1729          0  non-token data: nham
> 0.000          0     123471          0  non-token data: ntokens
> 0.000          0 1358530476          0  non-token data: oldest atime
> 0.000          0 1489938564          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0          0          0  non-token data: last expiry atime
> 0.000          0          0          0  non-token data: last expire atime delta
> 0.000          0          0          0  non-token data: last expire reduction count
>
> Have you got an idea of the required order of magnitude of the input 
> volume for the bayesian filter to kick in?
> ---
> Bernard
>
> On 20/03/2017 11:15, Reindl Harald wrote:
>>
>>
>> Am 20.03.2017 um 11:12 schrieb Bernard:
>>>  1. How come the same message being classified either as spam/ham
>>>     returns the same score? I would expect a message learnt as 'spam' to
>>>     get a score at least equal to the spam score threshold
>>>  2. Even though the message was correctly learnt as spam before and
>>>     after the test, receiving this email message is still not tagged as
>>>     spam:
>>>
>>>     X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on ***
>>>     X-Spam-Level: **
>>>     X-Spam-Status: No, score=2.1 required=5.0 tests=MISSING_HEADERS,SPF_FAIL,
>>>         SPF_HELO_FAIL autolearn=no autolearn_force=no version=3.4.0
>>>
>>> Am I missing something?
>>
>> yes, train your bayes properly with enough spam *and* ham samples
>> and train the bayes which is really in use - do you see any BAYES_
>> tag above? no! so bayes was not used at all

You need to train more than 23 messages as spam first. Read the
documentation in the SA manpages and on the wiki to make sure you meet
every criterion for running bayes.

Re: SpamAssassin score

Posted by Bernard <be...@rosset.me>.

Thanks for that information.

After ~1750 messages having been digested, still no improvement:
0.000          0          3          0  non-token data: bayes db version
0.000          0         23          0  non-token data: nspam
0.000          0       1729          0  non-token data: nham
0.000          0     123471          0  non-token data: ntokens
0.000          0 1358530476          0  non-token data: oldest atime
0.000          0 1489938564          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

Have you got an idea of the required order of magnitude of the input
volume for the bayesian filter to kick in?
---
Bernard

On 20/03/2017 11:15, Reindl Harald wrote:
>
>
> Am 20.03.2017 um 11:12 schrieb Bernard:
>>  1. How come the same message being classified either as spam/ham
>>     returns the same score? I would expect a message learnt as 'spam' to
>>     get a score at least equal to the spam score threshold
>>  2. Even though the message was correctly learnt as spam before and
>>     after the test, receiving this email message is still not tagged as
>>     spam:
>>
>>     X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on ***
>>     X-Spam-Level: **
>>     X-Spam-Status: No, score=2.1 required=5.0 tests=MISSING_HEADERS,SPF_FAIL,
>>         SPF_HELO_FAIL autolearn=no autolearn_force=no version=3.4.0
>>
>> Am I missing something?
>
> yes, train your bayes properly with enough spam *and* ham samples and
> train the bayes which is really in use - do you see any BAYES_ tag
> above? no! so bayes was not used at all