Posted to users@spamassassin.apache.org by Jason Haar <Ja...@trimble.com> on 2014/10/01 18:39:13 UTC

is my bayes working properly?

Hi there

We're using SA-3.4.0 with REDIS for bayes and until now what I've
normally seen is that if a piece of spam got through and got (say)
BAYES_00, then I could throw it back in via "spamc -L spam" and
immediately see it jump up to BAYES_50

Well today I threw in such an email, "spamc -L spam" returned it was
learnt and re-running it returns "message was already un/learned", but
running it through "spamc" still returns BAYES_00 - no sign that Bayes
has altered the score!

Why didn't the score change when explicitly told it was spam? Is that a
"statistics thing", or has something gone wrong with my Bayes?

Thanks

-- 
Cheers

Jason Haar
Corporate Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1

Re: is my bayes working properly?

Posted by Reindl Harald <h....@thelounge.net>.

On 01.10.2014 at 18:39, Jason Haar wrote:
> We're using SA-3.4.0 with REDIS for bayes and until now what I've
> normally seen is that if a piece of spam got through and got (say)
> BAYES_00, then I could throw it back in via "spamc -L spam" and
> immediately see it jump up to BAYES_50
> 
> Well today I threw in such an email, "spamc -L spam" returned it was
> learnt and re-running it returns "message was already un/learned", but
> running it through "spamc" still returns BAYES_00 - no sign that Bayes
> has altered the score!
> 
> Why didn't the score change when explicitly told it was spam? Is that a
> "statistics thing", or has something gone wrong with my Bayes?

Don't expect a single mail to change Bayes that much. If it worked
that way, every single spam training message would risk increasing
false positives, and every single ham training message would risk
trashing your spam detection.

If a single message does have that effect, your Bayes is not really
well trained with both spam and ham. After enough training it reacts
that way only if the message has a unique footprint and no neutralized
elements (neutralized = exists in spam as well as ham).

Please consider reading up on how Bayes works.
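
As a rough illustration of the point above (a sketch, not SpamAssassin's
actual code): assume a simplified per-token estimate, namely the token's
relative frequency in spam divided by the sum of its relative frequencies
in spam and ham. The real classifier additionally smooths these values and
combines many tokens, so only the trend matters here. The nspam/nham totals
are the ones quoted in this thread; the per-token counts are made up.

```python
def token_spam_probability(spam_hits, ham_hits, nspam, nham):
    """Simplified estimate of how spammy one token looks (NOT SpamAssassin's exact formula)."""
    if spam_hits == 0 and ham_hits == 0:
        return 0.5  # never-seen token: no evidence either way
    spam_freq = spam_hits / nspam
    ham_freq = ham_hits / nham
    return spam_freq / (spam_freq + ham_freq)


NSPAM, NHAM = 3436572, 1475976  # learned message totals quoted in this thread

# Hypothetical common token: already seen in 200 spam and 5000 ham messages.
common_before = token_spam_probability(200, 5000, NSPAM, NHAM)
common_after = token_spam_probability(201, 5000, NSPAM + 1, NHAM)

# Hypothetical rare token: seen in 2 ham messages and never in spam.
rare_before = token_spam_probability(0, 2, NSPAM, NHAM)
rare_after = token_spam_probability(1, 2, NSPAM + 1, NHAM)

print(f"common token: {common_before:.5f} -> {common_after:.5f}")
print(f"rare token:   {rare_before:.5f} -> {rare_after:.5f}")
```

Under these assumptions the common token hardly moves after one more spam
training, while the rare token shifts noticeably. That is the "unique
footprint" case described above.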

Re: is my bayes working properly?

Posted by Amir Caspi <ce...@3phase.com>.

On Oct 2, 2014, at 9:19 AM, Amir Caspi <Ce...@3phase.com> wrote:

> On Oct 1, 2014, at 3:17 PM, Axb <ax...@gmail.com> wrote:
> 
>> have you tried "-L forget" before "-L spam" ?
> 
> I thought the documentation said that if a message had previously been learned as ham, learning it as spam would auto-forget it first, and similarly for spam->ham training.  Is the documentation incorrect, and is manually forgetting beforehand required?

Oops, I see that this was already covered a few replies back.  Sorry, ignore. =)

--- Amir

Re: is my bayes working properly?

Posted by Amir Caspi <ce...@3phase.com>.

On Oct 1, 2014, at 3:17 PM, Axb <ax...@gmail.com> wrote:

> have you tried "-L forget" before "-L spam" ?

I thought the documentation said that if a message had previously been learned as ham, learning it as spam would auto-forget it first, and similarly for spam->ham training.  Is the documentation incorrect, and is manually forgetting beforehand required?

--- Amir

Re: is my bayes working properly?

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

>On 02/10/14 10:17, Axb wrote:
>> have you tried "-L forget" before "-L spam" ?
>>
>> sa-learn --dump magic  before and after learning should show a
>> difference...

On 02.10.14 14:10, Jason Haar wrote:
>I didn't do a "forget" before - I'll remember that, thanks.

This is not usually needed. Axb recommended it only to verify whether the
message had been learned before. Re-learning has exactly the same effect.

Note that there was recently a change in how messages are identified.
The issue was that SA used the first Received: line to identify a message,
but e.g. spamass-milter provides its own Received: line (because at milter
time this line has not been added yet), so the same message can get a
different ID that can't be tracked later, which makes it very hard to
re-train or forget an incorrectly learned mail.
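
To illustrate why an extra Received: line matters, here is a minimal sketch.
It assumes a tracking ID built by hashing a few message parts together; the
function name, the choice of parts, and the header lines are all hypothetical
and do not reproduce SpamAssassin's real ID calculation.

```python
import hashlib


def tracking_id(date_header, body, first_received=None):
    """Hypothetical tracking ID: SHA-1 over selected message parts (illustration only)."""
    parts = [date_header]
    if first_received is not None:
        parts.append(first_received)
    parts.append(body)
    return hashlib.sha1("\0".join(parts).encode("utf-8")).hexdigest()


date = "Wed, 01 Oct 2014 18:39:13 +0000"
body = "Buy cheap pills now\n"

# Same message, seen once with a milter's synthetic Received: line and once
# with the line the MTA wrote at delivery time (both lines are invented).
at_milter = tracking_id(date, body, "from unknown by mx.example.com (milter)")
at_relearn = tracking_id(date, body, "from unknown by mx.example.com with ESMTP")

# Leaving the Received: line out of the hash gives the same ID both times.
stable_a = tracking_id(date, body)
stable_b = tracking_id(date, body)

print("IDs match with Received: included:", at_milter == at_relearn)
print("IDs match with Received: excluded:", stable_a == stable_b)
```

If the ID depends on a header that differs between scan time and re-training
time, the learner cannot find its earlier record of the message.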

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Saving Private Ryan...
Private Ryan exists. Overwrite? (Y/N)

Re: is my bayes working properly?

Posted by Axb <ax...@gmail.com>.

On 10/02/2014 08:50 AM, Matus UHLAR - fantomas wrote:
>> sa-learn --dump magic
>> 0.000          0          3          0  non-token data: bayes db version
>> 0.000          0    3436572          0  non-token data: nspam
>> 0.000          0    1475976          0  non-token data: nham
>> 0.000          0          0          0  non-token data: ntokens
>> 0.000          0          0          0  non-token data: oldest atime
>> 0.000          0          0          0  non-token data: newest atime
>> 0.000          0          0          0  non-token data: last journal sync atime
>> 0.000          0          0          0  non-token data: last expiry atime
>> 0.000          0          0          0  non-token data: last expire atime delta
>> 0.000          0          0          0  non-token data: last expire reduction count
>
> this looks like you have no tokens at all.

With the Redis backend, that is what it looks like, as in:

  sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0   28034140          0  non-token data: nspam
0.000          0   13122464          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0          0          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

Re: is my bayes working properly?

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

...continuing, sorry for multiple replies...

On 02.10.14 14:10, Jason Haar wrote:
>However, it's been 7 hours since I sent my first email and now the same
>message is BAYES_20 - so it is "learning" something - just took longer
>than I was used to I guess. We use site-wide SA and don't really
>hand-feed the bayes (too hard for our users: Exchange backends, SA
>frontend), so there is over 200% more nspam tokens than nham - could
>that cause a problem?

that could cause false positives, but should not cause false negatives.

> sa-learn --dump magic
>0.000          0          3          0  non-token data: bayes db version
>0.000          0    3436572          0  non-token data: nspam
>0.000          0    1475976          0  non-token data: nham
>0.000          0          0          0  non-token data: ntokens
>0.000          0          0          0  non-token data: oldest atime
>0.000          0          0          0  non-token data: newest atime
>0.000          0          0          0  non-token data: last journal sync atime
>0.000          0          0          0  non-token data: last expiry atime
>0.000          0          0          0  non-token data: last expire atime delta
>0.000          0          0          0  non-token data: last expire reduction count

this looks like you have no tokens at all.
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"They say when you play that M$ CD backward you can hear satanic messages."
"That's nothing. If you play it forward it will install Windows."

Re: is my bayes working properly?

Posted by Axb <ax...@gmail.com>.

On 10/02/2014 03:10 AM, Jason Haar wrote:
> On 02/10/14 10:17, Axb wrote:
>>
>> have you tried "-L forget" before "-L spam" ?
>>
>> sa-learn --dump magic  before and after learning should show a
>> difference...
>
> I didn't do a "forget" before - I'll remember that, thanks. As far as
> "before, after" goes for the dump - not an option. We're receiving 6-12
> messages per second, "--dump magic" is *always* different :-)
>
> However, it's been 7 hours since I sent my first email and now the same
> message is BAYES_20 - so it is "learning" something - just took longer
> than I was used to I guess. We use site-wide SA and don't really
> hand-feed the bayes (too hard for our users: Exchange backends, SA
> frontend), so there is over 200% more nspam tokens than nham - could
> that cause a problem?
>
>
>   sa-learn --dump magic
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0    3436572          0  non-token data: nspam
> 0.000          0    1475976          0  non-token data: nham
> 0.000          0          0          0  non-token data: ntokens
> 0.000          0          0          0  non-token data: oldest atime
> 0.000          0          0          0  non-token data: newest atime
> 0.000          0          0          0  non-token data: last journal sync atime
> 0.000          0          0          0  non-token data: last expiry atime
> 0.000          0          0          0  non-token data: last expire atime delta
> 0.000          0          0          0  non-token data: last expire reduction count

As I see it, it's not a problem.
In corporate traffic, ham patterns/tokens tend to be pretty constant
while spam patterns/tokens change far more often.

In my case, at the moment I have

0.000          0   28032453          0  non-token data: nspam
0.000          0   13119717          0  non-token data: nham

and have no BAYES_99 hitting ham.
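
For reference, the spam-to-ham ratios behind the two sets of numbers in this
thread can be worked out directly. Note that nspam and nham count learned
messages, not tokens. The figures are copied from the dumps quoted here; the
variable names are only for illustration.

```python
# (nspam, nham) pairs copied from the "sa-learn --dump magic" output in this thread.
dumps = {
    "Jason": (3436572, 1475976),
    "Axb": (28032453, 13119717),
}

ratios = {}
for site, (nspam, nham) in dumps.items():
    ratios[site] = nspam / nham
    print(f"{site}: {ratios[site]:.2f} learned spam messages per learned ham message")
```

Both sites come out a little above two spam messages per ham message, so the
two setups are similarly skewed.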

On production boxes, SA sees very little spam (most gets rejected).
To compensate I feed spam from a separate trap box which autolearns 
EVERYTHING it gets as spam (no rejects).

I also keep different token TTLs for spam and ham:
autolearn on production boxes has a 7-day token TTL;
autolearn on the trap box has a 5-day token TTL.

Redis memory usage is pretty constant:

# Clients
connected_clients:99
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:4035596240
used_memory_human:3.76G
used_memory_rss:4403003392
used_memory_peak:4306083208
used_memory_peak_human:4.01G
used_memory_lua:109568
mem_fragmentation_ratio:1.09
mem_allocator:jemalloc-3.2.0

h2h

Axb

Re: is my bayes working properly?

Posted by Jason Haar <Ja...@trimble.com>.

On 02/10/14 10:17, Axb wrote:
>
> have you tried "-L forget" before "-L spam" ?
>
> sa-learn --dump magic  before and after learning should show a
> difference...

I didn't do a "forget" before - I'll remember that, thanks. As far as
"before, after" goes for the dump - not an option. We're receiving 6-12
messages per second, "--dump magic" is *always* different :-)

However, it's been 7 hours since I sent my first email and now the same
message is BAYES_20 - so it is "learning" something - just took longer
than I was used to I guess. We use site-wide SA and don't really
hand-feed the bayes (too hard for our users: Exchange backends, SA
frontend), so there is over 200% more nspam tokens than nham - could
that cause a problem?

 sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0    3436572          0  non-token data: nspam
0.000          0    1475976          0  non-token data: nham
0.000          0          0          0  non-token data: ntokens
0.000          0          0          0  non-token data: oldest atime
0.000          0          0          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

-- 
Cheers

Jason Haar
Corporate Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1

Re: is my bayes working properly?

Posted by Axb <ax...@gmail.com>.

On 10/01/2014 06:39 PM, Jason Haar wrote:
> Hi there
>
> We're using SA-3.4.0 with REDIS for bayes and until now what I've
> normally seen is that if a piece of spam got through and got (say)
> BAYES_00, then I could throw it back in via "spamc -L spam" and
> immediately see it jump up to BAYES_50
>
> Well today I threw in such an email, "spamc -L spam" returned it was
> learnt and re-running it returns "message was already un/learned", but
> running it through "spamc" still returns BAYES_00 - no sign that Bayes
> has altered the score!
>
> Why didn't the score change when explicitly told it was spam? Is that a
> "statistics thing", or has something gone wrong with my Bayes?

have you tried "-L forget" before "-L spam" ?

sa-learn --dump magic  before and after learning should show a difference...
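
A minimal sketch of that before/after comparison, assuming the two dumps have
been captured as text (for example by redirecting the command output to
files). The helper names are hypothetical and the "after" numbers are invented
for the example. On a busy server other traffic changes the counters too, so a
difference only shows that learning is happening, not which message caused it.

```python
def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        line = line.lstrip("> ")  # tolerate mail quoting in pasted output
        if "non-token data:" not in line:
            continue  # skips blank lines and wrapped label fragments
        columns, label = line.split("non-token data:", 1)
        fields = columns.split()
        if len(fields) >= 3:
            values[label.strip()] = int(fields[2])
    return values


def dump_delta(before_text, after_text):
    """Return only the counters whose value changed between two dumps."""
    before = parse_dump_magic(before_text)
    after = parse_dump_magic(after_text)
    return {
        key: after[key] - before[key]
        for key in before
        if key in after and after[key] != before[key]
    }


BEFORE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0    3436572          0  non-token data: nspam
0.000          0    1475976          0  non-token data: nham
"""

# Invented "after" snapshot: one more message learned as spam.
AFTER = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0    3436573          0  non-token data: nspam
0.000          0    1475976          0  non-token data: nham
"""

delta = dump_delta(BEFORE, AFTER)
print(delta)
```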