Posted to users@spamassassin.apache.org by Adi <ad...@gmail.com> on 2014/07/24 09:32:35 UTC

Individual pre learning - Bayes in SQL

Hello

I have per-user Bayes in SQL (one set of data per email address) on a test server.
SA is triggered by:
/usr/local/bin/spamc -U /var/run/spamd/spamd.socket -u $local_part@$domain

I looked at the results in the database and have a question.

select * from bayes_vars;

id | username    | spam_count | ham_count | token_count
 1 | a@x.x       |          1 |         8 |      3937
13 | t@x.x       |          0 |         1 |       356
15 | i@x.x       |          0 |         1 |       360


Columns skipped:
 last_expire | last_atime_delta | last_expire_reduce |
 oldest_token_age | newest_token_age |


Account id 1 is the oldest, created a few days ago.
I trained it myself.

Accounts 13 and 15 are new and have each received only one email.

Why do both accounts have a token_count of about 360 rather than 1?
Are these tokens inherited?


sa-learn -ut@x.x --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0        356          0  non-token data: ntokens
0.000          0 1406154984          0  non-token data: oldest atime
0.000          0 1406154984          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count




For id 15:
sa-learn -ui@x.x --dump magic

0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0        360          0  non-token data: ntokens
0.000          0 1406159567          0  non-token data: oldest atime
0.000          0 1406159567          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0          0          0  non-token data: last expiry atime
0.000          0          0          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

Perhaps I should run sa-learn --sync.



Second question:
Does SA pay attention to mail headers such as To, Cc, etc.?

I want to do pre-learning: collect dozens of "super" spam mails from
different accounts and, with a script, train every account in a loop:
sa-learn --spam --username=$account /spam/dir/*

Will mail addressed to another person be a problem in the learning
process?
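
A minimal sketch of such a loop in Python, assuming the account list and
message paths are supplied by the caller (the helper names
build_learn_commands and prelearn are hypothetical). It only uses the
sa-learn options shown above and defaults to a dry run that returns the
commands without executing them:

```python
import subprocess


def build_learn_commands(accounts, messages):
    """Return one sa-learn argument list per account; nothing is executed."""
    return [
        ["sa-learn", "--spam", f"--username={account}", *messages]
        for account in accounts
    ]


def prelearn(accounts, messages, dry_run=True):
    """Train every account on the same spam corpus.

    With dry_run=True (the default) the commands are only returned,
    so they can be reviewed before anything touches the Bayes data.
    """
    commands = build_learn_commands(accounts, messages)
    if not dry_run:
        for argv in commands:
            # check=True stops the loop at the first failing sa-learn call
            subprocess.run(argv, check=True)
    return commands
```

For example, prelearn(["t@x.x", "i@x.x"], ["/spam/dir/1.eml"]) returns two
command lists, one per account, and runs nothing until dry_run=False is
passed.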



Best Regards.

Re: Individual pre learning - Bayes in SQL

Posted by Adi <ad...@gmail.com>.
Hello

> OTOH if someone gets so little spam that they struggle to reach 200,
> does it matter?


I'm in the process of moving the mail accounts from a server that had
a global Bayes database (with a lot of ham/spam tokens) to one with
individual userpref/Bayes data per user.

It takes a lot of time before Bayes reaches the 200-spam threshold, and
in the meantime the user receives more than 200 "not super hard" spam
mails (scoring less than 12).

I want to speed up the process somewhat.

Of course the user can train Bayes through Roundcube (markasjunk2 plugin)
or through my script (run by cron) after copying or moving mail to an IMAP
folder (LearnOK or LearnSPAM).

Best Regards.


Re: Individual pre learning - Bayes in SQL

Posted by RW <rw...@googlemail.com>.
On Fri, 25 Jul 2014 12:21:42 +0200
Adi wrote:


> I can rewrite To/Cc in a loop to each address being trained in the
> "mega spam mails", or change To/Cc to example@example.com before
> running sa-learn.

Just delete those headers.
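
A minimal sketch of deleting those headers with Python's standard email
package before the messages are passed to sa-learn. The function name
strip_recipient_headers is hypothetical, and the sketch assumes each
message is available as raw bytes:

```python
from email import message_from_bytes, policy


def strip_recipient_headers(raw, headers=("To", "Cc")):
    """Return the message bytes with the given headers removed."""
    # compat32 keeps the remaining headers as close to the original as possible
    msg = message_from_bytes(raw, policy=policy.compat32)
    for name in headers:
        # Deleting is case-insensitive, removes every occurrence,
        # and does nothing if the header is absent.
        del msg[name]
    return msg.as_bytes()
```

The cleaned bytes can be written to a scratch directory and that
directory's files used as the training corpus, leaving the original
mails untouched.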

> I want pre-learning because at the beginning it would be hard for
> people to collect the 200 trained spam mails Bayes needs to start working.

OTOH if someone gets so little spam that they struggle to reach 200,
does it matter?

Re: Individual pre learning - Bayes in SQL

Posted by Adi <ad...@gmail.com>.
Hello

> A token is a word or some piece of derived data. It just means
> that the email contained 360 of them.


Thanks for clarifying.


>> Will mail addressed to another person be a problem in the learning
>> process?
>
> Probably not. It won't make any difference in most cases, but if
> one of those addresses is in To/Cc, and the recipient hasn't yet
> trained it as ham, there's a small chance it might.

I can rewrite To/Cc in a loop to each address being trained in the
"mega spam mails", or change To/Cc to example@example.com before
running sa-learn.

I want pre-learning because at the beginning it would be hard for
people to collect the 200 trained spam mails Bayes needs to start working.

I don't know whether it is a good idea. Normally, when Bayes was working
globally, I trained it a lot.

Best Regards

Re: Individual pre learning - Bayes in SQL

Posted by RW <rw...@googlemail.com>.
On Thu, 24 Jul 2014 09:32:35 +0200
Adi wrote:

> Hello
> 

> Accounts 13 and 15 are new and have each received only one email.
> 
> Why do both accounts have a token_count of about 360 rather than 1?
> Are these tokens inherited?

A token is a word or some piece of derived data. It just means that that
email contained 360 of them.
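
As a toy illustration of why one message yields many tokens, the sketch
below splits a message into header and body tokens and counts the
distinct ones. It is not SpamAssassin's tokenizer; the function name
toy_tokens and the splitting rules are assumptions made only for this
example:

```python
import re
from email import message_from_string


def toy_tokens(raw):
    """Return the set of distinct toy tokens found in one message."""
    msg = message_from_string(raw)
    tokens = set()
    # Header words are prefixed with the header name, so "cheap" in the
    # Subject is a different token from "cheap" in the body.
    for name, value in msg.items():
        for word in re.findall(r"[^\s,;<>()\"]+", value):
            tokens.add(f"{name}:{word.lower()}")
    body = msg.get_payload()
    if isinstance(body, str):  # single-part message only
        tokens.update(word.lower() for word in re.findall(r"\w+", body))
    return tokens
```

Even a two-header, four-word message produces seven tokens here, so a
real email with full headers easily reaches a few hundred.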


> Second question:
> Does SA pay attention to mail headers such as To, Cc, etc.?

Yes.


> I want to do pre-learning: collect dozens of "super" spam mails from
> different accounts and, with a script, train every account in a loop:
> sa-learn --spam --username=$account /spam/dir/*
> 
> Will mail addressed to another person be a problem in the learning
> process?

Probably not. It won't make any difference in most cases, but if one of
those addresses is in To/Cc, and the recipient hasn't yet trained it
as ham, there's a small chance it might.

Re: Individual pre learning - Bayes in SQL

Posted by Adi <ad...@gmail.com>.
Hello

I have per-user Bayes in SQL (one set of data per email address) on a test server.
SA is triggered by:
/usr/local/bin/spamc -U /var/run/spamd/spamd.socket -u $local_part@$domain


My Bayes doesn't auto-learn spam, only ham.

Some email users have 38 ham messages learned but 0 spam.

Some settings from the userpref table:

| $GLOBAL         | use_bayes                          | 1
| $GLOBAL         | required_score                     | 6
| $GLOBAL         | use_bayes                          | 1
| $GLOBAL         | bayes_auto_learn                   | 1
| $GLOBAL         | skip_rbl_checks                    | 0
| $GLOBAL         | bayes_auto_learn_threshold_nonspam | 0.1
| $GLOBAL         | bayes_auto_learn_threshold_spam    | 12





I know that the minimum requirement is 3 points from body rules plus
3 points from header rules.

It is strange that not a single spam has been auto-learned as spam.
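
A simplified model of that rule in Python, not SpamAssassin's actual
code. The function name and all three inputs are hypothetical; in
particular, the score SpamAssassin uses for the auto-learn decision is
computed separately and, according to its documentation, ignores some
rules (the Bayes rules, for example), so it can differ from the score
shown in X-Spam-Status:

```python
def would_autolearn_spam(autolearn_score, body_points, header_points,
                         spam_threshold=12.0, min_body=3.0, min_header=3.0):
    """Return True only if all three auto-learn-as-spam conditions hold.

    spam_threshold mirrors bayes_auto_learn_threshold_spam from the
    settings above; min_body and min_header are the 3-point minimums.
    """
    return (
        autolearn_score >= spam_threshold
        and body_points >= min_body
        and header_points >= min_header
    )
```

With these defaults, a message scoring 15.2 overall is still not learned
if, say, only 1.0 of the counted points come from body rules.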

A few example X-Spam-Status headers:
X-Spam-Status: Yes, score=15.2 required=6.0
        tests=DCC_CHECK,DIGEST_MULTIPLE,DKIM_SIGNED,DKIM_VALID,FUZZY_CREDIT,
        HEADER_FROM_DIFFERENT_DOMAINS,RAZOR2_CF_RANGE_51_100,
        RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RCVD_IN_PSBL,RP_MATCHES_RCVD,
        SPF_PASS,URIBL_DBL_SPAM,URIBL_JP_SURBL,URIBL_SC_SURBL,URIBL_WS_SURBL
        autolearn=no autolearn_force=no version=3.4.0


X-Spam-Flag: YES
X-Spam-Status: Yes, score=15.4 required=6.0
        tests=DCC_CHECK,DIGEST_MULTIPLE,DKIM_ADSP_CUSTOM_MED,DKIM_SIGNED,
        FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM,FREEMAIL_REPLYTO_END_DIGIT,
        HEADER_FROM_DIFFERENT_DOMAINS,HTML_IMAGE_RATIO_02,HTML_MESSAGE,
        NML_ADSP_CUSTOM_MED,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,
        RAZOR2_CHECK,RDNS_NONE,T_DKIM_INVALID,URIBL_DBL_SPAM,URIBL_JP_SURBL,
        URIBL_SC_SURBL,URIBL_WS_SURBL
        autolearn=no autolearn_force=no version=3.4.0



X-Spam-Status: Yes, score=13.1 required=6.0
        tests=DCC_CHECK,DIGEST_MULTIPLE,DKIM_SIGNED,
        HEADER_FROM_DIFFERENT_DOMAINS,HTML_IMAGE_RATIO_06,HTML_MESSAGE,
        RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,
        RDNS_NONE,T_DKIM_INVALID,URIBL_DBL_SPAM,URIBL_JP_SURBL,
        URIBL_SC_SURBL,URIBL_WS_SURBL
        autolearn=no autolearn_force=no version=3.4.0


X-Spam-Flag: YES
X-Spam-Status: Yes, score=14.6 required=6.0
        tests=DCC_CHECK,DIGEST_MULTIPLE,DKIM_SIGNED,DKIM_VALID,
        HEADER_FROM_DIFFERENT_DOMAINS,HTML_IMAGE_RATIO_02,HTML_MESSAGE,
        RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,
        RCVD_IN_PSBL,RP_MATCHES_RCVD,SPF_PASS,URIBL_DBL_SPAM,URIBL_JP_SURBL,
        URIBL_SC_SURBL,URIBL_WS_SURBL
        autolearn=no autolearn_force=no version=3.4.0






Do you have any ideas?



Best Regards