Posted to users@spamassassin.apache.org by Jean Caron <ca...@norac.net> on 2005/04/13 17:53:15 UTC

sa-learn - bayes training...

Folks, 

I searched the archive, tried different things, yet I need to ask a few 
questions. 

I'm running SA 3.0.2 with Qmail/QQ 1.25, and procmail, on linux. Works 
great. Bayes auto-learns ok, I run sa-learn from a "dedicated" user every 
night for ham and spam. My logs show how many msgs were inspected and how 
many were learned. So far so good. 

Here's the part I'm unsure of: I have one centralized bayes DB owned by this 
"dedicated" user. This user runs sa-learn against two shared folders, one 
for ham and one for spam. All users (only a handful) may populate the 
shared folders. Many thousands of msgs have gone through sa-learn. I thought 
this was all too easy... 

My problem is that bayes does not seem to have any effect whatsoever on the 
amount of spam delivered to INBOXes. I still keep receiving these low-score 
spam msgs. 

I now suspect this centralized DB, updated by this user alone, may not 
produce the expected results. I've read in the archive that individual users 
should run cron jobs against their own ham and spam folders. The issue with 
this is that only one user has an actual shell defined on the system, so the 
others can't run cron. Then again, that's just a suspicion; I may be wrong, 
and something else may be missing or misconfigured, which is why I'm 
posting this. I'm a little confused. I don't understand how bayes works 
exactly, so I can't come to any helpful conclusion about my setup. 

Can anyone see through this and help me understand what is happening ?
Thanks in advance,
Jean 


Re: sa-learn - bayes training...

Posted by Jean Caron <ca...@norac.net>.
I just had a chance to (finally) get back to this issue. I tried your 
suggestion, changed the mode to 0777 and re-started spamd. Apparently 
nothing changed. 

I did however realize that bayes tests are listed in my log file, even 
though they are not in the header of the msgs. 

So, I have bayes autolearn working fine. The database is also fine (> 6000 
ham & spam learned). My logs show all that's expected. The message headers 
are missing the list of Bayes tests, but are otherwise fine. spamassassin 
--lint returns no errors. I have the SARE rules installed. Running qmail, 
with qmail-scanner v1.25 and SA 3.0.2. Everything works fine... 

Yet, I still have a lot of spam (I know that's relative) that slips through, 
more than before this SA upgrade. To show some numbers, I used to get a 
couple of false negatives per day, if any, before the upgrade; now I get 
anywhere from half a dozen to two dozen. Still much better than the 500 
without SA, but not quite fine-tuned enough for my taste. 

Any suggestions as to where to look next would be appreciated.
Cheers,
Jean 

Matt Kettler writes: 

> Jean Caron wrote: 
> 
>>
>> Here are the bayes-related settings I had in there already:
>> use_bayes 1
>> bayes_path              /home/bayesUID/bayes
>> bayes_file_mode         0666
>> bayes_auto_learn 1
>> Jean 
> 
> Suggestion: set bayes_file_mode to 0777 not 0666. 
> 
> The bayes_file_mode is really a mask not literal permissions, so it
> won't result in executable bits being set for your bayes files. However,
> this mask is sometimes used in directory creation, where the x bit is
> quite appropriate. 
> 
> This is why the default is 0700, not 0600. 
> 
>   
> 
 


Re: sa-learn - bayes training...

Posted by Matt Kettler <mk...@evi-inc.com>.
Jean Caron wrote:

>
> Here are the bayes-related settings I had in there already:
> use_bayes 1
> bayes_path              /home/bayesUID/bayes
> bayes_file_mode         0666
> bayes_auto_learn 1
> Jean 

Suggestion: set bayes_file_mode to 0777 not 0666.

The bayes_file_mode is really a mask not literal permissions, so it
won't result in executable bits being set for your bayes files. However,
this mask is sometimes used in directory creation, where the x bit is
quite appropriate.

This is why the default is 0700, not 0600.
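
To illustrate the mask behaviour described above, here is a minimal Python sketch. It assumes, as stated, that execute bits are dropped when a database file is created while the full mode is used when a directory is created. The function names are hypothetical and this is not SpamAssassin's own code.

```python
import stat

EXEC_BITS = 0o111  # execute bits for user, group and other


def file_perms(bayes_file_mode: int) -> int:
    """Permissions a Bayes database file would get (assumption: x bits are dropped)."""
    return bayes_file_mode & ~EXEC_BITS


def dir_perms(bayes_file_mode: int) -> int:
    """Permissions a directory created with the same mode would get (assumption: mode used as-is)."""
    return bayes_file_mode


# Compare the setting from the thread (0666), the suggestion (0777) and the default (0700).
for mode in (0o666, 0o777, 0o700):
    print(
        f"bayes_file_mode {mode:04o}: "
        f"files {stat.filemode(stat.S_IFREG | file_perms(mode))}, "
        f"dirs {stat.filemode(stat.S_IFDIR | dir_perms(mode))}"
    )
```

Under these assumptions, 0666 and 0777 give the same file permissions, but only 0777 leaves the x bits that a directory needs in order to be entered.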

 

Re: sa-learn - bayes training...

Posted by Jean Caron <ca...@norac.net>.
Alright. I find it strange that the defaults don't apply to my setup, but in 
any case I added the following to local.cf and re-started spamd.
> add_header all Status _YESNO_, score=_SCORE_ required=_REQD_ tests=_TESTS_

Here are the bayes-related settings I had in there already: 

use_bayes 1
bayes_path              /home/bayesUID/bayes
bayes_file_mode         0666
bayes_auto_learn 1 

Jean 


Kevin Peuhkurinen writes: 

> Jean Caron wrote: 
> 
>> Really ? I never saw bayes score in the header. Should ALL msgs have a 
>> bayes score in the header ? Here's a sample header;
>> Received: from 80.231.10.208 by mail (envelope-from 
>> <ol...@business-kc.com>, uid 1001) with qmail-scanner-1.25 
>> (spamassassin: 3.0.2. Clear:RC:0(80.231.10.208):SA:0(1.5/2.0):. Processed 
>> in 3.859362 secs); 14 Apr 2005 07:18:05 -0000
>> X-Spam-Status:     No, hits=1.5 required=2.0
>> X-Spam-Level:     +
>> Did I miss such an obvious switch somewhere ??
>> Jean 
>> 
> For some reason, SA is not adding the tests that the email hit in the 
> X-Spam-Status header, as is the default.   Without this information, it's 
> difficult to tell what is going on.    Look in your local.cf file for 
> either a "remove_header" or "add_header" entry.    Remove (or comment out) 
> any of the former and if you have any of the latter, make sure they read: 
> 
> add_header all Status _YESNO_, score=_SCORE_ required=_REQD_ tests=_TESTS_ 
> autolearn=_AUTOLEARN_ version=_VERSION_ 
> 
> 
> After making the change, be sure to restart spamd.   Then begin to monitor 
> your false negatives.   The headers should then show which tests are hit.  
> Look for BAYES tests and see which they are hitting. 
> 
> 
 


Re: sa-learn - bayes training...

Posted by Kevin Peuhkurinen <ke...@meridiancu.ca>.
Jean Caron wrote:

> Really ? I never saw bayes score in the header. Should ALL msgs have a 
> bayes score in the header ? Here's a sample header;
> Received: from 80.231.10.208 by mail (envelope-from 
> <ol...@business-kc.com>, uid 1001) with qmail-scanner-1.25 
> (spamassassin: 3.0.2. Clear:RC:0(80.231.10.208):SA:0(1.5/2.0):. 
> Processed in 3.859362 secs); 14 Apr 2005 07:18:05 -0000
> X-Spam-Status:     No, hits=1.5 required=2.0
> X-Spam-Level:     +
> Did I miss such an obvious switch somewhere ??
> Jean
>
For some reason, SA is not adding the tests that the email hit in the 
X-Spam-Status header, as is the default.   Without this information, 
it's difficult to tell what is going on.    Look in your local.cf file 
for either a "remove_header" or "add_header" entry.    Remove (or 
comment out) any of the former and if you have any of the latter, make 
sure they read:

add_header all Status _YESNO_, score=_SCORE_ required=_REQD_ tests=_TESTS_ autolearn=_AUTOLEARN_ version=_VERSION_


After making the change, be sure to restart spamd.   Then begin to 
monitor your false negatives.   The headers should then show which tests 
are hit.   Look for BAYES tests and see which they are hitting.
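
Once the tests= field shows up, picking out the BAYES hits can be scripted. The following is a rough Python sketch, not part of SpamAssassin. The function name and the first sample header value are made up for illustration, and it assumes the header follows the add_header layout shown above, possibly folded after a comma.

```python
import re


def bayes_tests(status_value: str) -> list[str]:
    """Return the BAYES_* rule names found in an X-Spam-Status header value."""
    # A long tests= list may be folded onto continuation lines after a comma,
    # so drop any whitespace that directly follows a comma.
    joined = re.sub(r",\s+", ",", status_value)
    match = re.search(r"tests=(\S*)", joined)
    if not match:
        return []  # the header carries no tests= field at all
    return [name for name in match.group(1).split(",") if name.startswith("BAYES_")]


# Hypothetical header value in the add_header format, folded as mail headers often are.
with_tests = (
    "No, score=1.5 required=2.0 tests=BAYES_50,HTML_MESSAGE,\n"
    "\tMIME_HTML_ONLY autolearn=no version=3.0.2"
)
# Header value as it appears in this thread, without any tests= field.
without_tests = "No, hits=1.5 required=2.0"

print(bayes_tests(with_tests))
print(bayes_tests(without_tests))
```

An empty result on a false negative means either that no BAYES rule fired or, as with the header in this thread, that the tests list is not being written at all.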



Re: sa-learn - bayes training...

Posted by Jean Caron <ca...@norac.net>.
Really ? I never saw bayes score in the header. Should ALL msgs have a bayes 
score in the header ? Here's a sample header; 

Received: from 80.231.10.208 by mail (envelope-from 
<ol...@business-kc.com>, uid 1001) with qmail-scanner-1.25 (spamassassin: 
3.0.2. Clear:RC:0(80.231.10.208):SA:0(1.5/2.0):. Processed in 3.859362 
secs); 14 Apr 2005 07:18:05 -0000
X-Spam-Status: 	No, hits=1.5 required=2.0
X-Spam-Level: 	+ 

Did I miss such an obvious switch somewhere ??
Jean 


Phil Barnett writes: 

> On Friday 15 April 2005 08:03 am, Jean Caron wrote: 
> 
>> Again, how can I tell for sure ?
> 
> Look in the header and see what the bayes score was on the FN. 
> 
> --  
> 
> "In the beginning of a change, the patriot is a brave and scarce man, hated 
> and scorned. When the cause succeeds, however, the timid join him...for then 
> it costs nothing to be a patriot." -Mark Twain  
> 


Re: sa-learn - bayes training...

Posted by Phil Barnett <ph...@philb.us>.
On Friday 15 April 2005 08:03 am, Jean Caron wrote:

> Again, how can I tell for sure ?

Look in the header and see what the bayes score was on the FN.

-- 

"In the beginning of a change, the patriot is a brave and scarce man, hated 
and scorned. When the cause succeeds, however, the timid join him...for then 
it costs nothing to be a patriot." -Mark Twain 

Re: sa-learn - bayes training...

Posted by Jean Caron <ca...@norac.net>.
Kevin, my comments/questions are inline. 

Kevin Peuhkurinen writes: 

> Jean Caron wrote: 
> 
>> Kevin, your assumption is correct, user accounts are on the server and 
>> spamc is used. I already have the central DB setup using bayes_path in 
>> local.cf.
>> I think what you are saying confirms what I suspected, but it's still not 
>> 100% clear. Even though I have a central DB, all users must train it 
>> individually, is that it ?
>> For example, if UserA populates the shared folders respectively with ham 
>> and spam from messages he/she received, if UserB trains the central DB 
>> against those msgs, it will have no effect for UserA ? All users must 
>> individually train the central DB even though they train using the same 
>> msgs from the same shared folders ?
>> Sorry if I seem a little dense, but I think I'm getting it. I hope !
>> Jean 
>> 
> If you have bayes_path set, then all users should be using just the one 
> DB, and any training that one user does will affect the results for all 
> other users.   

Hummm... That's what I *thought*, but then the results led me to believe 
otherwise, and now you are confirming that only one user can learn for all. 

> So, presuming that the permissions on the Bayes files are 
> set correctly so that all of your users have access to it, it would seem 
> that you do have things set up properly.

I thought so, but something is not doing its "thing". 

> 
> It is possible that the database is corrupt.

How can I tell for sure ? As far as I can tell, using spamassassin --lint, 
sa-learn --dump, etc. the results seem to indicate a healthy DB. 
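
One rough way to sanity-check the counts is to parse the output of sa-learn --dump magic. The Python sketch below is only an illustration: the sample text imitates the usual layout of that output with made-up numbers, the function names are hypothetical, and the 200/200 thresholds are, to my knowledge, the defaults of bayes_min_spam_num and bayes_min_ham_num.

```python
import re

# Illustrative text in the layout "sa-learn --dump magic" normally prints (numbers are invented).
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       3100          0  non-token data: nspam
0.000          0       2950          0  non-token data: nham
0.000          0     125000          0  non-token data: ntokens
"""


def magic_counts(dump_text: str) -> dict[str, int]:
    """Map each 'non-token data' label to the integer in the third column."""
    counts = {}
    for line in dump_text.splitlines():
        m = re.match(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+)$", line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return counts


def bayes_active(counts: dict[str, int], min_spam: int = 200, min_ham: int = 200) -> bool:
    """True when enough ham and spam have been learned for Bayes to score mail."""
    return counts.get("nspam", 0) >= min_spam and counts.get("nham", 0) >= min_ham


counts = magic_counts(SAMPLE_DUMP)
print(counts)
print("Bayes should be scoring:", bayes_active(counts))
```

Healthy counts only show that training reached this database; they do not show that spamd is reading the same one.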

> Have you in fact 
> determined that most or all of your false negatives are due to low Bayes 
> scores? 
> 

Again, how can I tell for sure ? My main lead here is that since I upgraded 
to 3.0.2, I also changed from owning the DB myself, as a regular user, to 
making it system-wide, owned and trained by a dedicated user. And since then, 
I went from a handful of false negatives a day to almost a hundred. At 
first, and this is where I may have assumed wrong, I thought well alright I 
have a brand new DB and it needs to be trained that's all. I gave it enough 
time and training, but it never got better. I still have way more FN than I 
used to. I've also recently (this week) added the SARE rules, and the results 
are not much better. 

Jean 



Re: sa-learn - bayes training...

Posted by Kevin Peuhkurinen <ke...@meridiancu.ca>.
Jean Caron wrote:

> Kevin, your assumption is correct, user accounts are on the server and 
> spamc is used. I already have the central DB setup using bayes_path in 
> local.cf.
> I think what you are saying confirms what I suspected, but it's still 
> not 100% clear. Even though I have a central DB, all users must train 
> it individually, is that it ?
> For example, if UserA populates the shared folders respectively with 
> ham and spam from messages he/she received, if UserB trains the 
> central DB against those msgs, it will have no effect for UserA ? All 
> users must individually train the central DB even though they train 
> using the same msgs from the same shared folders ?
> Sorry if I seem a little dense, but I think I'm getting it. I hope !
> Jean
>
If you have bayes_path set, then all users should be using just the one 
DB, and any training that one user does will affect the results for all 
other users.   So, presuming that the permissions on the Bayes files are 
set correctly so that all of your users have access to it, it would seem 
that you do have things set up properly.   

It is possible that the database is corrupt.    Have you in fact 
determined that most or all of your false negatives are due to low Bayes 
scores?

>

Re: sa-learn - bayes training...

Posted by Jean Caron <ca...@norac.net>.
Kevin, your assumption is correct, user accounts are on the server and spamc 
is used. I already have the central DB setup using bayes_path in local.cf. 

I think what you are saying confirms what I suspected, but it's still not 
100% clear. Even though I have a central DB, all users must train it 
individually, is that it ? 

For example, if UserA populates the shared folders respectively with ham and 
spam from messages he/she received, if UserB trains the central DB against 
those msgs, it will have no effect for UserA ? All users must individually 
train the central DB even though they train using the same msgs from the 
same shared folders ? 

Sorry if I seem a little dense, but I think I'm getting it. I hope !
Jean 


Kevin Peuhkurinen writes: 

> Jean Caron wrote: 
> 
>> Folks,
>> I searched the archive, tried different things, yet I need to ask a few 
>> questions.
>> I'm running SA 3.0.2 with Qmail/QQ 1.25, and procmail, on linux. Works 
>> great. Bayes auto-learns ok, I run sa-learn from a "dedicated" user every 
>> night for ham and spam. My logs show how many msgs were inspected and how 
>> many were learned. So far so good.
>> Here's the part I'm unsure of: I have one centralized bayes DB owned by 
>> this "dedicated" user. This user runs sa-learn against two shared 
>> folders, one for ham and one for spam. All users (only a handful) may 
>> populate the shared folders. Many thousands of msgs have gone through 
>> sa-learn. I thought this was all too easy...
>> My problem is that bayes does not seem to have any effect whatsoever on 
>> the amount of spam delivered to INBOXes. I still keep receiving these 
>> low-score spam msgs.
>> I now suspect this centralized DB, updated by this user alone, may not 
>> produce the expected results. I've read in the archive that individual 
>> users should run cron jobs against their own ham and spam folders. The 
>> issue with this is that only one user has an actual shell defined on the 
>> system, so the others can't run cron. Then again, that's just a suspicion; 
>> I may be wrong, and something else may be missing or misconfigured, which 
>> is why I'm posting this. I'm a little confused. I don't understand 
>> how bayes works exactly, so I can't come to any helpful conclusion about 
>> my setup.
>> Can anyone see through this and help me understand what is happening ?
>> Thanks in advance,
>> Jean 
>> 
> Jean,
> I'm not entirely sure based on the information you provided how spamd is 
> getting called, but I'm quite sure that your setup is not doing what you 
> expect it to.    I'm guessing since you say that you are using procmail 
> that you have user accounts set up on the server itself and that spamc is 
> being called as individual users from .forward files.    If this is the 
> case, then each user will have a .spamassassin/ directory in their home 
> which will contain their own personal Bayes database.   Your problem is 
> that you have one particular user who runs sa-learn, so only their Bayes 
> DB is being trained (other than through the auto-learning feature, that 
> is, which is  updating the individual databases).   
> 
> One easy option you can consider is the use of a global Bayes DB for all 
> your users instead of each of them having their own personal DB.   Bayes 
> tends to be less effective with global rather than personal databases, but 
> only if the individual users are able to do their own training.   You 
> could do this fairly easily by setting the "bayes_path" option in your 
> /etc/mail/spamassassin/local.cf file and having it point to the 
> .spamassassin/ directory of the user who is doing all the sa-learn training. 
> 
> Hope that helps.
> Kevin 
> 
 


Re: sa-learn - bayes training...

Posted by Kevin Peuhkurinen <ke...@meridiancu.ca>.
Jean Caron wrote:

> Folks,
> I searched the archive, tried different things, yet I need to ask a 
> few questions.
> I'm running SA 3.0.2 with Qmail/QQ 1.25, and procmail, on linux. Works 
> great. Bayes auto-learns ok, I run sa-learn from a "dedicated" user 
> every night for ham and spam. My logs show how many msgs were 
> inspected and how many were learned. So far so good.
> Here's the part I'm unsure of: I have one centralized bayes DB owned by 
> this "dedicated" user. This user runs sa-learn against two shared 
> folders, one for ham and one for spam. All users (only a handful) 
> may populate the shared folders. Many thousands of msgs have gone through 
> sa-learn. I thought this was all too easy...
> My problem is that bayes does not seem to have any effect whatsoever on 
> the amount of spam delivered to INBOXes. I still keep receiving these 
> low-score spam msgs.
> I now suspect this centralized DB, updated by this user alone, may not 
> produce the expected results. I've read in the archive that individual 
> users should run cron jobs against their own ham and spam folders. The 
> issue with this is that only one user has an actual shell defined on 
> the system, so the others can't run cron. Then again, that's just a 
> suspicion; I may be wrong, and something else may be missing or 
> misconfigured, which is why I'm posting this. I'm a little 
> confused. I don't understand how bayes works exactly, so I can't come 
> to any helpful conclusion about my setup.
> Can anyone see through this and help me understand what is happening ?
> Thanks in advance,
> Jean
>
Jean,
I'm not entirely sure based on the information you provided how spamd is 
getting called, but I'm quite sure that your setup is not doing what you 
expect it to.    I'm guessing since you say that you are using procmail 
that you have user accounts set up on the server itself and that spamc 
is being called as individual users from .forward files.    If this is 
the case, then each user will have a .spamassassin/ directory in their 
home which will contain their own personal Bayes database.   Your 
problem is that you have one particular user who runs sa-learn, so only 
their Bayes DB is being trained (other than through the auto-learning 
feature, that is, which is  updating the individual databases).  

One easy option you can consider is the use of a global Bayes DB for all 
your users instead of each of them having their own personal DB.   Bayes 
tends to be less effective with global rather than personal databases, 
but only if the individual users are able to do their own training.   
You could do this fairly easily by setting the "bayes_path" option in 
your /etc/mail/spamassassin/local.cf file and having it point to the 
.spamassassin/ directory of the user who is doing all the sa-learn training.
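
The difference between per-user and global databases can be modelled in a few lines of Python. This is only a sketch of the lookup described above, not SpamAssassin code: the function name and the user home directories are hypothetical, and the bayes_path value is the one quoted elsewhere in this thread. Note that bayes_path is normally a path plus filename prefix (the documented default is ~/.spamassassin/bayes), so it names more than just the directory.

```python
from pathlib import PurePosixPath
from typing import Optional


def bayes_db_prefix(user_home: str, bayes_path: Optional[str] = None) -> str:
    """Return the Bayes DB prefix a given user would read and write."""
    if bayes_path:
        # A bayes_path set in local.cf overrides the per-user location for everyone.
        return bayes_path
    # Otherwise each user falls back to a database under their own home directory.
    return str(PurePosixPath(user_home) / ".spamassassin" / "bayes")


homes = ["/home/usera", "/home/userb"]  # hypothetical accounts

# Without bayes_path: one database per user, so training by one user helps only that user.
print([bayes_db_prefix(home) for home in homes])

# With bayes_path: everyone shares the database that the dedicated user trains.
print([bayes_db_prefix(home, "/home/bayesUID/bayes") for home in homes])
```

In the shared case the database files must of course be readable and writable by every account that scans mail.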

Hope that helps.
Kevin