Posted to users@spamassassin.apache.org by jo3 <jo...@brats.com> on 2006/01/09 20:27:57 UTC

rules better than bayes?

Hi,

This is an observation, please take it in the spirit in which it is 
intended, it is not meant to be flame bait.

After using spamassassin for six solid months, it seems to me that the 
bayes process (sa-learn [--spam | --ham]) has only very limited success 
in learning about new spam.  Regardless of how many spams and hams are 
submitted, the effectiveness never goes above the default level which, 
in our case here, is somewhere around 2 out of 3 spams correctly 
identified.  By the same token, after adding the "third party" rule, 
airmax.cf, the effectiveness went up to 99 out of 100 spams correctly 
identified.

So far, we have not had a single ham misidentified as spam with over one 
million messages examined.

Throughout the documentation, there seems to be a bias toward the bayes 
filter rather than the rule system.  Does anyone on the list have some 
thoughts which would help to explain my observation as to why a single 
rule would appear so successful while a million spams and hams would 
have so little effect?

Thank you,
Jo3

Re: rules better than bayes?

Posted by mouss <us...@free.fr>.
Matt Kettler wrote:
> 
> 
> Realistically, I don't know why your hit rates are so low. They shouldn't be so
> bad that you're only detecting 2 or 3 out of every hundred.
> 
> You could have some configuration problems, but I can't tell as you've not told
> us anything about your system, just the problems you have.
> 
> Can you answer a few questions that might help us diagnose some of your problems:
> 
> What version of SA are you running?
> 
> Can you post an X-Spam-Status header for one of the false negatives?
> 
> Is any of your spam hitting ALL_TRUSTED?
> 
> What BAYES rules are these messages hitting before and after training?
> 
> Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?
> 

Also, a common error is to run SA as one user but train it as another.

Re: rules better than bayes?

Posted by Mike Jackson <mj...@barking-dog.net>.
> Do you recommend running airmax as a supplementary ruleset with 3.1.0?

This is just my humble opinion, but I don't know if that's a ruleset I would 
use in production for a multi-user server. A few of the rules use the 
"f-word" in the rule description line, so it would go out in a verbose 
report. The rules seem pretty random and unfocused, and scored based on gut 
instinct rather than rigorous testing. 


Re: rules better than bayes?

Posted by "M. Lewis" <ca...@cajuninc.com>.
Matthew Yette wrote:
>>>
>>
>>Correction, airmax.cf is not one single rule, it's one single FILE containing
>>211 rules. That's a significant difference, given that the stock spamassassin
>>3.1.0 has about 723 rules.
>>
>>Airmax has increased the number of rules in your system by 29.1%
>>
>>
>>
>>
>>
> 
> Do you recommend running airmax as a supplementary ruleset with 3.1.0?


There's an additional downside to airmax. It has excerpts from *lots* of 
SARE rules. If a SARE rule gets updated, will it be updated in airmax.cf?

YMMV,
M

-- 

  Overflow on /dev/null; please empty the bit bucket.
   14:50:01 up 1 day, 10:21,  5 users,  load average: 0.04, 0.14, 0.11

  Linux Registered User #241685  http://counter.li.org

Re: rules better than bayes?

Posted by Dhawal Doshy <dh...@netmagicsolutions.com>.
Robert Bartlett writes: 

> Ok, I confused myself. I'm sorry for being an idiot. I get it now. Every time
> an email comes in it tries to access it as the user; since bayes is being
> fed to just the root account, it doesn't see anything for the users in
> bayes. With the override I force it to use the root account for all emails
> coming in. Boy am I stupid.
> 
> Thanks
> Robert

Try this query to find the right value for bayes_sql_override_username.

SELECT id, username, spam_count, ham_count, token_count FROM bayes_vars; 
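
The same check can be scripted. What follows is only a rough sketch under assumptions: it uses Python's built-in sqlite3 with an in-memory table as a stand-in for the real MySQL database, the table holds just the columns named in the query above, and the two rows are made-up examples. The helper names bayes_users and pick_override_username are hypothetical. The idea is simply to list which username actually holds the trained tokens.

```python
import sqlite3


def bayes_users(conn):
    """Return (username, spam_count, ham_count, token_count) rows, largest first."""
    cur = conn.execute(
        "SELECT username, spam_count, ham_count, token_count "
        "FROM bayes_vars ORDER BY token_count DESC"
    )
    return cur.fetchall()


def pick_override_username(rows):
    """Pick the username holding the most tokens, or None if the table is empty."""
    return rows[0][0] if rows else None


# Stand-in data: an in-memory SQLite table with hypothetical rows.
# A real setup would connect to the MySQL database named in bayes_sql_dsn.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_vars (id INTEGER PRIMARY KEY, username TEXT, "
    "spam_count INTEGER, ham_count INTEGER, token_count INTEGER)"
)
conn.executemany(
    "INSERT INTO bayes_vars (username, spam_count, ham_count, token_count) "
    "VALUES (?, ?, ?, ?)",
    [("root", 5200, 4800, 131000), ("nobody", 0, 0, 0)],
)

rows = bayes_users(conn)
for row in rows:
    print(row)
print("candidate for bayes_sql_override_username:", pick_override_username(rows))
```

If the user that runs the scanner shows zero counts while another user holds all the tokens, that other username is the likely value for the override.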

 - dhawal 

> -----Original Message-----
> From: Robert Bartlett [mailto:robert@digitalphx.com] 
> Sent: Monday, January 09, 2006 1:52 PM
> To: users@spamassassin.apache.org
> Subject: RE: rules better than bayes? 
> 
> Sorry for the confusion, I do use a site wide bayes database, I thought the
> information I sent below was the site wide information the system uses to
> access the bayes database. 
> 
> Thanks
> Robert  
> 
> -----Original Message-----
> From: Matt Kettler [mailto:mkettler@evi-inc.com]
> Sent: Monday, January 09, 2006 1:47 PM
> To: Robert Bartlett
> Cc: users@spamassassin.apache.org
> Subject: Re: rules better than bayes? 
> 
> Robert Bartlett wrote:
>>  This is what I have in my local.cf file: 
>> 
>> bayes_store_module               Mail::SpamAssassin::BayesStore::SQL
>> bayes_sql_dsn                    DBI:mysql:**************:localhost:3306
>> bayes_sql_username               ************
>> bayes_sql_password               ************ 
>> 
>> Obviously I hid the data that I didn't want to show with *. When I run 
>> sa-learn it trains into the mysql database just fine, I assume SA 
>> connects to it just fine because of that.
>  
> 
> That's all the database login information. That doesn't mean you have a
> single sitewide bayes database. 
> 
> Again, I suggest looking at the  bayes_sql_override_username option. 
> 
> 
 


RE: rules better than bayes?

Posted by Robert Bartlett <ro...@digitalphx.com>.
Ok, I confused myself. I'm sorry for being an idiot. I get it now. Every time
an email comes in it tries to access it as the user; since bayes is being
fed to just the root account, it doesn't see anything for the users in
bayes. With the override I force it to use the root account for all emails
coming in. Boy am I stupid.

Thanks
Robert



-----Original Message-----
From: Robert Bartlett [mailto:robert@digitalphx.com] 
Sent: Monday, January 09, 2006 1:52 PM
To: users@spamassassin.apache.org
Subject: RE: rules better than bayes?

Sorry for the confusion, I do use a site wide bayes database, I thought the
information I sent below was the site wide information the system uses to
access the bayes database.

Thanks
Robert 

-----Original Message-----
From: Matt Kettler [mailto:mkettler@evi-inc.com]
Sent: Monday, January 09, 2006 1:47 PM
To: Robert Bartlett
Cc: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Robert Bartlett wrote:
>  This is what I have in my local.cf file:
> 
> bayes_store_module               Mail::SpamAssassin::BayesStore::SQL
> bayes_sql_dsn                    DBI:mysql:**************:localhost:3306
> bayes_sql_username               ************
> bayes_sql_password               ************
> 
> Obviously I hid the data that I didn't want to show with *. When I run 
> sa-learn it trains into the mysql database just fine, I assume SA 
> connects to it just fine because of that.


That's all the database login information. That doesn't mean you have a
single sitewide bayes database.

Again, I suggest looking at the  bayes_sql_override_username option.



RE: rules better than bayes?

Posted by Robert Bartlett <ro...@digitalphx.com>.
Sorry for the confusion, I do use a site wide bayes database, I thought the
information I sent below was the site wide information the system uses to
access the bayes database.

Thanks
Robert 

-----Original Message-----
From: Matt Kettler [mailto:mkettler@evi-inc.com] 
Sent: Monday, January 09, 2006 1:47 PM
To: Robert Bartlett
Cc: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Robert Bartlett wrote:
>  This is what I have in my local.cf file:
> 
> bayes_store_module               Mail::SpamAssassin::BayesStore::SQL
> bayes_sql_dsn                    DBI:mysql:**************:localhost:3306
> bayes_sql_username               ************
> bayes_sql_password               ************
> 
> Obviously I hid the data that I didn't want to show with *. When I run 
> sa-learn it trains into the mysql database just fine, I assume SA 
> connects to it just fine because of that.


That's all the database login information. That doesn't mean you have a
single sitewide bayes database.

Again, I suggest looking at the  bayes_sql_override_username option.


Re: rules better than bayes?

Posted by Matt Kettler <mk...@evi-inc.com>.
Robert Bartlett wrote:
>  This is what I have in my local.cf file:
> 
> bayes_store_module               Mail::SpamAssassin::BayesStore::SQL
> bayes_sql_dsn                    DBI:mysql:**************:localhost:3306
> bayes_sql_username               ************
> bayes_sql_password               ************
> 
> Obviously I hid the data that I didn't want to show with *. When I run
> sa-learn it trains into the mysql database just fine, I assume SA connects
> to it just fine because of that.


That's all the database login information. That doesn't mean you have a single
sitewide bayes database.

Again, I suggest looking at the  bayes_sql_override_username option.

RE: rules better than bayes?

Posted by Robert Bartlett <ro...@digitalphx.com>.
 This is what I have in my local.cf file:

bayes_store_module               Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn                    DBI:mysql:**************:localhost:3306
bayes_sql_username               ************
bayes_sql_password               ************

Obviously I hid the data that I didn't want to show with *. When I run
sa-learn it trains into the mysql database just fine, I assume SA connects
to it just fine because of that.

Robert

-----Original Message-----
From: Matt Kettler [mailto:mkettler@evi-inc.com] 
Sent: Monday, January 09, 2006 1:32 PM
To: Robert Bartlett
Cc: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Robert Bartlett wrote:
> Interesting, I did that just to see how mine were doing and the BAYES 
> one returned 0? Does that mean bayes is not being used? I have been 
> feeding emails to bayes and in debug mode it shows bayes being used. I 
> am using bayes in MySQL. Just weird that it's showing 0.
> 

That sounds a lot like you're training bayes into mysql, but when mail comes
in and gets scanned, it's either not using SQL, or it's not using the same
table.

Usually this is a problem with username, where your training is occurring as
"root" but your scanning is occurring as "nobody".

You might want to try using the bayes_sql_override_username option, to force
a single site-wide bayes database, instead of having one per userid
executing SA.
(note: that's per userid EXECUTING SA.. not per email recipient.)





Re: rules better than bayes?

Posted by Matt Kettler <mk...@evi-inc.com>.
Robert Bartlett wrote:
> Interesting, I did that just to see how mine were doing and the BAYES one
> returned 0? Does that mean bayes is not being used? I have been feeding
> emails to bayes and in debug mode it shows bayes being used. I am using
> bayes in MySQL. Just weird that it's showing 0.
> 

That sounds a lot like you're training bayes into mysql, but when mail comes in
and gets scanned, it's either not using SQL, or it's not using the same table.

Usually this is a problem with username, where your training is occurring as
"root" but your scanning is occurring as "nobody".

You might want to try using the bayes_sql_override_username option, to force a
single site-wide bayes database, instead of having one per userid executing SA.
(note: that's per userid EXECUTING SA.. not per email recipient.)
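
To make the username mismatch concrete, here is a toy model in Python. It is not SpamAssassin's storage code; ToyBayesStore and its methods are hypothetical names, and the token words are invented. It only illustrates the behaviour described above: tokens are stored under the userid executing SA unless an override forces one shared name.

```python
class ToyBayesStore:
    """Toy per-username token store; NOT SpamAssassin's real storage code."""

    def __init__(self, override_username=None):
        # When set, every lookup and every training run uses this one name.
        self.override_username = override_username
        self.tokens = {}  # username -> set of learned tokens

    def _key(self, username):
        return self.override_username or username

    def learn(self, username, words):
        self.tokens.setdefault(self._key(username), set()).update(words)

    def known(self, username, words):
        seen = self.tokens.get(self._key(username), set())
        return [w for w in words if w in seen]


# Without an override: trained as "root", scanned as "nobody".
per_user = ToyBayesStore()
per_user.learn("root", ["mortgage", "pills"])
print(per_user.known("nobody", ["pills", "hello"]))   # nothing learned for "nobody"

# With an override: both training and scanning share one name.
sitewide = ToyBayesStore(override_username="sitewide")
sitewide.learn("root", ["mortgage", "pills"])
print(sitewide.known("nobody", ["pills", "hello"]))   # the trained token is visible
```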




RE: rules better than bayes?

Posted by Robert Bartlett <ro...@digitalphx.com>.
Interesting, I did that just to see how mine were doing and the BAYES one
returned 0? Does that mean bayes is not being used? I have been feeding
emails to bayes and in debug mode it shows bayes being used. I am using
bayes in MySQL. Just weird that it's showing 0.

Robert

-----Original Message-----
From: Matt Kettler [mailto:mkettler@evi-inc.com] 
Sent: Monday, January 09, 2006 1:05 PM
To: Matthew Yette
Cc: users@spamassassin.apache.org
Subject: Re: rules better than bayes?

Matthew Yette wrote:
> 
> Do you recommend running airmax as a supplementary ruleset with 3.1.0?

I personally have no recommendations on it.. I've never run it.

I personally like SARE's specific, evilnumbers, random and adult rulesets.


Here are some quick greps for hit rates on some SARE rules I use (no
declarations about FPs vs real spam hits, but none of these sets have
caused me any problems so far):

70_sare_evilnum0.cf & 70_sare_evilnum1.cf:
  grep SARE_EN_ /var/log/maillog |wc -l
    301
70_sare_specific.cf:
  grep SARE_SPEC_ /var/log/maillog |wc -l
     60
70_sare_genlsubj0.cf:
  grep SARE_SUB /var/log/maillog |wc -l
     44
70_sare_adult.cf:
  grep SARE_ADLT /var/log/maillog |wc -l
     31
70_sare_uri0.cf:
  grep SARE_URI_ /var/log/maillog |wc -l
     10
70_sare_random.cf:
  grep SARE_RAND_ /var/log/maillog |wc -l
      1


I also strongly recommend enabling SA's URIBL support, and adding on a .cf
file to get uribl.com's list added in (default SA only uses surbl.org lists)

  grep URIBL_BLACK /var/log/maillog |wc -l
   2214

  grep _SURBL /var/log/maillog |wc -l
   2144

And of course I get great results from bayes:
  grep BAYES_99 /var/log/maillog |wc -l
   2190

Ditto DCC and Razor2:
 grep RAZOR2_CHECK /var/log/maillog |wc -l
   2114
 grep DCC_CHECK /var/log/maillog |wc -l
   1833


Re: rules better than bayes?

Posted by Matt Kettler <mk...@evi-inc.com>.
Matthew Yette wrote:
> 
> Do you recommend running airmax as a supplementary ruleset with 3.1.0?

I personally have no recommendations on it.. I've never run it.

I personally like SARE's specific, evilnumbers, random and adult rulesets.


Here are some quick greps for hit rates on some SARE rules I use (no declarations
about FPs vs real spam hits, but none of these sets have caused me any problems
so far):

70_sare_evilnum0.cf & 70_sare_evilnum1.cf:
  grep SARE_EN_ /var/log/maillog |wc -l
    301
70_sare_specific.cf:
  grep SARE_SPEC_ /var/log/maillog |wc -l
     60
70_sare_genlsubj0.cf:
  grep SARE_SUB /var/log/maillog |wc -l
     44
70_sare_adult.cf:
  grep SARE_ADLT /var/log/maillog |wc -l
     31
70_sare_uri0.cf:
  grep SARE_URI_ /var/log/maillog |wc -l
     10
70_sare_random.cf:
  grep SARE_RAND_ /var/log/maillog |wc -l
      1


I also strongly recommend enabling SA's URIBL support, and adding on a .cf file
to get uribl.com's list added in (default SA only uses surbl.org lists)

  grep URIBL_BLACK /var/log/maillog |wc -l
   2214

  grep _SURBL /var/log/maillog |wc -l
   2144

And of course I get great results from bayes:
  grep BAYES_99 /var/log/maillog |wc -l
   2190

Ditto DCC and Razor2:
 grep RAZOR2_CHECK /var/log/maillog |wc -l
   2114
 grep DCC_CHECK /var/log/maillog |wc -l
   1833
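
The separate grep runs above can be folded into one pass over the log. This is a minimal sketch, assuming plain substring matching per line (which gives the same count as `grep PATTERN file | wc -l` for these patterns, since none contain regex metacharacters). The function name count_rule_hits is hypothetical, and the three sample log lines are invented for illustration, not taken from a real maillog.

```python
from collections import Counter

# Patterns taken from the grep commands above.
RULE_PATTERNS = [
    "SARE_EN_", "SARE_SPEC_", "SARE_SUB", "SARE_ADLT", "SARE_URI_",
    "SARE_RAND_", "URIBL_BLACK", "_SURBL", "BAYES_99",
    "RAZOR2_CHECK", "DCC_CHECK",
]


def count_rule_hits(lines, patterns=RULE_PATTERNS):
    """Count, per pattern, how many lines contain it (like grep | wc -l)."""
    counts = Counter({p: 0 for p in patterns})
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


# Hypothetical log lines; SARE_EN_EXAMPLE is a made-up rule name.
sample = [
    "spamd: result: Y 12 - BAYES_99,URIBL_BLACK,URIBL_JP_SURBL scantime=1.2",
    "spamd: result: Y 8 - BAYES_99,RAZOR2_CHECK,SARE_EN_EXAMPLE scantime=0.9",
    "spamd: result: . 0 - BAYES_00 scantime=0.4",
]
hits = count_rule_hits(sample)
for pattern in RULE_PATTERNS:
    print(f"{pattern:14s} {hits[pattern]}")

# Against a real log (path as used in the greps above):
#   with open("/var/log/maillog", errors="replace") as f:
#       hits = count_rule_hits(f)
```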

Re: rules better than bayes?

Posted by Matthew Yette <my...@mapolce.com>.


On 1/9/06 2:43 PM, "Matt Kettler" <mk...@evi-inc.com> wrote:

> jo3 wrote:
>> Hi,
>> 
>> This is an observation, please take it in the spirit in which it is
>> intended, it is not meant to be flame bait.
>> 
>> After using spamassassin for six solid months, it seems to me that the
>> bayes process (sa-learn [--spam | --ham]) has only very limited success
>> in learning about new spam.  Regardless of how many spams and hams are
>> submitted, the effectiveness never goes above the default level which,
>> in our case here, is somewhere around 2 out of 3 spams correctly
>> identified.  By the same token, after adding the "third party" rule,
>> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
>> identified.
> 
> 
> Realistically, I don't know why your hit rates are so low. They shouldn't be
> so
> bad that you're only detecting 2 or 3 out of every hundred.
> 
> You could have some configuration problems, but I can't tell as you've not
> told
> us anything about your system, just the problems you have.
> 
> Can you answer a few questions that might help us diagnose some of your
> problems:
> 
> What version of SA are you running?
> 
> Can you post an X-Spam-Status header for one of the false negatives?
> 
> Is any of your spam hitting ALL_TRUSTED?
> 
> What BAYES rules are these messages hitting before and after training?
> 
> Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?
> 
> 
>> 
>> So far, we have not had a single ham misidentified as spam with over one
>> million messages examined.
>> 
>> Throughout the documentation, there seems to be a bias toward the bayes
>> filter rather than the rule system.  Does anyone on the list have some
>> thoughts which would help to explain my observation as to why a single
>> rule would appear so successful while a million spams and hams would
>> have so little effect?
>> 
> 
> Correction, airmax.cf is not one single rule, it's one single FILE containing
> 211 rules. That's a significant difference, given that the stock spamassassin
> 3.1.0 has about 723 rules.
> 
> Airmax has increased the number of rules in your system by 29.1%
> 
> 
> 
> 
> 
Do you recommend running airmax as a supplementary ruleset with 3.1.0?
-- 
Matthew Yette
Senior Engineer (NOC/Operations)
M.A. Polce Consulting
315-838-1644


Re: rules better than bayes?

Posted by Matt Kettler <mk...@evi-inc.com>.
jo3 wrote:
> Hi,
> 
> This is an observation, please take it in the spirit in which it is
> intended, it is not meant to be flame bait.
> 
> After using spamassassin for six solid months, it seems to me that the
> bayes process (sa-learn [--spam | --ham]) has only very limited success
> in learning about new spam.  Regardless of how many spams and hams are
> submitted, the effectiveness never goes above the default level which,
> in our case here, is somewhere around 2 out of 3 spams correctly
> identified.  By the same token, after adding the "third party" rule,
> airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
> identified.


Realistically, I don't know why your hit rates are so low. They shouldn't be so
bad that you're only detecting 2 or 3 out of every hundred.

You could have some configuration problems, but I can't tell as you've not told
us anything about your system, just the problems you have.

Can you answer a few questions that might help us diagnose some of your problems:

What version of SA are you running?

Can you post an X-Spam-Status header for one of the false negatives?

Is any of your spam hitting ALL_TRUSTED?

What BAYES rules are these messages hitting before and after training?

Do you use any network checks (URIBLs, RBLs, DCC, Razor, Pyzor, SPF)?
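
For the BAYES question in the list above, pulling the rule names out of a header can be scripted. This is a rough sketch that assumes the usual SA 3.x layout, where the header carries a comma-separated `tests=` list that may be folded across lines after a comma. The function name bayes_rules is hypothetical and the sample header is made up, not taken from anyone's mail.

```python
import re


def bayes_rules(status_header):
    """Return the BAYES_* test names listed in an X-Spam-Status header value."""
    # Folded headers break the tests list after a comma; rejoin those pieces.
    unfolded = re.sub(r",\s+", ",", status_header)
    match = re.search(r"\btests=(\S*)", unfolded)
    if not match:
        return []
    return [t for t in match.group(1).split(",") if t.startswith("BAYES_")]


# Hypothetical header value for a false negative.
sample_header = (
    "No, score=2.1 required=5.0 tests=BAYES_00,HTML_MESSAGE,\n"
    "\tRCVD_IN_SORBS_DUL autolearn=no version=3.1.0"
)
print(bayes_rules(sample_header))
```

A spam message that lists BAYES_00, or no BAYES_* rule at all, is the kind of result worth posting to the list.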


> 
> So far, we have not had a single ham misidentified as spam with over one
> million messages examined.
> 
> Throughout the documentation, there seems to be a bias toward the bayes
> filter rather than the rule system.  Does anyone on the list have some
> thoughts which would help to explain my observation as to why a single
> rule would appear so successful while a million spams and hams would
> have so little effect?
> 

Correction, airmax.cf is not one single rule, it's one single FILE containing
211 rules. That's a significant difference, given that the stock spamassassin
3.1.0 has about 723 rules.

Airmax has increased the number of rules in your system by 29.1%






Re: rules better than bayes?

Posted by qqqq <qq...@usermail.com>.
I have since taken bayes out, as I get WAY better results without it.  The reason is that I get too many spam
mailings that poison the db, and I end up with a lot of spam that shows up as BAYES_00.  I use all the network tests, but I get
a lot of spam that has not been added yet.

QQQQ

----- Original Message ----- 
From: "jo3" <jo...@brats.com>
To: <us...@spamassassin.apache.org>
Sent: Monday, January 09, 2006 12:27 PM
Subject: rules better than bayes?


| Hi,
|
| This is an observation, please take it in the spirit in which it is
| intended, it is not meant to be flame bait.
|
| After using spamassassin for six solid months, it seems to me that the
| bayes process (sa-learn [--spam | --ham]) has only very limited success
| in learning about new spam.  Regardless of how many spams and hams are
| submitted, the effectiveness never goes above the default level which,
| in our case here, is somewhere around 2 out of 3 spams correctly
| identified.  By the same token, after adding the "third party" rule,
| airmax.cf, the effectiveness went up to 99 out of 100 spams correctly
| identified.
|
| So far, we have not had a single ham misidentified as spam with over one
| million messages examined.
|
| Throughout the documentation, there seems to be a bias toward the bayes
| filter rather than the rule system.  Does anyone on the list have some
| thoughts which would help to explain my observation as to why a single
| rule would appear so successful while a million spams and hams would
| have so little effect?
|
| Thank you,
| Jo3
|
|