Posted to users@spamassassin.apache.org by Matt Kettler <mk...@verizon.net> on 2008/01/24 13:36:37 UTC

Re: Fired rules stats understanding

Sébastien AVELINE wrote:
> Hello,
>
> You will find my top rules fired with spamassassin.
> I run SpamAssassin on several boxes, each with its own bayes_db
> files, and I use razor, dcc_check, uribl, bayes .... We handle hundreds of
> thousands of messages per day.
> In my top rules for spam you will see a lot of "collaborative rules"
> like razor, uribl, dcc_check. I wonder why there aren't more heuristic
> and Bayesian rules in my top list. Do my stats seem
> "normal", or is there something wrong? Any suggestions are welcome.

It's really absurd that RDNS_NONE is firing off on 99.6% of email.

Do you not have RDNS for your own network, or is it generating invalid
Received: headers?

Ahh, yeah, it looks like your own network lacks RDNS:

Received: from unknown (HELO ?192.168.0.213?)
 (saveline@alinto.net@82.235.12.159) by smtpp.alinto.net with SMTP; Thu,
 24 Jan 2008 09:30:20 +0000


If you've got a local nameserver, you might want to generate an 
in-addr.arpa zone for the 192.168.0.* network to fix that.
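
As a rough illustration of what such a zone would be called, the reverse-DNS names can be derived with Python's standard ipaddress module. This is only a sketch: the helper names are made up for this example, and it assumes the network is exactly 192.168.0.0/24, as the 192.168.0.* wording suggests.

```python
import ipaddress


def reverse_zone_for_24(network: str) -> str:
    """Return the in-addr.arpa zone name for an IPv4 /24 network.

    Hypothetical helper for illustration; it only handles /24 networks.
    """
    net = ipaddress.ip_network(network)
    if net.version != 4 or net.prefixlen != 24:
        raise ValueError("this sketch only handles IPv4 /24 networks")
    # The reverse pointer of the network address looks like
    # "0.0.168.192.in-addr.arpa"; dropping the first label (the host
    # octet) leaves the name of the zone that covers the whole /24.
    return net.network_address.reverse_pointer.split(".", 1)[1]


def ptr_name(address: str) -> str:
    """Return the PTR record name that a resolver would look up."""
    return ipaddress.ip_address(address).reverse_pointer


if __name__ == "__main__":
    print(reverse_zone_for_24("192.168.0.0/24"))
    print(ptr_name("192.168.0.213"))
```

The first name is the zone the local nameserver would need to serve; the second is the record that has to exist in it for the host in the Received: header above to resolve.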

As for the Bayes rules, that doesn't surprise me. There are 10 different Bayes
rules, and while I'd expect that collectively they hit most of
your mail, it's not surprising that they're not ranking high
individually. It's a little surprising that BAYES_50 is doing so well compared to
BAYES_99; with the chi-squared combining I'd expect BAYES_99 to edge it
out slightly. Are you doing any manual training? What does your "sa-learn
--dump magic" output look like?


Re: Fired rules stats understanding

Posted by Sébastien AVELINE <sa...@alinto.net>.
Matt Kettler a écrit :
> Sébastien AVELINE wrote:
>> Hello,
>>
>> You will find my top rules fired with spamassassin.
>> I run SpamAssassin on several boxes, each with its own bayes_db
>> files, and I use razor, dcc_check, uribl, bayes .... We handle hundreds of
>> thousands of messages per day.
>> In my top rules for spam you will see a lot of "collaborative rules"
>> like razor, uribl, dcc_check. I wonder why there aren't more heuristic
>> and Bayesian rules in my top list. Do my stats seem
>> "normal", or is there something wrong? Any suggestions are welcome.
>
> It's really absurd that RDNS_NONE is firing off on 99.6% of email.
>
> Do you not have RDNS for your own network, or is it generating invalid
> Received: headers?
>
> Ahh, yeah, it looks like your own network lacks RDNS:
>
> Received: from unknown (HELO ?192.168.0.213?)
> (saveline@alinto.net@82.235.12.159) by smtpp.alinto.net with SMTP; Thu,
> 24 Jan 2008 09:30:20 +0000
>
>
> If you've got a local nameserver, you might want to generate an 
> in-addr.arpa zone for the 192.168.0.* network to fix that.
>
> As for the Bayes rules, that doesn't surprise me. There are 10 different Bayes
> rules, and while I'd expect that collectively they hit most of
> your mail, it's not surprising that they're not ranking high
> individually. It's a little surprising that BAYES_50 is doing so well compared to
> BAYES_99; with the chi-squared combining I'd expect BAYES_99 to edge
> it out slightly. Are you doing any manual training? What does your
> "sa-learn --dump magic" output look like?
>
The local address is from my office, where I submit my mail to my
mail servers. I don't think RDNS_NONE is the main worry. Unfortunately I
don't use sa-learn to feed my Bayes database; I rely on the high volume of
mail that comes into my servers.
Is it really effective to train Bayes manually?
Here is the output of sa-learn --dump magic:

0.000          0          3          0  non-token data: bayes db version
0.000          0    3803618          0  non-token data: nspam
0.000          0     862246          0  non-token data: nham
0.000          0     496111          0  non-token data: ntokens
0.000          0 1181735997          0  non-token data: oldest atime
0.000          0 1198170104          0  non-token data: newest atime
0.000          0 1181805393          0  non-token data: last journal sync atime
0.000          0 1181779437          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0     476160          0  non-token data: last expire reduction count
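
The figures above are easier to read once the atime values, which are Unix epoch seconds, are converted to dates and the spam and ham counts are compared. The following is a minimal sketch and not part of SpamAssassin: parse_magic is a hypothetical helper, and it assumes the value of each "non-token data" line sits in the third column, as it does in the output posted here.

```python
from datetime import datetime, timezone

# Subset of the dump posted above.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0    3803618          0  non-token data: nspam
0.000          0     862246          0  non-token data: nham
0.000          0     496111          0  non-token data: ntokens
0.000          0 1181735997          0  non-token data: oldest atime
0.000          0 1198170104          0  non-token data: newest atime
"""


def parse_magic(text: str) -> dict:
    """Map each 'non-token data' label to its integer value."""
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        # Assumption: the value is the third whitespace-separated column.
        values[label.strip()] = int(fields.split()[2])
    return values


magic = parse_magic(DUMP)
ratio = magic["nspam"] / magic["nham"]
oldest = datetime.fromtimestamp(magic["oldest atime"], tz=timezone.utc)
newest = datetime.fromtimestamp(magic["newest atime"], tz=timezone.utc)

if __name__ == "__main__":
    print(f"spam:ham ratio  {ratio:.2f}")
    print(f"oldest atime    {oldest.isoformat()}")
    print(f"newest atime    {newest.isoformat()}")
```

Comparing the newest atime with the current date gives a quick hint as to whether the database is still being updated.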