Posted to users@spamassassin.apache.org by Ronan <r....@qub.ac.uk> on 2004/11/23 16:14:18 UTC

selected rulesets for better performance

I'm running 3.0.1 with the SURIBLS,
but I'm starting to get this load-related error:

spam acl condition: spamd connection to 127.0.0.1, port 783 failed: 
Connection timed out

Which of the following could I cut back on, or does it depend on which 
types of spam our site is getting?

70_sare_adult.cf
70_sare_bayes_poison_nxm.cf
70_sare_genlsubj0.cf
70_sare_header0.cf
70_sare_html0.cf
70_sare_oem.cf
70_sare_random.cf
70_sare_specific.cf
70_sare_spoof.cf
70_sare_unsub.cf
70_sare_uri.cf
72_sare_bml_post25x.cf
72_sare_redirect_post3.0.0.cf
99_sare_fraud_post25x.cf
chickenpox.cf
evilnumbers.cf
init.pre
local.cf

Are any of the above redundant in 3.0.1, and is there a list somewhere 
of the rulesets that are made redundant by subsequent versions of 
SA? That might be helpful.


thanks

ronan

Re: selected rulesets for better performance

Posted by Michael Barnes <mb...@compsci.wm.edu>.

On Tue, Nov 23, 2004 at 03:14:18PM +0000, Ronan wrote:
> 70_sare_adult.cf
> 70_sare_bayes_poison_nxm.cf
> 70_sare_genlsubj0.cf
> 70_sare_header0.cf
> 70_sare_html0.cf
> 70_sare_oem.cf
> 70_sare_random.cf
> 70_sare_specific.cf
> 70_sare_spoof.cf
> 70_sare_unsub.cf
> 70_sare_uri.cf
> 72_sare_bml_post25x.cf
> 72_sare_redirect_post3.0.0.cf
> 99_sare_fraud_post25x.cf
> chickenpox.cf
> evilnumbers.cf
> init.pre
> local.cf

I run SA 3.0.1 with one extra URIBL, one extra custom module,
70_sare_header.cf and 70_sare_random.cf, and some local spam and ham rules.
I haven't had a spam score below 10 (my threshold) in a while, and
no false positives.  Look at my SA data for the past 2.5 to 3 days:

 SC      #  frequency
 -6 (   19) **********
 -4 (   72) ****************************************
 -2 (  124) **********************************************************************
  0 (   36) ********************
  2 (    4) **
  4 (    6) ***
  6 (    1) 
  8 (    1) 
 10 (    0) 
 12 (    0) 
 14 (    2) *
 16 (    3) *
 18 (   13) *******
 20 (    8) ****
 22 (   14) *******
 24 (    9) *****
 26 (    4) **
 28 (    6) ***
 30 (   51) ****************************
Got   110 spams mean   =   31.15 stddev =   11.15 min    =   14.50 max =   65.20
Got   263  hams mean   =   -2.48 stddev =    2.53 min    =  -12.60 max =    9.90
Got   373  alls mean   =    7.44 stddev =   16.64 min    =  -12.60 max =   65.20
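
A summary like the one above can be rebuilt from a list of message scores. The following is a minimal sketch, not the script that produced Mike's output: the function names are hypothetical, and the bucket width of 2, the clamping of outliers into the -6 and 30 rows, the 70-character bar width, and the use of the population standard deviation are assumptions read off the table rather than confirmed details.

```python
import math
import statistics


def bucket(score, width=2, low=-6, high=30):
    """Round a score down to a multiple of `width`, clamped to [low, high]."""
    b = math.floor(score / width) * width
    return max(low, min(high, b))


def histogram(scores, width=2, low=-6, high=30, bar_width=70):
    """Return one text line per bucket: bucket value, count, scaled bar."""
    counts = {b: 0 for b in range(low, high + 1, width)}
    for s in scores:
        counts[bucket(s, width, low, high)] += 1
    peak = max(counts.values()) or 1  # avoid dividing by zero on empty input
    lines = []
    for b, n in counts.items():
        bar = "*" * (n * bar_width // peak)  # longest bar is bar_width stars
        lines.append(f"{b:3d} ({n:5d}) {bar}")
    return lines


def summary(label, scores):
    """Return a 'Got N label mean stddev min max' style line."""
    return (
        f"Got {len(scores):5d} {label:>5} "
        f"mean = {statistics.mean(scores):7.2f} "
        f"stddev = {statistics.pstdev(scores):7.2f} "  # population stddev (assumed)
        f"min = {min(scores):7.2f} max = {max(scores):7.2f}"
    )


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from this thread.
    ham = [-3.1, -2.4, -1.9, 0.3, 4.2]
    spam = [14.5, 22.0, 31.7, 65.2]
    print("\n".join(histogram(ham + spam)))
    print(summary("spams", spam))
    print(summary("hams", ham))
```

Clamping keeps very high-scoring spam from stretching the table: everything at 30 or above lands in the last row.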


My setup works well with users that have their bayes db initialized
and for those that do not, but it works a little better for those with
bayes.

I don't think that evilnumbers.cf or chickenpox.cf will give you
anything in 3.x.

Mike

-- 
/-----------------------------------------\
| Michael Barnes <mb...@compsci.wm.edu> |
| UNIX Systems Administrator              |
| College of William and Mary             |
| Phone: (757) 879-3930                   |
\-----------------------------------------/

Re: selected rulesets for better performance

Posted by Theo Van Dinter <fe...@kluge.net>.

On Wed, Nov 24, 2004 at 01:19:49AM -0500, Matt Kettler wrote:
> Quite frankly, I suspect corpus pollution. It really only takes 1 high 
> scoring spam in the nonspam corpus to really screw up the message scores.

That's quite possible.  I don't think anyone has a 100% non-polluted corpus,
though try we might. :(

> 1) DRUGS_PAIN_OBFU actually hit some nonspam? I find that odd, but it could 
> be a typo.

Looking at the submitted results:

dave.log:. /home/dave/corpus/cooked-ham.43366468
jm.log:. /home/jm/Mail/deld.priv/34675
jm.log:. /home/jm/Mail/deld.priv/34682
jm.log:. /home/jm/Mail/deld.priv/34699
jm.log:. /home/jm/Mail/deld.priv/34703
quinlan.log:. /home/corpus/mail/ham/166370
quinlan.log:. /home/corpus/mail/ham/166400
quinlan.log:. /home/corpus/mail/ham/166430
quinlan.log:. /home/corpus/mail/ham/166437

> 2) DRUGS_SMEAR1 hit some nonspam? I find that damn near impossible. I don't 
> think any nonspam email other than one quoting spam will ever hit that 
> rule. It seems there's one drug spam, or drug spam quote in somebody's 
> corpus, and it was run in all 4 sets. (If anyone can show me the nonspam 
> matching that rule and it's not spam or a spam quote or discussion of SA's 
> rules, I'll send em $20. Really.)

jm.log:. /home/jm/Mail/deld.priv/26352

> 4) NIGERIAN_BODY3? could be a finance newsletter, but very unlikely.

That was mine:

theo.log:Y ham/misc200405-200407.33861588

Unfortunately I took those misc ham mboxes and converted them to dir
format a while ago, so I don't know what message that was.

> 6) PERCENT_RANDOM? Very unlikely. What would have %rnd_x in it?

jm.log:. /home/jm/Mail/deld.pub/12701

-- 
Randomly Generated Tagline:
Choosy modemers choose GIF.

Re[2]: selected rulesets for better performance

Posted by Matt Kettler <mk...@evi-inc.com>.

At 12:16 AM 11/24/2004, Robert Menschel wrote:
>Which brings up another point which has been mentioned on the list
>before -- the BAYES_99 score is too low for well-trained systems.
>
>I have never seen a BAYES_99 hit on any non-spam.

Yeah, it's kind of suspect.. take a look at the STATISTICS.txt data for 
set3 and set2.

Notice that in set3 the nonspam hit rate is quite low, but it's 10x higher 
than in set2 as a percentage of the total nonspam corpus...

Quite frankly, I suspect corpus pollution. It really only takes 1 high 
scoring spam in the nonspam corpus to really screw up the message scores.

Things in general I find suspect about the STATISTICS-set*.txt files for 3.x:

1) DRUGS_PAIN_OBFU actually hit some nonspam? I find that odd, but it could 
be a typo.

2) DRUGS_SMEAR1 hit some nonspam? I find that damn near impossible. I don't 
think any nonspam email other than one quoting spam will ever hit that 
rule. It seems there's one drug spam, or drug spam quote in somebody's 
corpus, and it was run in all 4 sets. (If anyone can show me the nonspam 
matching that rule and it's not spam or a spam quote or discussion of SA's 
rules, I'll send em $20. Really.)

3) Hugely better bayes performance in set2 compared to set3. Factor of 10 
difference in FP rate for BAYES_90 and higher. Admittedly overall hits are 
up, but not that much..

# grep BAYES_9 STATISTICS-set2.txt
35.784  73.4212   0.0034    1.000   0.98    4.07  BAYES_99
1.483   3.0402   0.0030    0.999   0.87    3.61  BAYES_90
1.173   2.4030   0.0030    0.999   0.85    3.51  BAYES_95

# grep BAYES_9 STATISTICS-set3.txt
43.515  89.3888   0.0335    1.000   0.83    1.89  BAYES_99
0.805   1.6326   0.0202    0.988   0.70    2.06  BAYES_95
0.913   1.8399   0.0343    0.982   0.64    2.09  BAYES_90

4) NIGERIAN_BODY3? could be a finance newsletter, but very unlikely.

5) HARDCORE_PORN? hmmm.. possible.. Unlikely, but "extreme hardcore gaming" 
would match it.

6) PERCENT_RANDOM? Very unlikely. What would have %rnd_x in it?
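
The "factor of 10" in item 3 can be checked directly from the grep output quoted there. This is a minimal sketch, assuming the STATISTICS columns are OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE, NAME, so that the third column is the percentage of nonspam hit by the rule; the helper names are hypothetical.

```python
# The two grep results quoted in item 3, copied verbatim.
SET2 = """\
35.784  73.4212   0.0034    1.000   0.98    4.07  BAYES_99
1.483   3.0402   0.0030    0.999   0.87    3.61  BAYES_90
1.173   2.4030   0.0030    0.999   0.85    3.51  BAYES_95
"""

SET3 = """\
43.515  89.3888   0.0335    1.000   0.83    1.89  BAYES_99
0.805   1.6326   0.0202    0.988   0.70    2.06  BAYES_95
0.913   1.8399   0.0343    0.982   0.64    2.09  BAYES_90
"""


def ham_rates(text):
    """Map rule name -> HAM% (assumed to be the third column)."""
    rates = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 7:
            rates[fields[6]] = float(fields[2])
    return rates


def ham_rate_ratios(before, after):
    """Ratio of nonspam hit rates (after / before) for each shared rule."""
    old, new = ham_rates(before), ham_rates(after)
    return {name: new[name] / old[name] for name in old if name in new}


if __name__ == "__main__":
    for name, ratio in sorted(ham_rate_ratios(SET2, SET3).items()):
        print(f"{name}: set3 nonspam hit rate is {ratio:.1f}x set2")
```

Under that column assumption the ratios come out between roughly 7x and 11x depending on the rule, which is in line with the round figure of 10.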

Re[2]: selected rulesets for better performance

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Matt,

Tuesday, November 23, 2004, 7:32:05 PM, you wrote:

MK> At 09:51 PM 11/23/2004, Robert Menschel wrote:
>>R> 70_sare_bayes_poison_nxm.cf
>>I personally don't use this -- I personally verify 75%+ of all mail
>>that goes through SA's analysis on three domains, and I feed 100% of
>>that mail (excepting lists like this) into SA-Learn. IMO there is no
>>bayes poison, only bayes fodder. I expect the rule set would be useful
>>for those with less comprehensive training. Also, since you don't
>>mention Bayes above, if you /don't/ run Bayes, this rules file can be
>>very useful.

MK> I agree totally on the concept of poison in terms of training.
MK> There is no bayes poison, only fodder. 

MK> However, I would also agree that detecting lame attempts to poison
MK> bayes is a good spam sign. With SA 3.0's weak bayes scores in set3
MK> (1.886 for BAYES_99), this can help even a system with a well
MK> trained bayes DB.

Which brings up another point which has been mentioned on the list
before -- the BAYES_99 score is too low for well-trained systems.

I have never seen a BAYES_99 hit on any non-spam. I run with BAYES_99
at my spam threshold (9), and BAYES_95 at 75% of that threshold.
Either it hasn't happened yet, or it has happened only on non-spam
where my negative-scoring rules brought the scores down enough to be
treated as ham.

The distributed score is probably good for a system which is not
manually trained, or poorly trained, or mistrained. However, when
admins take the care to train their Bayes system properly, IMO that
score can and should be raised.
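
The scheme described above is simple arithmetic on the spam threshold. As a minimal sketch, the helper below prints `score` lines of the kind that go in local.cf; the function name is hypothetical, and the 0.75 default comes from the "75% of that threshold" figure in this post, not from any SA default.

```python
def bayes_overrides(threshold, bayes95_fraction=0.75):
    """Return local.cf-style score lines: BAYES_99 at the spam threshold,
    BAYES_95 at a fraction of it."""
    return [
        f"score BAYES_99 {threshold:.3f}",
        f"score BAYES_95 {threshold * bayes95_fraction:.3f}",
    ]


if __name__ == "__main__":
    # Threshold of 9, as in the post.
    print("\n".join(bayes_overrides(9)))
```

With BAYES_99 at the threshold, that one rule is enough to tag a message as spam unless negative-scoring rules pull it back down, so this only makes sense for a carefully trained Bayes database.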

There are other score adjustments that probably should be documented
and shared within the SA community. I once posted most of my score
mods on the exit0.us wiki.

Should we maybe develop a section of the SA wiki dedicated to score
mods and other mods specific to rules?

Bob Menschel

Re: selected rulesets for better performance

Posted by Matt Kettler <mk...@evi-inc.com>.

At 09:51 PM 11/23/2004, Robert Menschel wrote:
>R> 70_sare_bayes_poison_nxm.cf
>I personally don't use this -- I personally verify 75%+ of all mail
>that goes through SA's analysis on three domains, and I feed 100% of
>that mail (excepting lists like this) into SA-Learn. IMO there is no
>bayes poison, only bayes fodder. I expect the rule set would be useful
>for those with less comprehensive training. Also, since you don't
>mention Bayes above, if you /don't/ run Bayes, this rules file can be
>very useful.

I agree totally on the concept of poison in terms of training. There is no 
bayes poison, only fodder.

However, I would also agree that detecting lame attempts to poison bayes is 
a good spam sign. With SA 3.0's weak bayes scores in set3 (1.886 for 
BAYES_99), this can help even a system with a well trained bayes DB.

>You say you're running with SURIBLs. Are you also running with other
>network tests? All standard network tests are good aids to SA scoring,
>but they can contribute to a timeout problem, since they need to wait
>for that other system somewhere on the network to respond.

Agreed. Be sure to run spamassassin --lint -D to see if any are getting wedged.

This is especially true for pyzor, razor, and DCC. They have fixed 10 
second timeouts on them. I use DCC and razor. Between the two, razor is 
generally slower and more prone to outages. (I just ran tests on 3 
messages. Razor took a bit over twice as long for each message. But then 
again, razor is more complex, as it does multiple hashes: the e4 
partial-body-text hash and e8 hashing of URLs. Still, if speed is your 
goal, razor might not be for you.)

DNSBL outages generally aren't as much of a problem, as SA 2.6x and higher 
have a good adaptive DNS timeout that makes the "one dead list" not so much 
of a problem. Once it gets a bunch of responses it shortens the timeout. If 
all the lists except one come back in the first second, the remaining list 
is given 3 seconds and then SA bails on it for a total time of 4 seconds. 
This can be a bit of a slowdown, but it's not as bad as 10 second razor 
timeouts.
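
To put numbers on that comparison, here is a toy model of the behaviour described above. It is not SpamAssassin's timeout code: the function names and the single-straggler simplification are assumptions, and the 3-second grace period and 10-second fixed timeout are the figures given in this message.

```python
def dnsbl_wait(answered_times, straggler_time=None, grace=3.0):
    """Seconds spent waiting on DNSBLs under the simplified model.

    answered_times: response times of the lists that answered promptly.
    straggler_time: response time of the one slow list, or None if it is dead.
    Once the prompt lists are in, the straggler gets `grace` more seconds.
    """
    last_answer = max(answered_times, default=0.0)
    deadline = last_answer + grace
    if straggler_time is not None and straggler_time <= deadline:
        return max(last_answer, straggler_time)
    return deadline  # give up on the straggler


def fixed_wait(response_time=None, timeout=10.0):
    """Seconds spent waiting on a check with a fixed timeout."""
    if response_time is not None and response_time <= timeout:
        return response_time
    return timeout


if __name__ == "__main__":
    # Every list but one answers within the first second; the last is dead.
    print("dead DNSBL:", dnsbl_wait([0.2, 0.5, 1.0]), "seconds")
    # A checksum service that never answers.
    print("dead fixed-timeout check:", fixed_wait(), "seconds")
```

In this model one dead DNSBL costs about 4 seconds per message, while one unresponsive fixed-timeout check costs the full 10.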

Re: selected rulesets for better performance

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Ronan,

Tuesday, November 23, 2004, 7:14:18 AM, you wrote:

R> I'm running 3.0.1 with the SURIBLS,
R> but I'm starting to get this load-related error:
R> spam acl condition: spamd connection to 127.0.0.1, port 783 failed:
R> Connection timed out
R> Which of the following could I cut back on, or does it depend on
R> which types of spam our site is getting?

Yes, when looking at custom rules files, it strongly depends on what
spam you're getting.

R> 70_sare_adult.cf
R> 70_sare_bayes_poison_nxm.cf
I personally don't use this -- I personally verify 75%+ of all mail
that goes through SA's analysis on three domains, and I feed 100% of
that mail (excepting lists like this) into SA-Learn. IMO there is no
bayes poison, only bayes fodder. I expect the rule set would be useful
for those with less comprehensive training. Also, since you don't
mention Bayes above, if you /don't/ run Bayes, this rules file can be
very useful.
R> 70_sare_genlsubj0.cf
R> 70_sare_header0.cf
R> 70_sare_html0.cf
The above are great, and the most efficient of their families. I hope
to have updates for them out in another week or so.
R> 70_sare_oem.cf
R> 70_sare_random.cf
R> 70_sare_specific.cf
R> 70_sare_spoof.cf
R> 70_sare_unsub.cf
R> 70_sare_uri.cf
R> 72_sare_bml_post25x.cf
R> 72_sare_redirect_post3.0.0.cf
R> 99_sare_fraud_post25x.cf
Ought to get that last set renamed back to the 70's range...
R> chickenpox.cf
R> evilnumbers.cf
R> init.pre
R> local.cf
All look good. You've got an intelligent selection there. None of them
should be expensive (in computer resources).

You say you're running with SURIBLs. Are you also running with other
network tests? All standard network tests are good aids to SA scoring,
but they can contribute to a timeout problem, since they need to wait
for that other system somewhere on the network to respond.

R> Are any of the above redundant in 3.0.1, and is there a list somewhere
R> of the rulesets that are made redundant by subsequent versions of
R> SA? That might be helpful.

None of the ruleset files you list above are redundant with 3.0.1 nor
with each other. Eventually we'll put a comprehensive list of what
ruleset files are appropriate for which versions of SA (and/or which
should/not be used with each other) on the Wiki ... hopefully one of
us will have time to do that before end of year.

Bob Menschel