Posted to users@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2004/11/24 06:16:19 UTC

Re[2]: selected rulesets for better performance

Hello Matt,

Tuesday, November 23, 2004, 7:32:05 PM, you wrote:

MK> At 09:51 PM 11/23/2004, Robert Menschel wrote:
>>R> 70_sare_bayes_poison_nxm.cf
>>I personally don't use this -- I personally verify 75%+ of all mail
>>that goes through SA's analysis on three domains, and I feed 100% of
>>that mail (excepting lists like this) into SA-Learn. IMO there is no
>>bayes poison, only bayes fodder. I expect the rule set would be useful
>>for those with less comprehensive training. Also, since you don't
>>mention Bayes above, if you /don't/ run Bayes, this rules file can be
>>very useful.
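[For reference, the training workflow described above boils down to a couple of sa-learn invocations. This is only a sketch; the folder paths are hypothetical and should be replaced with your own verified mail folders.]

```shell
# Feed hand-verified mail into the Bayes DB (paths are hypothetical).
sa-learn --spam --mbox ~/mail/verified-spam   # train on verified spam
sa-learn --ham  --mbox ~/mail/verified-ham    # train on verified ham
sa-learn --dump magic                         # inspect nspam/nham counters
```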

MK> I agree totally on the concept of poison in terms of training.
MK> There is no bayes poison, only fodder. 

MK> However, I would also agree that detecting lame attempts to poison
MK> bayes is a good spam sign. With SA 3.0's weak bayes scores in set3
MK> (1.886 for BAYES_99), this can help even a system with a well
MK> trained bayes DB.

Which brings up another point which has been mentioned on the list
before -- the BAYES_99 score is too low for well-trained systems.

I have never seen a BAYES_99 hit on any non-spam. I run with BAYES_99
at my spam threshold (9), and BAYES_95 at 75% of that threshold.
Either it hasn't happened yet, or it has happened only on non-spam
where my negative-scoring rules brought the scores down enough to be
treated as ham.

The distributed score is probably good for a system which is not
manually trained, or poorly trained, or mistrained. However, when
admins take the care to train their Bayes system properly, IMO that
score can and should be raised.
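For an admin who decides to raise it, the change is a one-line score override in local.cf. The values below simply mirror the thresholds I described above (spam threshold 9, BAYES_95 at 75% of it); they're a per-site judgment call, not a recommendation:

```
# local.cf -- per-site overrides for a well-trained Bayes DB
score BAYES_99 9.0     # equal to my local spam threshold
score BAYES_95 6.75    # 75% of that threshold
```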

There are other score adjustments that probably should be documented
and shared within the SA community. I once posted most of my score
mods on the exit0.us wiki.

Should we maybe develop a section of the SA wiki dedicated to score
mods and other mods specific to rules?

Bob Menschel




Re: selected rulesets for better performance

Posted by Theo Van Dinter <fe...@kluge.net>.
On Wed, Nov 24, 2004 at 01:19:49AM -0500, Matt Kettler wrote:
> Quite frankly, I suspect corpus pollution. It really only takes 1 high 
> scoring spam in the nonspam corpus to really screw up the message scores.

That's quite possible.  I don't think anyone has a 100% non-polluted corpus,
try as we might. :(

> 1) DRUGS_PAIN_OBFU actually hit some nonspam? I find that odd, but it could 
> be a typo.

Looking at the submitted results:

dave.log:. /home/dave/corpus/cooked-ham.43366468
jm.log:. /home/jm/Mail/deld.priv/34675
jm.log:. /home/jm/Mail/deld.priv/34682
jm.log:. /home/jm/Mail/deld.priv/34699
jm.log:. /home/jm/Mail/deld.priv/34703
quinlan.log:. /home/corpus/mail/ham/166370
quinlan.log:. /home/corpus/mail/ham/166400
quinlan.log:. /home/corpus/mail/ham/166430
quinlan.log:. /home/corpus/mail/ham/166437

> 2) DRUGS_SMEAR1 hit some nonspam? I find that damn near impossible. I don't 
> think any nonspam email other than one quoting spam will ever hit that 
> rule. It seems there's one drug spam, or drug spam quote in somebody's 
> corpus, and it was run in all 4 sets. (If anyone can show me the nonspam 
> matching that rule and it's not spam or a spam quote or discussion of SA's 
> rules, I'll send em $20. Really.)

jm.log:. /home/jm/Mail/deld.priv/26352

> 4) NIGERIAN_BODY3? could be a finance newsletter, but very unlikely.

That was mine:

theo.log:Y ham/misc200405-200407.33861588

Unfortunately I took those misc ham mboxes and converted them to dir
format a while ago, so I don't know what message that was.

> 6) PERCENT_RANDOM? Very unlikely. What would have %rnd_x in it?

jm.log:. /home/jm/Mail/deld.pub/12701

-- 
Randomly Generated Tagline:
Choosy modemers choose GIF.

Re[2]: selected rulesets for better performance

Posted by Matt Kettler <mk...@evi-inc.com>.
At 12:16 AM 11/24/2004, Robert Menschel wrote:
>Which brings up another point which has been mentioned on the list
>before -- the BAYES_99 score is too low for well-trained systems.
>
>I have never seen a BAYES_99 hit on any non-spam.

Yeah, it's kind of suspect. Take a look at the STATISTICS.txt data for 
set3 and set2.

Notice that in set3 the nonspam hit rate is quite low, but it's 10x higher 
than in set2 as a percentage of the total nonspam corpus.

Quite frankly, I suspect corpus pollution. It really only takes 1 high 
scoring spam in the nonspam corpus to really screw up the message scores.

Things in general I find suspect about the STATISTICS-set*.txt files for 3.x:

1) DRUGS_PAIN_OBFU actually hit some nonspam? I find that odd, but it could 
be a typo.

2) DRUGS_SMEAR1 hit some nonspam? I find that damn near impossible. I don't 
think any nonspam email other than one quoting spam will ever hit that 
rule. It seems there's one drug spam, or drug spam quote in somebody's 
corpus, and it was run in all 4 sets. (If anyone can show me the nonspam 
matching that rule and it's not spam or a spam quote or discussion of SA's 
rules, I'll send em $20. Really.)


3) Hugely better bayes performance in set2 compared to set3: a factor of 10 
difference in FP rate for BAYES_90 and higher. Admittedly overall hits are 
up, but not by that much.

# grep BAYES_9 STATISTICS-set2.txt
35.784  73.4212   0.0034    1.000   0.98    4.07  BAYES_99
1.483   3.0402   0.0030    0.999   0.87    3.61  BAYES_90
1.173   2.4030   0.0030    0.999   0.85    3.51  BAYES_95

# grep BAYES_9 STATISTICS-set3.txt
43.515  89.3888   0.0335    1.000   0.83    1.89  BAYES_99
0.805   1.6326   0.0202    0.988   0.70    2.06  BAYES_95
0.913   1.8399   0.0343    0.982   0.64    2.09  BAYES_90

4) NIGERIAN_BODY3? could be a finance newsletter, but very unlikely.

5) HARDCORE_PORN? Hmmm, possible. Unlikely, but "extreme hardcore gaming" 
would match it.

6) PERCENT_RANDOM? Very unlikely. What would have %rnd_x in it?
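[The factor-of-10 difference mentioned in point 3 can be checked directly from the BAYES_99 nonspam-hit percentages (third column) in the two grep outputs above:]

```shell
# set3 ham hit rate (0.0335%) vs set2 (0.0034%) for BAYES_99
awk 'BEGIN { printf "%.1f\n", 0.0335 / 0.0034 }'   # prints 9.9
```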