Posted to users@spamassassin.apache.org by re...@dwf.com on 2004/02/19 21:28:06 UTC

Any Statistics available from SpamAssassin?

I have been using SpamAssassin for a year or so, and couldn't
get along without it.

Still, I keep seeing writeups of SpamAssassin where they say
that it is 99+% efficient at recognizing spam, at least now
that it has the Bayesian filtering in it...

Well, that's not the case here. It does recognize on the order
of 300 messages a day (thank you, thank you), but probably
misses on the order of another 75-100.

So that's more like 75-80%, not 99+%.
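
Spelled out, that estimate is caught / (caught + missed). A minimal Python sketch of the arithmetic, using the rough daily counts above (the function name is only for illustration):

```python
def catch_rate(caught, missed):
    """Fraction of all incoming spam that was flagged."""
    return caught / (caught + missed)


# Roughly 300 flagged per day, with an estimated 75 to 100 missed.
best_case = catch_rate(300, 75)    # fewer misses
worst_case = catch_rate(300, 100)  # more misses
print(f"{worst_case:.0%} to {best_case:.0%}")
```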

Now the Bayesian filtering is up (I think) and primed.
The DCC stuff is up...

But HOW DO I TELL IF THESE THINGS ARE REALLY WORKING?

Is there any way to get any statistics out of SpamAssassin?

I do see comments about the DCC stuff in the maillog on occasion,
mostly complaints about not being able to get to a specific site,
but it would be nice to know if the Bayesian stuff is working at
all, or if it (SpamAssassin) was having long-term problems getting
to DCC sites.

I don't see any 'flags' for any statistics; am I missing something?
Even a script to grep the 'tossed' messages (I save them for a few
days) would be acceptable, but at the moment SpamAssassin is a great
big black box: it seems to work, but it could be having real problems
and I wouldn't have a clue.

Well, that's longer than I wanted the message to be, but...


-- 
                                        Reg.Clemens
                                        reg@dwf.com



Re: Any Statistics available from SpamAssassin?

Posted by Greg Cirino - Cirelle Enterprises <gc...@cirelle.com>.
| Now the Bayesian filtering is up (I think) and primed.
| The DCC stuff is up...
| 
| But HOW DO I TELL IF THESE THINGS ARE REALLY WORKING?


add_header spam DCC _DCCB_: _DCCR_

will show the DCC results in an X-Spam-DCC header on spam messages.
The Bayes rules will show up among the tests hit once the database
has had sufficient training.
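
One way to see which tests are actually firing on saved messages is to tally the rule names in their X-Spam-Status headers. What follows is a rough Python sketch, not part of SpamAssassin. It assumes the header carries a comma-separated tests=... list; the function names and the sample message are made up for illustration.

```python
import re
from collections import Counter
from email import message_from_string


def rule_hits(raw_message):
    """Return the rule names listed in one message's X-Spam-Status header."""
    status = message_from_string(raw_message).get("X-Spam-Status", "")
    # A long tests= list is folded across lines; rejoin it after each comma.
    status = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=([A-Za-z0-9_,]+)", status)
    if not match:
        return []
    return [name for name in match.group(1).split(",") if name and name != "none"]


def tally(raw_messages):
    """Count how many messages hit each rule."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(set(rule_hits(raw)))
    return counts


# Hypothetical sample; real headers may be laid out differently.
SAMPLE = (
    "From: someone@example.com\n"
    "X-Spam-Status: Yes, hits=11.5 required=5.0 tests=BAYES_99,DCC_CHECK,\n"
    "\tRANDOMWORD_10 autolearn=no version=2.63\n"
    "Subject: test\n"
    "\n"
    "body text\n"
)

for rule, count in tally([SAMPLE, SAMPLE]).most_common():
    print(rule, count)
```

If no BAYES_* or DCC names ever appear in the tally over a few days of saved spam, that would suggest the corresponding check is not running.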

Greg



Re: Any Statistics available from SpamAssassin?

Posted by Lucas Albers <ad...@cs.montana.edu>.
Bryan Britt said:
>
> Now with a default autolearned Bayes of about 18000 messages, which has
> quite a few mislearned emails, I'm correctly catching 99.1% of my spam
> messages with only 1 FP, at a spam-level setting of 5.0.
>
> After getting Pyzor running, I'm going to dump my Bayes database and
> actively train it.
>
> So for a set-and-forget install, a rate of 80% is good.  If you babysit it
> for a few thousand emails (a couple of days here), you can hit those
> numbers.
>
Look at the tarball; it has a masscheck statistics file that lists
the FN/FP for each score level:
Mail-SpamAssassin-2.63/rules/Statistics.txt.

It's all right if your Bayes database has a few mislearned emails in it;
Bayes will adjust for them.
Set your learn-ham/learn-spam levels carefully enough that you have very
few mislearnings.
I throw in Pyzor/DCC/Razor, and that knocks the average score up to 17.5.
This is the average over all email that scores 6 or higher, so it is not
the actual average score.
In my case I'm only catching around 65%-70% of the incoming spam, but my
mail volume is too high to relearn, and I can't have individual user
preferences. I catch more of the spam at the second SA install, which is
more customized.
I was too afraid of FPs to use the SARE rules, except for backhair.cf
and randomword.cf; they both work well.

randomword.cf, for Bayesian poisoning:

body        RANDOMWORD_10   /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
describe    RANDOMWORD_10   String of 10+ random words
score       RANDOMWORD_10   1

body        RANDOMWORD_15   /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
describe    RANDOMWORD_15   String of 15+ random words
score       RANDOMWORD_15   3

body        RANDOMWORD_20   /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){20}/
describe    RANDOMWORD_20   String of 20+ random words
score       RANDOMWORD_20   5

#then I upped those scores by .4 to .6
#normally .1
score HTML_FONTCOLOR_UNSAFE  .5
score HTML_FONTCOLOR_UNKNOWN .5
#normally .4
score HTML_FONTCOLOR_INVISIBLE 1
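
To get a feel for what the RANDOMWORD expressions above match, they can be tried against sample text outside SpamAssassin. The sketch below is only a rough check in Python: it assumes Python's re module treats this particular expression the same way Perl does, it does not reproduce how SpamAssassin renders a message body before applying body rules, and both sample strings are made up.

```python
import re


# Same expression as the RANDOMWORD rules, with the repeat count as a parameter.
def randomword(count):
    return re.compile(
        r"(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){%d}" % count
    )


# Made-up samples: eleven unrelated lowercase words, then an ordinary sentence.
word_salad = ("purple monkey dishwasher granite velvet harbor "
              "lantern cobalt thicket marble quiver ")
ordinary = "please send the report from last week to me and then call the office "

for count in (10, 15, 20):
    print(count,
          bool(randomword(count).search(word_salad)),
          bool(randomword(count).search(ordinary)))
```

Short words (under four letters) and the five excluded words break a run, which is why the ordinary sentence never reaches ten matching words in a row.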


Computer Science System Administrator
Security Administrator,College of Engineering
Montana State University-Bozeman,Montana


Re: Any Statistics available from SpamAssassin?

Posted by Bryan Britt <be...@beltane.com>.
I think your figure of 80% is probably pretty accurate for a default
install.  I watched my spam closely, was able to see a few patterns
myself, and adjusted a couple of things accordingly.

I noticed that a very large portion of my spam, missed and not, was
coming from SORBS and other blacklisted addresses.  So I raised the
score of each of those blacklist rules to 2.5, 7-8 of them in total.
With just those and a default-trained Bayes, I was consistently hitting
about 93%.

I also installed several add-on rule sets: weeds, chickenpox, antidrug.
I left out backhair for fear of FPs on the attachments I get a lot of.

I've installed Razor2 and DCC.  I haven't gotten to Pyzor yet; I'm
evaluating each one as I go.

Now with a default autolearned Bayes of about 18000 messages, which has
quite a few mislearned emails, I'm correctly catching 99.1% of my spam
messages with only 1 FP, at a spam-level setting of 5.0.

After getting Pyzor running, I'm going to dump my Bayes database and
actively train it.

So for a set-and-forget install, a rate of 80% is good.  If you babysit it
for a few thousand emails (a couple of days here), you can hit those
numbers.

Bryan Britt
Beltane Web Services


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ICQ: 53037451
Bryan L. Britt                                        501-327-8558
Beltane Web Services, Conway, AR            http://www.beltane.com
~~~~~~~~~~Support Private Communications on the Internet~~~~~~~~~~






Re: Any Statistics available from SpamAssassin?

Posted by Morris Jones <mo...@whiteoaks.com>.
On Thu, 19 Feb 2004 reg@dwf.com wrote:

> Well, that's not the case here. It does recognize on the order
> of 300 messages a day (thank you, thank you), but probably
> misses on the order of another 75-100.

You might run a spam through spamassassin with the debug flag (-D) on
and check to make sure all the DNS blacklists are getting called.

Mojo
-- 
Morris Jones         <*>
Monrovia, CA
mojo@whiteoaks.com
http://www.whiteoaks.com