Posted to users@spamassassin.apache.org by Jerome Cartagena <je...@sandiego.edu> on 2005/02/25 19:28:18 UTC
Strange SpamAssassin Statistical Performance
I have been a member of the mailing list for a couple of months and have
been intently watching the active discussions taking place. This is my
first time posting, so I'd like to say hello to everyone and introduce
myself.
I am using spamassassin through MailScanner on a University mail server
to help perform spam checks. I am using:
SpamAssassin version 3.0.2
running on Perl version
I have set up some scripts (spam-stats) to generate MRTG stats to help
give us an idea of how well performance is going. My main
problem/question is that, according to our statistics, we are reaching
some sort of upper bound on spam scanning performance. I have attached
2 files to help demonstrate what I am talking about. I am wondering
whether we are hitting some sort of performance limit on our mail
scanning machines or whether this is simply how much spam we are
actively collecting. I'd appreciate any comments, ideas, or
suggestions on a possible explanation for this situation.
Thank you in advance for all replies,
~Jerome Cartagena
Re: Strange SpamAssassin Statistical Performance
Posted by Jerome Cartagena <je...@sandiego.edu>.
Sorry for the confusion.
The blue lines represent HAM ("clean") messages, while the green lines
represent SPAM.
~Jerome Cartagena
On Feb 25, 2005, at 1:22 PM, jdow wrote:
> What do the colors mean, Jerome?
> {^_^}
> ----- Original Message -----
> From: "Jerome Cartagena" <je...@sandiego.edu>
>
>
>> Ok. Here are the graph details of SpamAssassin performance:
>>
>>
>
> <pix>
>
>> spam_clean_day: (5 min avg)
>> Max spam: 1304.0 msgs Average spam: 489.0 msgs Current spam: 516.0
>> msgs
>> Max clean: 7224.0 msgs Average clean: 1309.0 msgs Current clean:
>> 1357.0
>> msgs
>
> <etc>
>
> {^_^}
>
Re: Strange SpamAssassin Statistical Performance
Posted by jdow <jd...@earthlink.net>.
What do the colors mean, Jerome?
{^_^}
----- Original Message -----
From: "Jerome Cartagena" <je...@sandiego.edu>
> Ok. Here are the graph details of SpamAssassin performance:
>
>
<pix>
> spam_clean_day: (5 min avg)
> Max spam: 1304.0 msgs Average spam: 489.0 msgs Current spam: 516.0 msgs
> Max clean: 7224.0 msgs Average clean: 1309.0 msgs Current clean: 1357.0
> msgs
<etc>
{^_^}
Re: Strange SpamAssassin Statistical Performance
Posted by Ken A <ka...@pacific.net>.
Matt Kettler wrote:
> At 02:04 PM 2/25/2005, Jerome Cartagena wrote:
>
>> The main reason I believe this is a performance issue is the strange
>> flat line that is demonstrated by the graph. Although it concerns me
>> that I get much more HAM than SPAM (I believe current industry
>> standards report 80+% spam traffic), I simply can't explain why we are
>> hitting that kind of upper limit.
>
>
> You mean why the spam rate is more-or-less constant?
>
> That's pretty normal..
>
> If you look at my graphs, they are a spam count for the day, resetting
> at midnight.. The fairly even rate of spam makes a nice, relatively
> even, sawtooth pattern. Changes in spam rate would make the sawtooth's
> curve. They do show a slight mid-day hump, but not by much.
>
> http://xanadu.evitechnology.com/mailscanner-mrtg/spam/spam.html
>
> Compare to my mail total graphs, which show significant curving due to
> mid-day ham levels being higher than mid-night ham levels:
>
> http://xanadu.evitechnology.com/mailscanner-mrtg/mail/mail.html
>
> If you're using mailscanner, a symptom of an overloaded system will be
> /var/spool/mqueue.in getting crowded. When the number of files in that
> directory keeps growing it means mail is coming in faster than
> MailScanner can feed it to SA and the AV tools. (Note: this is different
> than /var/spool/mqueue, which will get crowded when there's
> undeliverable messages)
>
>
Same kind of humpi-ness here, with a larger sample (see attached)
(the milter is milter-dnsrbl, using spamhaus rbl)
Ken
Pacific.Net
Re: Strange SpamAssassin Statistical Performance
Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:04 PM 2/25/2005, Jerome Cartagena wrote:
>The main reason I believe this is a performance issue is the strange flat
>line that is demonstrated by the graph. Although it concerns me that I
>get much more HAM than SPAM (I believe current industry standards report
>80+% spam traffic), I simply can't explain why we are hitting that kind of
>upper limit.
You mean why the spam rate is more-or-less constant?
That's pretty normal..
If you look at my graphs, they are a spam count for the day, resetting at
midnight. The fairly even rate of spam makes a nice, relatively even
sawtooth pattern. Changes in the spam rate would make the sawtooth edges
curve. They do show a slight mid-day hump, but not by much.
http://xanadu.evitechnology.com/mailscanner-mrtg/spam/spam.html
Compare to my mail total graphs, which show significant curving due to
mid-day ham levels being higher than mid-night ham levels:
http://xanadu.evitechnology.com/mailscanner-mrtg/mail/mail.html
If you're using MailScanner, a symptom of an overloaded system will be
/var/spool/mqueue.in getting crowded. When the number of files in that
directory keeps growing, it means mail is coming in faster than MailScanner
can feed it to SA and the AV tools. (Note: this is different from
/var/spool/mqueue, which will get crowded when there are undeliverable
messages.)
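As a rough illustration of the check described above, here is a minimal Python sketch that counts the files in a queue directory and tests whether a series of counts keeps rising. The function names, the idea of sampling the count at intervals, and the sample numbers are assumptions made for this example; only the /var/spool/mqueue.in path comes from the message. Note that it counts files, not messages.

```python
import os


def count_queue_files(queue_dir):
    """Return the number of regular files directly inside queue_dir."""
    with os.scandir(queue_dir) as entries:
        return sum(1 for entry in entries if entry.is_file())


def is_backlog_growing(samples):
    """True when each successive count is larger than the one before it.

    `samples` is a list of file counts taken at regular intervals
    (hypothetical: for example, one reading every five minutes).
    """
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )


# Example use (the path is the one named in the message; adjust as needed):
#   current = count_queue_files("/var/spool/mqueue.in")
#   is_backlog_growing([120, 340, 910])   # counts are made-up sample values
```

A single large count is not conclusive on its own; it is the sustained growth across readings that suggests mail is arriving faster than it is being scanned.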
Re: Strange SpamAssassin Statistical Performance
Posted by Jerome Cartagena <je...@sandiego.edu>.
hello
> What makes you think this is a performance issue? The fact that you're
> getting more HAM than SPAM? or what?
The main reason I believe this is a performance issue is the strange
flat line that is demonstrated by the graph. Although it concerns me
that I get much more HAM than SPAM (I believe current industry
standards report 80+% spam traffic), I simply can't explain why we are
hitting that kind of upper limit.
> Do you have any front-end RBLs, Greylists, or other MTA layer filters
> that are filtering out some of the spam before it gets to MailScanner
> in the first place? (This would severely bias your numbers)
We are using an RBL list at the mail gateway level. I am looking into
greylisting and other plug-ins, but I would like to make sure that
spamassassin is well tuned and efficient as we examine and add other
solutions.
~Jerome Cartagena
On Feb 25, 2005, at 10:55 AM, Matt Kettler wrote:
> At 01:43 PM 2/25/2005, Jerome Cartagena wrote:
>> spam_clean_day: (5 min avg)
>> Max spam: 1304.0 msgs Average spam: 489.0 msgs Current spam:
>> 516.0 msgs
>> Max clean: 7224.0 msgs Average clean: 1309.0 msgs Current
>> clean: 1357.0 msgs
>
> Ok, that's better. I know what I'm looking at now.
>
> What makes you think this is a performance issue? The fact that you're
> getting more HAM than SPAM? or what?
>
> Do you have any front-end RBLs, Greylists, or other MTA layer filters
> that are filtering out some of the spam before it gets to MailScanner
> in the first place? (This would severely bias your numbers)
>
>
Re: Strange SpamAssassin Statistical Performance
Posted by Matt Kettler <mk...@evi-inc.com>.
At 01:43 PM 2/25/2005, Jerome Cartagena wrote:
>spam_clean_day: (5 min avg)
>Max spam: 1304.0 msgs Average spam: 489.0 msgs Current spam:
>516.0 msgs
>Max clean: 7224.0 msgs Average clean: 1309.0 msgs Current clean:
>1357.0 msgs
Ok, that's better. I know what I'm looking at now.
What makes you think this is a performance issue? The fact that you're
getting more HAM than SPAM? or what?
Do you have any front-end RBLs, Greylists, or other MTA layer filters that
are filtering out some of the spam before it gets to MailScanner in the
first place? (This would severely bias your numbers)
Re: Strange SpamAssassin Statistical Performance
Posted by Jerome Cartagena <je...@sandiego.edu>.
Ok. Here are the graph details of SpamAssassin performance:
Re: Strange SpamAssassin Statistical Performance
Posted by Matt Kettler <mk...@evi-inc.com>.
At 01:28 PM 2/25/2005, Jerome Cartagena wrote:
>I am using spamassassin through MailScanner on a University mail server to
>help perform spam checks. I am using:
>SpamAssassin version 3.0.2
> running on Perl version
>I have setup some scripts (spam-stats) to generate MRTG stats to help give
>us an idea of how well performance is going. My main problem/question is
>that according to our statistics we are reaching some sort of upper bound
>on spam scanning performance.
I don't get it, as the graph axes are completely unlabeled...
Re: Strange SpamAssassin Statistical Performance
Posted by Matt Kettler <mk...@comcast.net>.
At 05:55 PM 2/26/2005, Justin Mason wrote:
>I'm thinking it might be worthwhile setting up a section of the FAQ
>for MailScanner users, similarly for amavisd users, etc. with these
>type of answers.
>
>I'd say pretty much all MailScanner sites with bayes running
>would need to use that cronjob tactic.
Agreed... Although realistically you might just want to point it to
MailScanner's FAQ which covers a lot of this stuff already.
Re: Strange SpamAssassin Statistical Performance
Posted by Matt Kettler <mk...@comcast.net>.
At 08:57 PM 2/25/2005, jdow wrote:
>Sometimes SA may time out. If it does there are no SA markups in the
>messages. Makes it easy to test for.
True, this can happen when using MailScanner..
Although, as it turns out, FN's aren't the poster's concern.
As for SA timeouts under MailScanner, they are usually caused by bayes
expiry. SA goes into an auto-expire while scanning a message and
MailScanner presumes it's hung up and kills it. I usually run with
bayes_auto_expire disabled, and have a cronjob run sa-learn --force-expire
against my bayes DB.
You can easily check for SA timeouts under MailScanner with grep:
grep "assin timed out" /var/log/maillog
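A Python equivalent of the grep above, for anyone who wants to feed the count into a stats script. This is only a sketch: the function name is made up, and the sample log lines in the comment are invented for illustration rather than copied from a real maillog.

```python
def count_sa_timeouts(log_lines, needle="assin timed out"):
    """Count log lines containing the same substring the grep looks for."""
    return sum(1 for line in log_lines if needle in line)


# Example use against the log file named in the message:
#   with open("/var/log/maillog", errors="replace") as log:
#       print(count_sa_timeouts(log))
#
# Example with invented lines:
#   count_sa_timeouts([
#       "host MailScanner[123]: SpamAssassin timed out and was killed",
#       "host sendmail[456]: some unrelated entry",
#   ])
```

The function takes any iterable of strings, so it works on an open file object as well as on a list of lines.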
Re: Strange SpamAssassin Statistical Performance
Posted by jdow <jd...@earthlink.net>.
Sometimes SA may time out. If it does there are no SA markups in the
messages. Makes it easy to test for.
{^_^}
----- Original Message -----
From: "Eric A. Hall" <eh...@ehsco.com>
To: "Jerome Cartagena" <je...@sandiego.edu>
Cc: <us...@spamassassin.apache.org>
Sent: 2005 February, 25, Friday 16:14
Subject: Re: Strange SpamAssassin Statistical Performance
>
> That's MailScanner; I'm suggesting that if you look to see if it was
> processed through SA or not (MS might be skipping if no processes are
> available, or might be using the wrong queue, or any number of other
> things could be going wrong).
>
> On 2/25/2005 6:51 PM, Jerome Cartagena wrote:
> > MailScanner does alter the Raw headers of each mail message and I can
> > verify that each message does not get delivered to the user's INBOX
> > until it has been processed.
> >
> > ~Jerome Cartagena
> >
> >
> > On Feb 25, 2005, at 11:28 AM, Eric A. Hall wrote:
> >
> >
> >>On 2/25/2005 2:00 PM, Jerome Cartagena wrote:
> >>
> >>
> >>>according to the graphs, the number of detected spam has a steady
> >>>upper
> >>>limit while the actual number of undetected spam fluctuates wildly.
> >>
> >>Can you tell if the undetected spam is getting processed (I like to tag
> >>all mail regardless of score).
> >>
> >>From your sentence above it sounds like you don't have enough processes
> >>for the volume and the overflow mail is taking a shortcut.
> >>
> >>--
> >>Eric A. Hall
> >>http://www.ehsco.com/
> >>Internet Core Protocols
> >>http://www.oreilly.com/catalog/coreprot/
> >>
> >
> >
>
> --
> Eric A. Hall http://www.ehsco.com/
> Internet Core Protocols http://www.oreilly.com/catalog/coreprot/
Re: SQL settings & Deprecated rulesets?
Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Codger,
Sunday, February 27, 2005, 7:30:47 AM, you wrote:
C> 2. I have these old rulesets but I can't find the previous download
C> page or a comprehensive list anywhere of the ones that are deprecated
C> by version 3.0.2 and I want to eliminate duplication. Which are
C> absolutely deprecated and which are recommended for continued use?
C> (The asterisks are the ones that I understand shouldn't be used with
C> 3.0.2.)
C> 70_SARE_Adult.cf.bak
C> 70_SARE_Genlsubj0.cf.bak
C> 70_SARE_Header0.cf.bak
If you have those two you probably also want html0 and uri0.
C> 70_SARE_Random.cf.bak
C> 70_SARE_SPOOF.cf.bak
C> 71_SARE_Redirect_pre3.cf.bak
If you're now on 3.0, you want 72_sare_redirect_post3.0.0.cf
C> 72_SARE_BML.cf.bak
Incomplete name -- that's 72_sare_bml_post25x.cf.
C> 99_FVGT_Tripwire.cf.bak
Better: 88_FVGT_Tripwire.cf
C> 99_OBFU_drugs.cf.bak
Discontinue.
C> 99_SARE_Fraud.cf.bak
99_sare_fraud_post25x.cf?
C> 99_SARE_OEM.cf
Replace with 70_sare_oem.cf
C> * antidrug.cf.bak
C> backhair.cf.bak
C> * bigevil.cf.bak
C> bogus-virus-warnings.cf.bak
C> * chickenpox.cf.bak
C> evilnumbers.cf.bak
C> ratware.cf.bak
Discontinued. Incorporated into header/html/uri files.
C> useless.cf.bak
Discontinued -- incorporated into the HTML family
C> weeds.cf.bak
Bob Menschel
--
Best regards,
Robert mailto:Robert@Menschel.net
Re[2]: SQL settings & Deprecated rulesets?
Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Loren,
Sunday, February 27, 2005, 7:37:58 PM, you wrote:
LW> Also, you show header0 and gensubj0. Almost always you would also
LW> want the "1" version of these rulesets. The "0" version mostly
LW> just sets up stuff used by the other sets, I believe.
Not quite. genlsubj0, header0, html0, uri0 all contain rules that hit
spam (10 or more in a single mass-check) and do not hit any ham.
Their matching files 1 (genlsubj1, header1, etc) contain rules that
either hit fewer spam (ie: might not be worth the resources) or hit
some ham (but with S/O better than 90%).
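To make the split concrete, here is a small Python sketch of the rule described above. It assumes S/O means spam hits divided by total hits on raw counts, which is a simplification (it ignores any corpus-size normalization). The function names, and the None result for rules that fit neither file, are hypothetical.

```python
def so_ratio(spam_hits, ham_hits):
    """Spam hits as a fraction of all hits (assumed reading of S/O)."""
    total = spam_hits + ham_hits
    if total == 0:
        return None  # the rule never fired, so no ratio can be computed
    return spam_hits / total


def suggested_file(spam_hits, ham_hits, min_spam=10, min_so=0.90):
    """Return "0", "1", or None for one rule's mass-check hit counts."""
    # File 0: hits 10 or more spam in a mass-check and no ham at all.
    if spam_hits >= min_spam and ham_hits == 0:
        return "0"
    # File 1: fewer spam hits, or some ham hits, but S/O better than 90%.
    ratio = so_ratio(spam_hits, ham_hits)
    if ratio is not None and ratio > min_so:
        return "1"
    return None  # assumption: such a rule would not be in either file


# Example use with made-up counts:
#   suggested_file(25, 0)   # many spam hits, no ham
#   suggested_file(95, 5)   # some ham hits, S/O of 0.95
```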
Bob Menschel
Re: SQL settings & Deprecated rulesets?
Posted by Loren Wilton <lw...@earthlink.net>.
Check the SARE page for the various rulesets to see if any have been
deprecated for 3.0. I don't believe any of the ones you have listed have
been, but it is worth a check.
That said: you have some ANCIENT rulesets there that have been updated
several times and have new names. I don't believe that we have any rulesets
now with numbers above 70.
Also, you show header0 and genlsubj0. Almost always you would also want the
"1" version of these rulesets. The "0" version mostly just sets up stuff
used by the other sets, I believe.
> 70_SARE_Adult.cf.bak
> 70_SARE_Genlsubj0.cf.bak
> 70_SARE_Header0.cf.bak
> 70_SARE_Random.cf.bak
> 70_SARE_SPOOF.cf.bak
> 71_SARE_Redirect_pre3.cf.bak
> 72_SARE_BML.cf.bak
> 99_FVGT_Tripwire.cf.bak
> 99_OBFU_drugs.cf.bak
> 99_SARE_Fraud.cf.bak
> 99_SARE_OEM.cf
Antidrug is in 3.0
> * antidrug.cf.bak
> backhair.cf.bak
Dump bigevil! Turn on the net rules instead.
> * bigevil.cf.bak
> bogus-virus-warnings.cf.bak
> * chickenpox.cf.bak
Evilnumbers can be useful if you don't have net rules running, but the net
tests will generally do better.
> evilnumbers.cf.bak
Some of the ratware stuff is probably in 3.0. Also, there is some in SARE
rulesets, such as Random and Specific.
I think (though I'm not positive) that random and useless are old
deprecated rulesets.
> ratware.cf.bak
> useless.cf.bak
> weeds.cf.bak
Backhair, bogus-virus-warnings (which may have been updated since your
version), chickenpox, tripwire, and weeds can all still be useful rulesets,
even though they haven't been updated in ages.
Loren
SQL settings & Deprecated rulesets?
Posted by Codger <li...@pmbx.net>.
OK, I've successfully transitioned from 2.63 to 3.0.2 but I have two
questions:
1. The sql users database doesn't seem to be used though it was working
fine in 2.63. I have the following configuration in local.cf. Is there
a plugin or other setting that I need to make this work? (I'm using
CGSPA and not spamd by the way in CommuniGate Pro if that helps. Don't
think it should make any difference.)
allow_user_rules 1
user_scores_dsn DBI:mysql:spamassassin:localhost:3306
user_scores_sql_username _mysqlusername_
user_scores_sql_password _mysqlpass_
2. I have these old rulesets but I can't find the previous download
page or a comprehensive list anywhere of the ones that are deprecated by
version 3.0.2 and I want to eliminate duplication. Which are absolutely
deprecated and which are recommended for continued use? (The asterisks
are the ones that I understand shouldn't be used with 3.0.2.)
70_SARE_Adult.cf.bak
70_SARE_Genlsubj0.cf.bak
70_SARE_Header0.cf.bak
70_SARE_Random.cf.bak
70_SARE_SPOOF.cf.bak
71_SARE_Redirect_pre3.cf.bak
72_SARE_BML.cf.bak
99_FVGT_Tripwire.cf.bak
99_OBFU_drugs.cf.bak
99_SARE_Fraud.cf.bak
99_SARE_OEM.cf
* antidrug.cf.bak
backhair.cf.bak
* bigevil.cf.bak
bogus-virus-warnings.cf.bak
* chickenpox.cf.bak
evilnumbers.cf.bak
ratware.cf.bak
useless.cf.bak
weeds.cf.bak
Re: Strange SpamAssassin Statistical Performance
Posted by "Eric A. Hall" <eh...@ehsco.com>.
That's MailScanner; I'm suggesting that you look to see whether it was
processed through SA or not (MS might be skipping SA if no processes are
available, or might be using the wrong queue, or any number of other
things could be going wrong).
On 2/25/2005 6:51 PM, Jerome Cartagena wrote:
> MailScanner does alter the Raw headers of each mail message and I can
> verify that each message does not get delivered to the user's INBOX
> until it has been processed.
>
> ~Jerome Cartagena
>
>
> On Feb 25, 2005, at 11:28 AM, Eric A. Hall wrote:
>
>
>>On 2/25/2005 2:00 PM, Jerome Cartagena wrote:
>>
>>
>>>according to the graphs, the number of detected spam has a steady
>>>upper
>>>limit while the actual number of undetected spam fluctuates wildly.
>>
>>Can you tell if the undetected spam is getting processed (I like to tag
>>all mail regardless of score).
>>
>>From your sentence above it sounds like you don't have enough processes
>>for the volume and the overflow mail is taking a shortcut.
>>
>>--
>>Eric A. Hall
>>http://www.ehsco.com/
>>Internet Core Protocols
>>http://www.oreilly.com/catalog/coreprot/
>>
>
>
--
Eric A. Hall http://www.ehsco.com/
Internet Core Protocols http://www.oreilly.com/catalog/coreprot/
Re: Strange SpamAssassin Statistical Performance
Posted by Jerome Cartagena <je...@sandiego.edu>.
MailScanner does alter the Raw headers of each mail message and I can
verify that each message does not get delivered to the user's INBOX
until it has been processed.
~Jerome Cartagena
On Feb 25, 2005, at 11:28 AM, Eric A. Hall wrote:
>
> On 2/25/2005 2:00 PM, Jerome Cartagena wrote:
>
>> according to the graphs, the number of detected spam has a steady
>> upper
>> limit while the actual number of undetected spam fluctuates wildly.
>
> Can you tell if the undetected spam is getting processed (I like to tag
> all mail regardless of score).
>
> From your sentence above it sounds like you don't have enough processes
> for the volume and the overflow mail is taking a shortcut.
>
> --
> Eric A. Hall
> http://www.ehsco.com/
> Internet Core Protocols
> http://www.oreilly.com/catalog/coreprot/
>
Re: Strange SpamAssassin Statistical Performance
Posted by "Eric A. Hall" <eh...@ehsco.com>.
On 2/25/2005 2:00 PM, Jerome Cartagena wrote:
> according to the graphs, the number of detected spam has a steady upper
> limit while the actual number of undetected spam fluctuates wildly.
Can you tell if the undetected spam is getting processed (I like to tag
all mail regardless of score).
From your sentence above it sounds like you don't have enough processes
for the volume and the overflow mail is taking a shortcut.
--
Eric A. Hall http://www.ehsco.com/
Internet Core Protocols http://www.oreilly.com/catalog/coreprot/
Re: Strange SpamAssassin Statistical Performance
Posted by Jerome Cartagena <je...@sandiego.edu>.
Hello
> There are 86400 seconds in a 24-hour day, and if it takes you 10
> seconds
> per message (high but possible with large number of remote tests) with
> just one process (unlikely) then you are going to be capped at 8,640
> messages per day at flat-rate (nobody gets perfectly-distributed
> traffic
> patterns, especially with email).
This is a perfectly reasonable explanation except for the fact that
according to the graphs, the number of detected spam has a steady upper
limit while the actual number of undetected spam fluctuates wildly. I
am working under the assumption that all mail messages accounted for
have already passed through spamassassin and have been classified as
either "clean" or "spam". Thus, even clean messages are messages that
have been processed. This means that if we were hitting a hardware
limitation, the same behavior should apply to "clean" messages (the
blue line in the graph). However, this is not the case.
Another point to note is that if we were hitting a hardware limit on
the number of messages that can be processed, then mail messages would
be queued to await processing until CPU is available. This queueing
should be reflected in the graphs, but it is not.
Here is some additional information about the system:
The server box we are examining has 4 CPUs, each running at an average
of 10-18% usage, with a load average between 0.2 and 0.8 throughout the
day. The box has 2 GB of memory, of which 82% is in use.
~Jerome Cartagena
On Feb 25, 2005, at 10:39 AM, Eric A. Hall wrote:
>
> On 2/25/2005 1:28 PM, Jerome Cartagena wrote:
>
>> problem/question is that according to our statistics we are reaching
>> some sort of upper bound on spam scanning performance. I have
>> attached
>> 2 files to help demonstrate what I am talking about. I am wondering
>> if
>> we are hitting some sort of performance limit on our mail scanning
>> machines or is it simply the case that this is how much spam we are
>> actively collecting. I'd appreciate any comments, ideas, or
>> suggestions on a possible explanation regarding this situation.
>
> Number of messages you can process per unit of time depends on several
> factors, namely:
>
> * number of messages per unit of time
>
> * amount of time needed to process each message
>
> * units of time available
>
> * number of processes/processors available
>
> * peak variances
>
> There are 86400 seconds in a 24-hour day, and if it takes you 10
> seconds
> per message (high but possible with large number of remote tests) with
> just one process (unlikely) then you are going to be capped at 8,640
> messages per day at flat-rate (nobody gets perfectly-distributed
> traffic
> patterns, especially with email).
>
> There are secondary factors like memory and cpu availability that will
> affect processing capacity and you need to look at that too. But for
> starters, plug in your own traffic values to see what you should be
> aiming
> for in terms of target number of available processes at peak load, and
> that will get you started.
>
> --
> Eric A. Hall
> http://www.ehsco.com/
> Internet Core Protocols
> http://www.oreilly.com/catalog/coreprot/
>
Re: Strange SpamAssassin Statistical Performance
Posted by "Eric A. Hall" <eh...@ehsco.com>.
On 2/25/2005 1:28 PM, Jerome Cartagena wrote:
> problem/question is that according to our statistics we are reaching
> some sort of upper bound on spam scanning performance. I have attached
> 2 files to help demonstrate what I am talking about. I am wondering if
> we are hitting some sort of performance limit on our mail scanning
> machines or is it simply the case that this is how much spam we are
> actively collecting. I'd appreciate any comments, ideas, or
> suggestions on a possible explanation regarding this situation.
Number of messages you can process per unit of time depends on several
factors, namely:
* number of messages per unit of time
* amount of time needed to process each message
* units of time available
* number of processes/processors available
* peak variances
There are 86400 seconds in a 24-hour day, and if it takes you 10 seconds
per message (high but possible with large number of remote tests) with
just one process (unlikely) then you are going to be capped at 8,640
messages per day at flat-rate (nobody gets perfectly-distributed traffic
patterns, especially with email).
There are secondary factors like memory and cpu availability that will
affect processing capacity and you need to look at that too. But for
starters, plug in your own traffic values to see what you should be aiming
for in terms of target number of available processes at peak load, and
that will get you started.
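The arithmetic above can be sketched in a few lines of Python. The function names and the peak-load figures in the comments (600 messages arriving in a 300-second window) are hypothetical; only the 86400 seconds, the 10 seconds per message, and the single process come from the message.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def daily_capacity(seconds_per_message, processes=1):
    """Most messages that can be scanned in a day at a perfectly flat rate."""
    return math.floor(SECONDS_PER_DAY * processes / seconds_per_message)


def processes_needed(peak_messages, window_seconds, seconds_per_message):
    """Scanner processes needed to keep up with a peak arrival window."""
    return math.ceil(peak_messages * seconds_per_message / window_seconds)


# The case from the message: one process at 10 seconds per message.
#   daily_capacity(10)                 # 8640 messages per day
#
# A made-up peak: 600 messages in 300 seconds at 10 seconds each.
#   processes_needed(600, 300, 10)     # 20 processes
```

Real traffic is not evenly distributed, so the daily figure is a ceiling rather than a target; sizing from the peak window is usually the more useful number.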
--
Eric A. Hall http://www.ehsco.com/
Internet Core Protocols http://www.oreilly.com/catalog/coreprot/