Posted to users@spamassassin.apache.org by Thomas Cameron <th...@camerontech.com> on 2005/05/12 17:03:30 UTC

Suddenly load average of 15-18???

All -

spamc is suddenly bringing my mail server to its knees.

Running RHEL 4 with the spamassassin-3.0.1-0.EL4 (supplied by Red Hat) and 
spamass-milter-0.3.0-3 (I made that RPM) along with razor-agents-2.67-0, 
dcc-1.3.0-0 and pyzor-0.4.0-0.

All of a sudden about two days ago spamc processes were chewing up the 
machine - sendmail was actually rejecting messages because the load average 
was so high!  This is a machine that is only used for about 6 users...  It 
only handles around a thousand to two thousand messages a day.  I am the 
only admin on it and nothing has changed.

Here is my local.cf:

--- begin ---
required_score 5
report_safe 1
rewrite_header subject **SPAM** _SCORE_
ok_languages en
ok_locales en
use_dcc 1
use_pyzor 1
use_razor2 1
whitelist_from_rcvd *@apache.org
whitelist_from_rcvd *@nongnu.org

score ALL_TRUSTED 0 0 0 0
--- end ---

Here are the relevant lines from my sendmail.mc:

--- begin ---
INPUT_MAIL_FILTER(`greylist',`S=local:/var/milter-greylist/milter-greylist.sock')dnl
define(`confMILTER_MACROS_HELO', `{verify}, {cert_subject}')dnl
define(`confMILTER_MACROS_ENVFROM', `i, {auth_authen}')dnl

INPUT_MAIL_FILTER(`spamassassin', `S=local:/var/run/spamass.sock, F=, 
T=C:15m;S:4m;R:4m;E:10m')dnl
define(`confMILTER_MACROS_CONNECT',`b, j, _, {daemon_name}, {if_name}, 
{if_addr}')dnl

INPUT_MAIL_FILTER(`clamav-milter', 
`S=local:/var/run/clamav/clamav-milter.sock, F=T,T=S:4m;R:4m;E:10m')

--- end ---

I have no idea why it is doing this...  It was working fine and then this 
happened sort of out of the blue.  Any pointers?

Thanks!
Thomas

Re: Suddenly load average of 15-18???

Posted by Loren Wilton <lw...@earthlink.net>.

> symptom.  Anyone have any ideas why this would suddenly start?

Running Awl?  Running Bayes?  Since it starts immediately, it sounds like a
large expiry run for one or the other of them.  If you aren't running
either, then this may be the area where nobody really knows what is going
wrong.

        Loren

Re: Suddenly load average of 15-18???

Posted by Thomas Cameron <th...@camerontech.com>.

On Thu, 2005-05-12 at 18:10 +0200, Christoph Petersen wrote:
> Hi,
> 
> Thomas Cameron schrieb:
> > I just tried that and as soon as I restarted everything the load shot up
> > to ~ 6.  I had to kill everything and remove the SA milter.
> > 
> > I'd like to figure out what the root cause is rather than band-aid the
> > symptom.  Anyone have any ideas why this would suddenly start?
> > 
> 
> Do you use the sa-blacklist? I've recently had problems with it. My load
> was getting very high.

I have done nothing past the initial installation and adding
spamass-milter...  This is about as vanilla an installation as you can get.

Thomas

Re: Suddenly load average of 15-18???

Posted by Christoph Petersen <li...@peterschen.de>.

Hi,

Thomas Cameron schrieb:
> I just tried that and as soon as I restarted everything the load shot up
> to ~ 6.  I had to kill everything and remove the SA milter.
> 
> I'd like to figure out what the root cause is rather than band-aid the
> symptom.  Anyone have any ideas why this would suddenly start?
> 

Do you use the sa-blacklist? I've recently had problems with it. My load
was getting very high.

> Thomas

Greets
Christoph

Re: Suddenly load average of 15-18???

Posted by Thomas Cameron <th...@camerontech.com>.

On Thu, 2005-05-12 at 11:19 -0400, Stephen M. Przepiora wrote:
> Take a look at the switches you have in /etc/init.d/spamassassin and change
> them to run only 5 processes that die off after 15 or 20 scans.
> -m5 --max-conn-per-child=5
> Steve

I just tried that and as soon as I restarted everything the load shot up
to ~ 6.  I had to kill everything and remove the SA milter.

I'd like to figure out what the root cause is rather than band-aid the
symptom.  Anyone have any ideas why this would suddenly start?

Thomas

Re: Suddenly load average of 15-18???

Posted by "Stephen M. Przepiora" <sm...@ncoastsoft.com>.

Take a look at the switches you have in /etc/init.d/spamassassin and change
them to run only 5 processes that die off after 15 or 20 scans.
-m5 --max-conn-per-child=5
Steve

Thomas Cameron wrote:

> All -
>
> spamc is suddenly bringing my mail server to its knees.
>
> Running RHEL 4 with the spamassassin-3.0.1-0.EL4 (supplied by Red Hat) 
> and spamass-milter-0.3.0-3 (I made that RPM) along with 
> razor-agents-2.67-0, dcc-1.3.0-0 and pyzor-0.4.0-0.
>
> All of a sudden about two days ago spamc processes were chewing up the 
> machine - sendmail was actually rejecting messages because the load 
> average was so high!  This is a machine that is only used for about 6 
> users...  It only handles around a thousand to two thousand messages a 
> day.  I am the only admin on it and nothing has changed.
>
> Here is my local.cf:
>
> --- begin ---
> required_score 5
> report_safe 1
> rewrite_header subject **SPAM** _SCORE_
> ok_languages en
> ok_locales en
> use_dcc 1
> use_pyzor 1
> use_razor2 1
> whitelist_from_rcvd *@apache.org
> whitelist_from_rcvd *@nongnu.org
>
> score ALL_TRUSTED 0 0 0 0
> --- end ---
>
> Here are the relevant lines from my sendmail.mc:
>
> --- begin ---
> INPUT_MAIL_FILTER(`greylist',`S=local:/var/milter-greylist/milter-greylist.sock')dnl 
>
> define(`confMILTER_MACROS_HELO', `{verify}, {cert_subject}')dnl
> define(`confMILTER_MACROS_ENVFROM', `i, {auth_authen}')dnl
>
> INPUT_MAIL_FILTER(`spamassassin', `S=local:/var/run/spamass.sock, F=, 
> T=C:15m;S:4m;R:4m;E:10m')dnl
> define(`confMILTER_MACROS_CONNECT',`b, j, _, {daemon_name}, {if_name}, 
> {if_addr}')dnl
>
> INPUT_MAIL_FILTER(`clamav-milter', 
> `S=local:/var/run/clamav/clamav-milter.sock, F=T,T=S:4m;R:4m;E:10m')
>
> --- end ---
>
> I have no idea why it is doing this...  It was working fine and then 
> this happened sort of out of the blue.  Any pointers?
>
> Thanks!
> Thomas
>
>


Re: Suddenly load average of 15-18???

Posted by jdow <jd...@earthlink.net>.

From: "Thomas Cameron" <th...@camerontech.com>

> On Thu, 2005-05-12 at 09:31 -0700, Loren Wilton wrote:
> > > Is there something I should/could do about these expiry runs?  It seems
> > > odd that it's been like this for a couple of days now...  How could I
> > > know that this was the issue?
> >
> > Um, this isn't my area of expertise.  I suspect Matt or Justin will be along
> > with a workable suggestion fairly soon.  I'm pretty sure that there is some
> > logging to indicate when an expiry run happens, but I don't know precisely
> > what to look for.
>
> OK, I'll look for that.
>
> > At least with bayes there is a way you can turn off the auto-expire and then
> > use a cron job to schedule a manual expiry once a day/week/whatever.  I'm
> > not sure if similar functionality exists for awl.
>
> I don't know either.

Loren's suggestion is likely a very good one. "top" is a nice way to find
out WHAT is consuming the time. I do note that I do not use automatic
learning or whitelisting here. (Me paranoid. Me not trust 'em. So me
feed sa-learn manually. Me get outstanding results. Me happy. {^_-})

> > Did you happen to notice if all of your spamd children get fat at once, or
> > if just one of them got really huge?  All of them getting big might
> > indicate something changed with your rules files.  A single fat child would
> > be more indicative of an expiry run.
> >
> >         Loren
>
> It didn't really look like any of them were really fat...  The machine's
> drives just started hammering and the load average shot up.
>
> It's all cleared up now after a reboot.

For how long? You did not SOLVE the problem. You paid its blackmail.

{^_-}

Re: Suddenly load average of 15-18???

Posted by Thomas Cameron <th...@camerontech.com>.

On Thu, 2005-05-12 at 09:31 -0700, Loren Wilton wrote:
> > Is there something I should/could do about these expiry runs?  It seems
> > odd that it's been like this for a couple of days now...  How could I
> > know that this was the issue?
> 
> Um, this isn't my area of expertise.  I suspect Matt or Justin will be along
> with a workable suggestion fairly soon.  I'm pretty sure that there is some
> logging to indicate when an expiry run happens, but I don't know precisely
> what to look for.

OK, I'll look for that.

> At least with bayes there is a way you can turn off the auto-expire and then
> use a cron job to schedule a manual expiry once a day/week/whatever.  I'm
> not sure if similar functionality exists for awl.

I don't know either.

> Did you happen to notice if all of your spamd children get fat at once, or
> if just one of them got really huge?  All of them getting big might
> indicate something changed with your rules files.  A single fat child would
> be more indicative of an expiry run.
> 
>         Loren

It didn't really look like any of them were really fat...  The machine's
drives just started hammering and the load average shot up.

It's all cleared up now after a reboot.

Thomas

Re: Suddenly load average of 15-18???

Posted by Loren Wilton <lw...@earthlink.net>.

> Is there something I should/could do about these expiry runs?  It seems
> odd that it's been like this for a couple of days now...  How could I
> know that this was the issue?

Um, this isn't my area of expertise.  I suspect Matt or Justin will be along
with a workable suggestion fairly soon.  I'm pretty sure that there is some
logging to indicate when an expiry run happens, but I don't know precisely
what to look for.

At least with bayes there is a way you can turn off the auto-expire and then
use a cron job to schedule a manual expiry once a day/week/whatever.  I'm
not sure if similar functionality exists for awl.

Did you happen to notice if all of your spamd children get fat at once, or
if just one of them got really huge?  All of them getting big might
indicate something changed with your rules files.  A single fat child would
be more indicative of an expiry run.

        Loren

Re: Suddenly load average of 15-18???

Posted by "Martin G. Diehl" <md...@nac.net>.

jdow wrote:

> From: "Thomas Cameron" <th...@camerontech.com>
> 
>>On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote:
>>
>>>Usually a high load average means that a spamd child suddenly 
>>>(or possibly slowly) got fat, and you are out of memory and 
>>>thrashing to beat the band.
[snip]

> Again, study what causes the problem. Experiment gently if you must
> to characterize it properly. Then solve it. Don't reboot. That just
> defers the problem. It's like paying blackmail money. 
> The blackmailers never go away. And it's a constant drain.
You will enjoy reading about the "Dane-Geld";
http://www.poetryloverspage.com/poets/kipling/dane_geld.html

Or listen to it, as sung by Michael Longcor

[snip]

> 4) Live happily ever after or at least until the next crisis, which
> most likely will not be a repeat of this one.
> 
> This is one of the tricks of old age guile that allows us old folks to
> defeat youth and enthusiasm. {^_-}
> 
> {^_^}

-- 
Martin G. Diehl

Visit my online gallery: Renderosity, a 3D Artist's Community
http://www.renderosity.com/gallery.ez?ByArtist=Yes&Artist=MGD

So much wisdom and knowledge -- so little time and bandwidth.
--MGD

Reality: That which remains after you stop thinking about it.
--inspired by P. K. Dick

Re: Suddenly load average of 15-18???

Posted by jdow <jd...@earthlink.net>.

From: "Thomas Cameron" <th...@camerontech.com>

> On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote:
> > Usually a high load average means that a spamd child suddenly (or possibly
> > slowly) got fat, and you are out of memory and thrashing to beat the band.
> > The two most common causes of this seem to be Bayes expiry runs and Awl
> > expiry runs.  Sometimes though it can seemingly happen from some unknown
> > sequence of mail messages.
>
> Is there something I should/could do about these expiry runs?  It seems
> odd that it's been like this for a couple of days now...  How could I
> know that this was the issue?
>
> > How many children are you running?  What is the max lifetime (messages
> > processed) per child?  Limiting to probably 5 children, or maybe even less
> > in your case with so few users, and limiting to maybe 20-100 connections per
> > child will probably work around your problems.
>
> My rc file has this:
>
> SPAMDOPTIONS="-d -c -m5 --max-conn-per-child=5 -H"
>
> I just added the --max-conn-per-child=5 per Stephen Przepiora's
> suggestion but that didn't seem to help.
>
> > Oh, I'm assuming you have at least 512M or so.  If not, you might want to
> > cut down to only a couple of children, and definitely go with the lower
> > number of connections per child.
>
> Yes, I have 512M.  As I said - this has been working flawlessly since
> the server was installed several weeks ago.  It just suddenly went
> bonkers a couple of days ago.

I read your "solved" remark with some bemusement. Hammering the machine
over the head to solve this sort of problem is "just not the way it's
done" in the 'nix world. I suspect you have not really found the reason
yet. If you administer that machine with KDE or GNOME running and have
five spamds allowed, you are overloading the machine and driving it into
virtual memory thrashing. Cut down the number of spamds to perhaps 3,
-m3. Each spamd here with 3.0.2 gets up to about 60 megabytes before
it is harvested by max connections and a new one created. Five of those
use up a lot of memory, to be sure. I have X running here. But I have
a gigabyte of memory in the machine. I mostly manage to stay out of
swap so VM doesn't thrash.
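
As a rough sketch of the arithmetic behind that estimate: the 60 MB per
child and the 512 MB of RAM are figures from this thread, while the 300 MB
reserved for the desktop, sendmail and the other milters is purely an
assumed number for illustration. The function names are hypothetical.

```python
def spamd_footprint_mb(children, per_child_mb):
    """Worst-case memory used by spamd children, in megabytes."""
    return children * per_child_mb


def max_children(ram_mb, reserved_mb, per_child_mb):
    """Largest -m value whose children still fit in RAM after reserving
    memory for everything else on the box (an assumed figure)."""
    available = ram_mb - reserved_mb
    if available <= 0 or per_child_mb <= 0:
        return 0
    return available // per_child_mb


if __name__ == "__main__":
    # 5 children at ~60 MB each, as described in the thread.
    print(spamd_footprint_mb(5, 60), "MB for five children")
    # 512 MB box with an assumed 300 MB reserved for other software.
    print(max_children(512, 300, 60), "children fit without swapping")
```

The reserved figure is the part to measure on the real machine; the point
is only that the -m value should follow from memory, not from habit.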

The thing you really needed to do and seem to have not done is isolate
exactly what is causing the problem. Hammering it with a reboot just
means you get to reboot often. If you spend the time to figure out what
resource was exhausted on your machine and what was the chief villain
with regards to exhausting that resource then you can work to mitigate
the problem. And you can enjoy many year-long uptimes unless you have
to update the kernel. It saves wear and tear on you, freeing you to
apply the same principles to solve other problems that might appear.
It also frees the time to be proactive about the problems that might
appear.

As my first paragraph implies I suspect memory is the resource and
spamd coupled with KDE or GNOME might be the problem. It is quite
sufficient to drive the machine to the edge. And any OS gets pokey
when you get to the edge. The machine that has SA 2.63 on it is a
66 MHz Pentium with 256 megs of memory. It takes nearly a couple of
minutes to scan a message. It sits in console mode. It handles DNS
and the firewall as well as the email. It can handle the 1200 to
1500 emails per day that Loren and I were getting while I was still
on that machine. I have since installed 3.0.2 on a "spare" Linux
machine, my pet computer toy, and put my email filtering over on
it. I get on the order of a total of 1000 messages a day. It handles
them at under 1.5% of its potential. It has a gigabyte of memory so
X's requirements are not a threat to the email filtering. Everything
runs fast. I also tuned the number of spamds and connections per
spamd to use only a reasonable chunk of the machine. (I untuned it
recently to test a fix for a scoring bug in 3.0.2. It probably is
time to reduce the -m value. I don't NEED it as high as I have it
now. {^_-})

Again, study what causes the problem. Experiment gently if you must
to characterize it properly. Then solve it. Don't reboot. That just
defers the problem. It's like paying blackmail money. The blackmailers
never go away. And it's a constant drain.

1) What resource is becoming saturated? It's not always obvious when
you first look at the problem. Dig to find the real bottleneck. (If a
small 66MHz machine can handle nearly the volume I believe you cited
then "time" is not where you want to look on a machine ten times faster.)

2) Find what is consuming overmuch of that resource.

3) Mitigate the excessive resource usage.

4) Live happily ever after or at least until the next crisis, which
most likely will not be a repeat of this one.

This is one of the tricks of old age guile that allows us old folks to
defeat youth and enthusiasm. {^_-}

{^_^}
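
Step 1 in the list above (working out which resource is saturated) can be
sketched on Linux by reading /proc/meminfo and /proc/loadavg. This is only
a minimal sketch, assuming a Linux /proc filesystem; the function names
are hypothetical, and heavy swap use alongside a high load average is a
hint of thrashing rather than proof of it.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_fraction(fields):
    """Fraction of swap in use (0.0 when no swap is configured)."""
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - fields.get("SwapFree", 0)) / total


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
        with open("/proc/loadavg") as handle:
            load1 = float(handle.read().split()[0])
    except OSError as err:
        print(f"could not read /proc: {err}")
    else:
        print(f"1-minute load average: {load1}")
        print(f"swap in use: {swap_used_fraction(info):.0%}")
```

Running it while the load is high, and again when the machine is quiet,
gives a before/after comparison that a reboot would otherwise erase.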

Re: Suddenly load average of 15-18???

Posted by Thomas Cameron <th...@camerontech.com>.

On Thu, 2005-05-12 at 08:31 -0700, Loren Wilton wrote: 
> Usually a high load average means that a spamd child suddenly (or possibly
> slowly) got fat, and you are out of memory and thrashing to beat the band.
> The two most common causes of this seem to be Bayes expiry runs and Awl
> expiry runs.  Sometimes though it can seemingly happen from some unknown
> sequence of mail messages.

Is there something I should/could do about these expiry runs?  It seems
odd that it's been like this for a couple of days now...  How could I
know that this was the issue?

> How many children are you running?  What is the max lifetime (messages
> processed) per child?  Limiting to probably 5 children, or maybe even less
> in your case with so few users, and limiting to maybe 20-100 connections per
> child will probably work around your problems.

My rc file has this:

SPAMDOPTIONS="-d -c -m5 --max-conn-per-child=5 -H"

I just added the --max-conn-per-child=5 per Stephen Przepiora's
suggestion but that didn't seem to help.

> Oh, I'm assuming you have at least 512M or so.  If not, you might want to
> cut down to only a couple of children, and definitely go with the lower
> number of connections per child.

Yes, I have 512M.  As I said - this has been working flawlessly since
the server was installed several weeks ago.  It just suddenly went
bonkers a couple of days ago.

Thomas

Re: Suddenly load average of 15-18???

Posted by Loren Wilton <lw...@earthlink.net>.

Usually a high load average means that a spamd child suddenly (or possibly
slowly) got fat, and you are out of memory and thrashing to beat the band.
The two most common causes of this seem to be Bayes expiry runs and Awl
expiry runs.  Sometimes though it can seemingly happen from some unknown
sequence of mail messages.

How many children are you running?  What is the max lifetime (messages
processed) per child?  Limiting to probably 5 children, or maybe even less
in your case with so few users, and limiting to maybe 20-100 connections per
child will probably work around your problems.

Oh, I'm assuming you have at least 512M or so.  If not, you might want to
cut down to only a couple of children, and definitely go with the lower
number of connections per child.

        Loren

[SOLVED] Re: Suddenly load average of 15-18???

Posted by Thomas Cameron <th...@camerontech.com>.

OK, this is a weird solution...  I rebooted the server and all the
problems went away.  It's chuffing along happily now.

Memory leak, maybe?

Thomas

Re: Suddenly load average of 15-18???

Posted by Thomas Cameron <th...@camerontech.com>.

On Thu, 2005-05-12 at 10:53 -0500, Dan Nelson wrote:
> In the last episode (May 12), Thomas Cameron said:
> > spamc is suddenly bringing my mail server to its knees.
> > 
> > Running RHEL 4 with the spamassassin-3.0.1-0.EL4 (supplied by Red Hat) and 
> > spamass-milter-0.3.0-3 (I made that RPM) along with razor-agents-2.67-0, 
> > dcc-1.3.0-0 and pyzor-0.4.0-0.
> > 
> > All of a sudden about two days ago spamc processes were chewing up
> > the machine - sendmail was actually rejecting messages because the
> > load average was so high!  This is a machine that is only used for
> > about 6 users...  It only handles around a thousand to two thousand
> > messages a day.  I am the only admin on it and nothing has changed.
> 
> What's the average processing time for a message, and are you using any
> -i flags on your spamass-milter commandline?  Grep your maillog for 
> "in .* seconds," to get the timings.  If they're all under 10 seconds
> or so and you're not using -i, check for things like mail loops, or
> large outgoing mail bursts.  

It was up around 50-60 seconds per message.  I rebooted the machine and
it has cleared up.

Thanks for the help!

Thomas
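
Dan's suggestion of grepping the maillog for "in .* seconds," can be
sketched as a small script that pulls out the per-message scan times and
summarizes them. This is a minimal sketch: the sample log lines are
illustrative only (real entries will differ in detail), and the function
names are hypothetical.

```python
import re
import statistics

# Matches the "in N seconds," fragment suggested as the grep pattern.
TIMING = re.compile(r"\bin (\d+(?:\.\d+)?) seconds,")


def scan_times(lines):
    """Return the scan durations (seconds) found in an iterable of log lines."""
    times = []
    for line in lines:
        match = TIMING.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


def summarize(times):
    """Return (count, mean, max) for the durations, or None if there are none."""
    if not times:
        return None
    return len(times), statistics.mean(times), max(times)


if __name__ == "__main__":
    # Illustrative lines only, not taken from a real log.
    sample = [
        "May 12 10:01:02 mail spamd[1234]: clean message (0.3/5.0) "
        "for root:0 in 2.5 seconds, 1820 bytes.",
        "May 12 10:01:40 mail spamd[1235]: identified spam (12.1/5.0) "
        "for root:0 in 57.5 seconds, 4096 bytes.",
        "May 12 10:02:00 mail sendmail[999]: unrelated line",
    ]
    print(summarize(scan_times(sample)))
```

To run it against a real log, pass the open file object to scan_times()
instead of the sample list; the log path depends on the system.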