You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Johann Spies <js...@sun.ac.za> on 2008/02/28 13:44:02 UTC

timeout-problem

On a new mailserver with 8Gb ram and 2xdual-core CPU's we get regular
messages in the log:

Feb 28 12:52:43 mail2 spamd[32558]: prefork: child states: BIBBB
Feb 28 12:52:44 mail2 spamd[459]: rules: failed to run TVD_STOCK1 test, skipping:
Feb 28 12:52:44 mail2 spamd[459]:  (child processing timeout at /usr/sbin/spamd line 1246.
Feb 28 12:52:44 mail2 spamd[459]: )

And every time it involves TVD_STOCK1.

Is this a bug in Spamassassin or in the rule? How do I fix it?

Version:  3.2.3-0.volatile1 (on Debian Stable).

Defaults: OPTIONS="--create-prefs --max-children 15 --helper-home-dir"

Regards

Johann



-- 
Johann Spies Telefoon: 021-808 4036 Informasietegnologie, Universiteit
van Stellenbosch

     "These things have I written unto you that believe on 
      the name of the Son of God; that ye may know that ye 
      have eternal life, and that ye may believe on the name
      of the Son of God."        I John 5:13 

Re: imeout-problem - additional information

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 11/03/2008 4:00 AM, Johann Spies wrote:
> I have discovered a mistake in my exim configuration that caused exim
> to hand over all sizes to spamassassin.  This was the same on the old
> and new servers, but apparently attributed greatly towards timeouts on
> the new server.  Correcting that mistake to make exim only hand
> messages smaller than 100k to spamassassin, had a clear effect on the
> timeouts.  Since that change in the configuration - about 18 hours
> ago, there was not one timeout.
> 
> It seems that the newer version of Spamassassin handles large messages
> less efficiently than the older version.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5717


Re: imeout-problem - additional information

Posted by Johann Spies <js...@sun.ac.za>.
I have discovered a mistake in my exim configuration that caused exim
to hand over all sizes to spamassassin.  This was the same on the old
and new servers, but apparently attributed greatly towards timeouts on
the new server.  Correcting that mistake to make exim only hand
messages smaller than 100k to spamassassin, had a clear effect on the
timeouts.  Since that change in the configuration - about 18 hours
ago, there was not one timeout.

It seems that the newer version of Spamassassin handles large messages
less efficiently than the older version.

Regards
Johann

-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "I am the vine, ye are the branches: He that abideth in
      me, and I in him, the same bringeth forth much fruit: 
      for without me ye can do nothing."              
                                           John 15:5 

Re: imeout-problem - additional information

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 10/03/2008 6:36 AM, Johann Spies wrote:
> On Mon, Mar 10, 2008 at 05:24:24AM -0400, Daryl C. W. O'Shea wrote:
>> Try to roughly compare the actual amount of CPU time that the spamd
>> children are using on each server.  3.2 will use more resources than
> 
> How do I do that?  Just watching 'top' is not a reliable method I
> suspect.

Roughly. :)  Yeah, use top.  If messages are really taking on average of
a minute (or even a peak adjusted average of ~15 seconds) to process you
should be able to easily see if a bunch of apparently busy children
aren't really doing anything (are using a percentage of CPU time much
less than 100% * #cores / #active children, less whatever the MTA is
consuming).

>> Are the timeouts for the same zone(s)?  
> 
> Most of them are lookups against list.dsbl.org.   A dig
> 146.226.86.70.list.dsbl.org that timed out according to the log, took
> 352 milliseconds when checked by hand.  A little bit longer on the old
> one.

That's likely due to your geography.  You could reduce that by running a
copy of the zone locally with rbldnsd: http://dsbl.org/usage

>> Test lookups against those zones manually.
>> Is your upstream (or downstream) bandwidth usage near full capacity?  
> 
> It is 92.5% full at the moment.

A number of other DNSBLs can be rsync'd and run locally to save you even
more bandwidth and time.

92% is OK so long as it's near a peak and there's no traffic shaping
going on (unless its not a guaranteed rate interface which has since
slowed down to where 92% is a lot closer to wh.  With regards to DNS,
congested links and performance don't mix.

>> To the two servers share the same DNS setup?  
> 
> Yes

You would expect DNS or network congestion issues to affect both servers
equally, however, remember that 3.0 will (to its detriment) timeout DNS
queries a *lot* faster than 3.2.  So 3.0 will mask network issues at the
expense of accuracy, while 3.2 won't at the expense of time (but not
overall throughput as long as you have the memory for the additional
children that the CPU now has time to run).

>> Is there something else running on the new server that is driving
>> the load average up (a common cause of the "child processing
>> timeout" message)?
> 
> The load average on the new server is lower than that on the old
> server - as expected.  For the past 24 hours the highest load average
> was 1.6.

That's fine then.  I suspect that the child timeouts you were/are seeing
due to the TVD_STOCK1 test are due to a bug triggered by a specific form
of input.  Capturing a message (preferably a few) that triggers the
error may confirm that.

>> A little more work... review the debug output for a bunch of messages
>> (you'll have to separate each message's debug info from the combined
>> debug log).  What parts of the scanning process are taking the most
>> amount of time?
> 
> 
> I will do that.

Great.  That's your best and most straightforward way to concretely tell
you what's going on.

Daryl


Re: imeout-problem - additional information

Posted by Johann Spies <js...@sun.ac.za>.
On Mon, Mar 10, 2008 at 05:24:24AM -0400, Daryl C. W. O'Shea wrote:
> Try to roughly compare the actual amount of CPU time that the spamd
> children are using on each server.  3.2 will use more resources than

How do I do that?  Just watching 'top' is not a reliable method I
suspect.

> Are the timeouts for the same zone(s)?  

Most of them are lookups against list.dsbl.org.   A dig
146.226.86.70.list.dsbl.org that timed out according to the log, took
352 milliseconds when checked by hand.  A little bit longer on the old
one.


> Test lookups against those zones manually.
> Is your upstream (or downstream) bandwidth usage near full capacity?  

It is 92.5% full at the moment.

> To the two servers share the same DNS setup?  

Yes

> Is there something else running on the new server that is driving
> the load average up (a common cause of the "child processing
> timeout" message)?

The load average on the new server is lower than that on the old
server - as expected.  For the past 24 hours the highest load average
was 1.6.

> 
> A little more work... review the debug output for a bunch of messages
> (you'll have to separate each message's debug info from the combined
> debug log).  What parts of the scanning process are taking the most
> amount of time?


I will do that.

Regards
Johann

-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "What doth it profit, my brethren, though a man say he
      hath faith, and have not works? can faith save him? If
      a brother or sister be naked, and destitute of daily
      food, And one of you say unto them, Depart in peace,
      be ye warmed and filled; notwithstanding ye give them
      not those things which are needful to the body; what
      doth it profit? Even so faith, if it hath not works,
      is dead, being alone."          James 2:14-17 

Re: imeout-problem - additional information

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 10/03/2008 4:46 AM, Johann Spies wrote:
> Hello Daryl,
> 
> Thanks for your reply.
> 
> On Fri, Mar 07, 2008 at 02:23:13PM -0500, Daryl C. W. O'Shea wrote:
>> On 05/03/2008 5:44 AM, Johann Spies wrote:
>>> On Thu, Feb 28, 2008 at 02:44:02PM +0200, Johann Spies wrote:
>>>> On a new mailserver with 8Gb ram and 2xdual-core CPU's we get regular
>>>> messages in the log:
>>>>
>>>> Feb 28 12:52:43 mail2 spamd[32558]: prefork: child states: BIBBB
>>>> Feb 28 12:52:44 mail2 spamd[459]: rules: failed to run TVD_STOCK1 test, skipping:
>>>> Feb 28 12:52:44 mail2 spamd[459]:  (child processing timeout at /usr/sbin/spamd line 1246.
>>>> Feb 28 12:52:44 mail2 spamd[459]: )
>>>>
>>>> And every time it involves TVD_STOCK1.
>> The rule doesn't look particular bad.  Have you been able to capture a
>> sample email that causes this?  Perhaps its an issue with a large
>> text/plain body with no line breaks.
> 
> 
> Unfortunately I do not know which messages caused the problem.
> 
>> 3.2.3 changes the way DNS timeouts are calculated (SA used to time out
>> its second round of DNS lookups way too early).  Is the machine (or
>> specifically the spamd children) actually busy, or is everything sitting
>> rather idle.
> 
> I am comparing an older server (mail1) with the new server (mail2) in
> this case.  Both running exim, clamav and spamassassin.  This
> statistics is over a period of about 24 hours on 7/8 March last week.
> 
> 
> 	      	   mail1	     mail2
> SA-version    	   3.0.3-2sarge1     3.2.3-0.volatile1
> Messages scanned   43338	     22873
> Timeouts (exim)	   0		     36
> --max-children	   5		     15
> Ram		   4G		     7G

Try to roughly compare the actual amount of CPU time that the spamd
children are using on each server.  3.2 will use more resources than
3.0, but not twice as much (and throughput shouldn't be 1/2 on faster
hardware).  I suspect you'll find the children on the new server mainly
idle compared to on the old server.

> I have activated the --debug option now and so far have seen 14
> dns-timeouts in the past 40 minutes on mail2.

Keep an eye on your log volume free space if you're not rotating out
logs based on size.

Are the timeouts for the same zone(s)?  Test lookups against those zones
manually.  Is your upstream (or downstream) bandwidth usage near full
capacity?  To the two servers share the same DNS setup?  Is there
something else running on the new server that is driving the load
average up (a common cause of the "child processing timeout" message)?

A little more work... review the debug output for a bunch of messages
(you'll have to separate each message's debug info from the combined
debug log).  What parts of the scanning process are taking the most
amount of time?

Daryl


Re: imeout-problem - additional information

Posted by Johann Spies <js...@sun.ac.za>.
Hello Daryl,

Thanks for your reply.

On Fri, Mar 07, 2008 at 02:23:13PM -0500, Daryl C. W. O'Shea wrote:
> On 05/03/2008 5:44 AM, Johann Spies wrote:
> > On Thu, Feb 28, 2008 at 02:44:02PM +0200, Johann Spies wrote:
> >> On a new mailserver with 8Gb ram and 2xdual-core CPU's we get regular
> >> messages in the log:
> >>
> >> Feb 28 12:52:43 mail2 spamd[32558]: prefork: child states: BIBBB
> >> Feb 28 12:52:44 mail2 spamd[459]: rules: failed to run TVD_STOCK1 test, skipping:
> >> Feb 28 12:52:44 mail2 spamd[459]:  (child processing timeout at /usr/sbin/spamd line 1246.
> >> Feb 28 12:52:44 mail2 spamd[459]: )
> >>
> >> And every time it involves TVD_STOCK1.
> 
> The rule doesn't look particular bad.  Have you been able to capture a
> sample email that causes this?  Perhaps its an issue with a large
> text/plain body with no line breaks.


Unfortunately I do not know which messages caused the problem.

> 
> 3.2.3 changes the way DNS timeouts are calculated (SA used to time out
> its second round of DNS lookups way too early).  Is the machine (or
> specifically the spamd children) actually busy, or is everything sitting
> rather idle.

I am comparing an older server (mail1) with the new server (mail2) in
this case.  Both running exim, clamav and spamassassin.  This
statistics is over a period of about 24 hours on 7/8 March last week.


	      	   mail1	     mail2
SA-version    	   3.0.3-2sarge1     3.2.3-0.volatile1
Messages scanned   43338	     22873
Timeouts (exim)	   0		     36
--max-children	   5		     15
Ram		   4G		     7G


I have activated the --debug option now and so far have seen 14
dns-timeouts in the past 40 minutes on mail2.

Regards
Johann

-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "What doth it profit, my brethren, though a man say he
      hath faith, and have not works? can faith save him? If
      a brother or sister be naked, and destitute of daily
      food, And one of you say unto them, Depart in peace,
      be ye warmed and filled; notwithstanding ye give them
      not those things which are needful to the body; what
      doth it profit? Even so faith, if it hath not works,
      is dead, being alone."          James 2:14-17 

Re: timeout-problem - additional information

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 05/03/2008 5:44 AM, Johann Spies wrote:
> On Thu, Feb 28, 2008 at 02:44:02PM +0200, Johann Spies wrote:
>> On a new mailserver with 8Gb ram and 2xdual-core CPU's we get regular
>> messages in the log:
>>
>> Feb 28 12:52:43 mail2 spamd[32558]: prefork: child states: BIBBB
>> Feb 28 12:52:44 mail2 spamd[459]: rules: failed to run TVD_STOCK1 test, skipping:
>> Feb 28 12:52:44 mail2 spamd[459]:  (child processing timeout at /usr/sbin/spamd line 1246.
>> Feb 28 12:52:44 mail2 spamd[459]: )
>>
>> And every time it involves TVD_STOCK1.

The rule doesn't look particular bad.  Have you been able to capture a
sample email that causes this?  Perhaps its an issue with a large
text/plain body with no line breaks.

>> Is this a bug in Spamassassin or in the rule? How do I fix it?
>>
>> Version:  3.2.3-0.volatile1 (on Debian Stable).
>>
>> Defaults: OPTIONS="--create-prefs --max-children 15 --helper-home-dir"
> 
> I have seen no reaction to the message quoted here.
> 
> After a "score TVD_STOCK1 0" the "child processing timeout" messages
> stopped, but exim is still complaining from time to time (11 times so
> far today and 239 times yesterday): 
> 
> error reading from spamd socket: Connection timed out
> 
> In /var/log/mail.info the message "prefork: server reached
> --max-children setting, consider raising it" appeared 188 times
> yesterday and 27 times today in the logs on the new server
> (spamassassin version 3.2.3-0.volatile1) with the --max-children 15
> setting.  On the older server (--max-children 5 and version
> 3.0.3-2sarge1) that has handled about double the numer of emails
> during the past 24 hours no such problem was reported either by exim
> or spamassassin.

3.2.3 changes the way DNS timeouts are calculated (SA used to time out
its second round of DNS lookups way too early).  Is the machine (or
specifically the spamd children) actually busy, or is everything sitting
rather idle.

You only say how many error messages you've received but not actually
how many messages you're trying to scan (and what your peak volume is),
so I can't tell how frequent this really occurs or recommend how many
children you may need.

Additionally, your mail (spam) volume may have gone up enough recently
that you'd be seeing the same messages with 3.0.3.  I've been seeing the
normal rapid annual increase in volume (between November and March) on
nearly every system I have access to.

> Any idea on what is going on here?

You could run spamd in debug mode to see where time is being spent doing
whatever it is doing.

Daryl


Re: timeout-problem - additional information

Posted by Johann Spies <js...@sun.ac.za>.
On Thu, Feb 28, 2008 at 02:44:02PM +0200, Johann Spies wrote:
> On a new mailserver with 8Gb ram and 2xdual-core CPU's we get regular
> messages in the log:
> 
> Feb 28 12:52:43 mail2 spamd[32558]: prefork: child states: BIBBB
> Feb 28 12:52:44 mail2 spamd[459]: rules: failed to run TVD_STOCK1 test, skipping:
> Feb 28 12:52:44 mail2 spamd[459]:  (child processing timeout at /usr/sbin/spamd line 1246.
> Feb 28 12:52:44 mail2 spamd[459]: )
> 
> And every time it involves TVD_STOCK1.
> 
> Is this a bug in Spamassassin or in the rule? How do I fix it?
> 
> Version:  3.2.3-0.volatile1 (on Debian Stable).
> 
> Defaults: OPTIONS="--create-prefs --max-children 15 --helper-home-dir"

I have seen no reaction to the message quoted here.

After a "score TVD_STOCK1 0" the "child processing timeout" messages
stopped, but exim is still complaining from time to time (11 times so
far today and 239 times yesterday): 

error reading from spamd socket: Connection timed out

In /var/log/mail.info the message "prefork: server reached
--max-children setting, consider raising it" appeared 188 times
yesterday and 27 times today in the logs on the new server
(spamassassin version 3.2.3-0.volatile1) with the --max-children 15
setting.  On the older server (--max-children 5 and version
3.0.3-2sarge1) that has handled about double the numer of emails
during the past 24 hours no such problem was reported either by exim
or spamassassin.

That does not make sense to me unless there is some bug in the newer
version of Spamassassin.

Any idea on what is going on here?

Regards
Johann
-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "This is the day which the LORD hath made; we will 
      rejoice and be glad in it."           Psalms 118:24