You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Johann Spies <js...@sun.ac.za> on 2008/03/10 09:46:04 UTC

Re: imeout-problem - additional information

Hello Daryl,

Thanks for your reply.

On Fri, Mar 07, 2008 at 02:23:13PM -0500, Daryl C. W. O'Shea wrote:
> On 05/03/2008 5:44 AM, Johann Spies wrote:
> > On Thu, Feb 28, 2008 at 02:44:02PM +0200, Johann Spies wrote:
> >> On a new mailserver with 8Gb ram and 2xdual-core CPU's we get regular
> >> messages in the log:
> >>
> >> Feb 28 12:52:43 mail2 spamd[32558]: prefork: child states: BIBBB
> >> Feb 28 12:52:44 mail2 spamd[459]: rules: failed to run TVD_STOCK1 test, skipping:
> >> Feb 28 12:52:44 mail2 spamd[459]:  (child processing timeout at /usr/sbin/spamd line 1246.
> >> Feb 28 12:52:44 mail2 spamd[459]: )
> >>
> >> And every time it involves TVD_STOCK1.
> 
> The rule doesn't look particular bad.  Have you been able to capture a
> sample email that causes this?  Perhaps its an issue with a large
> text/plain body with no line breaks.


Unfortunately I do not know which messages caused the problem.

> 
> 3.2.3 changes the way DNS timeouts are calculated (SA used to time out
> its second round of DNS lookups way too early).  Is the machine (or
> specifically the spamd children) actually busy, or is everything sitting
> rather idle.

I am comparing an older server (mail1) with the new server (mail2) in
this case.  Both running exim, clamav and spamassassin.  This
statistics is over a period of about 24 hours on 7/8 March last week.


	      	   mail1	     mail2
SA-version    	   3.0.3-2sarge1     3.2.3-0.volatile1
Messages scanned   43338	     22873
Timeouts (exim)	   0		     36
--max-children	   5		     15
Ram		   4G		     7G


I have activated the --debug option now and so far have seen 14
dns-timeouts in the past 40 minutes on mail2.

Regards
Johann

-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "What doth it profit, my brethren, though a man say he
      hath faith, and have not works? can faith save him? If
      a brother or sister be naked, and destitute of daily
      food, And one of you say unto them, Depart in peace,
      be ye warmed and filled; notwithstanding ye give them
      not those things which are needful to the body; what
      doth it profit? Even so faith, if it hath not works,
      is dead, being alone."          James 2:14-17 

Re: imeout-problem - additional information

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 11/03/2008 4:00 AM, Johann Spies wrote:
> I have discovered a mistake in my exim configuration that caused exim
> to hand over all sizes to spamassassin.  This was the same on the old
> and new servers, but apparently attributed greatly towards timeouts on
> the new server.  Correcting that mistake to make exim only hand
> messages smaller than 100k to spamassassin, had a clear effect on the
> timeouts.  Since that change in the configuration - about 18 hours
> ago, there was not one timeout.
> 
> It seems that the newer version of Spamassassin handles large messages
> less efficiently than the older version.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5717


Re: imeout-problem - additional information

Posted by Johann Spies <js...@sun.ac.za>.
I have discovered a mistake in my exim configuration that caused exim
to hand over all sizes to spamassassin.  This was the same on the old
and new servers, but apparently attributed greatly towards timeouts on
the new server.  Correcting that mistake to make exim only hand
messages smaller than 100k to spamassassin, had a clear effect on the
timeouts.  Since that change in the configuration - about 18 hours
ago, there was not one timeout.

It seems that the newer version of Spamassassin handles large messages
less efficiently than the older version.

Regards
Johann

-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "I am the vine, ye are the branches: He that abideth in
      me, and I in him, the same bringeth forth much fruit: 
      for without me ye can do nothing."              
                                           John 15:5 

Re: imeout-problem - additional information

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 10/03/2008 6:36 AM, Johann Spies wrote:
> On Mon, Mar 10, 2008 at 05:24:24AM -0400, Daryl C. W. O'Shea wrote:
>> Try to roughly compare the actual amount of CPU time that the spamd
>> children are using on each server.  3.2 will use more resources than
> 
> How do I do that?  Just watching 'top' is not a reliable method I
> suspect.

Roughly. :)  Yeah, use top.  If messages are really taking on average of
a minute (or even a peak adjusted average of ~15 seconds) to process you
should be able to easily see if a bunch of apparently busy children
aren't really doing anything (are using a percentage of CPU time much
less than 100% * #cores / #active children, less whatever the MTA is
consuming).

>> Are the timeouts for the same zone(s)?  
> 
> Most of them are lookups against list.dsbl.org.   A dig
> 146.226.86.70.list.dsbl.org that timed out according to the log, took
> 352 milliseconds when checked by hand.  A little bit longer on the old
> one.

That's likely due to your geography.  You could reduce that by running a
copy of the zone locally with rbldnsd: http://dsbl.org/usage

>> Test lookups against those zones manually.
>> Is your upstream (or downstream) bandwidth usage near full capacity?  
> 
> It is 92.5% full at the moment.

A number of other DNSBLs can be rsync'd and run locally to save you even
more bandwidth and time.

92% is OK so long as it's near a peak and there's no traffic shaping
going on (unless its not a guaranteed rate interface which has since
slowed down to where 92% is a lot closer to wh.  With regards to DNS,
congested links and performance don't mix.

>> To the two servers share the same DNS setup?  
> 
> Yes

You would expect DNS or network congestion issues to affect both servers
equally, however, remember that 3.0 will (to its detriment) timeout DNS
queries a *lot* faster than 3.2.  So 3.0 will mask network issues at the
expense of accuracy, while 3.2 won't at the expense of time (but not
overall throughput as long as you have the memory for the additional
children that the CPU now has time to run).

>> Is there something else running on the new server that is driving
>> the load average up (a common cause of the "child processing
>> timeout" message)?
> 
> The load average on the new server is lower than that on the old
> server - as expected.  For the past 24 hours the highest load average
> was 1.6.

That's fine then.  I suspect that the child timeouts you were/are seeing
due to the TVD_STOCK1 test are due to a bug triggered by a specific form
of input.  Capturing a message (preferably a few) that triggers the
error may confirm that.

>> A little more work... review the debug output for a bunch of messages
>> (you'll have to separate each message's debug info from the combined
>> debug log).  What parts of the scanning process are taking the most
>> amount of time?
> 
> 
> I will do that.

Great.  That's your best and most straightforward way to concretely tell
you what's going on.

Daryl


Re: imeout-problem - additional information

Posted by Johann Spies <js...@sun.ac.za>.
On Mon, Mar 10, 2008 at 05:24:24AM -0400, Daryl C. W. O'Shea wrote:
> Try to roughly compare the actual amount of CPU time that the spamd
> children are using on each server.  3.2 will use more resources than

How do I do that?  Just watching 'top' is not a reliable method I
suspect.

> Are the timeouts for the same zone(s)?  

Most of them are lookups against list.dsbl.org.   A dig
146.226.86.70.list.dsbl.org that timed out according to the log, took
352 milliseconds when checked by hand.  A little bit longer on the old
one.


> Test lookups against those zones manually.
> Is your upstream (or downstream) bandwidth usage near full capacity?  

It is 92.5% full at the moment.

> To the two servers share the same DNS setup?  

Yes

> Is there something else running on the new server that is driving
> the load average up (a common cause of the "child processing
> timeout" message)?

The load average on the new server is lower than that on the old
server - as expected.  For the past 24 hours the highest load average
was 1.6.

> 
> A little more work... review the debug output for a bunch of messages
> (you'll have to separate each message's debug info from the combined
> debug log).  What parts of the scanning process are taking the most
> amount of time?


I will do that.

Regards
Johann

-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "What doth it profit, my brethren, though a man say he
      hath faith, and have not works? can faith save him? If
      a brother or sister be naked, and destitute of daily
      food, And one of you say unto them, Depart in peace,
      be ye warmed and filled; notwithstanding ye give them
      not those things which are needful to the body; what
      doth it profit? Even so faith, if it hath not works,
      is dead, being alone."          James 2:14-17 

Re: imeout-problem - additional information

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 10/03/2008 4:46 AM, Johann Spies wrote:
> Hello Daryl,
> 
> Thanks for your reply.
> 
> On Fri, Mar 07, 2008 at 02:23:13PM -0500, Daryl C. W. O'Shea wrote:
>> On 05/03/2008 5:44 AM, Johann Spies wrote:
>>> On Thu, Feb 28, 2008 at 02:44:02PM +0200, Johann Spies wrote:
>>>> On a new mailserver with 8Gb ram and 2xdual-core CPU's we get regular
>>>> messages in the log:
>>>>
>>>> Feb 28 12:52:43 mail2 spamd[32558]: prefork: child states: BIBBB
>>>> Feb 28 12:52:44 mail2 spamd[459]: rules: failed to run TVD_STOCK1 test, skipping:
>>>> Feb 28 12:52:44 mail2 spamd[459]:  (child processing timeout at /usr/sbin/spamd line 1246.
>>>> Feb 28 12:52:44 mail2 spamd[459]: )
>>>>
>>>> And every time it involves TVD_STOCK1.
>> The rule doesn't look particular bad.  Have you been able to capture a
>> sample email that causes this?  Perhaps its an issue with a large
>> text/plain body with no line breaks.
> 
> 
> Unfortunately I do not know which messages caused the problem.
> 
>> 3.2.3 changes the way DNS timeouts are calculated (SA used to time out
>> its second round of DNS lookups way too early).  Is the machine (or
>> specifically the spamd children) actually busy, or is everything sitting
>> rather idle.
> 
> I am comparing an older server (mail1) with the new server (mail2) in
> this case.  Both running exim, clamav and spamassassin.  This
> statistics is over a period of about 24 hours on 7/8 March last week.
> 
> 
> 	      	   mail1	     mail2
> SA-version    	   3.0.3-2sarge1     3.2.3-0.volatile1
> Messages scanned   43338	     22873
> Timeouts (exim)	   0		     36
> --max-children	   5		     15
> Ram		   4G		     7G

Try to roughly compare the actual amount of CPU time that the spamd
children are using on each server.  3.2 will use more resources than
3.0, but not twice as much (and throughput shouldn't be 1/2 on faster
hardware).  I suspect you'll find the children on the new server mainly
idle compared to on the old server.

> I have activated the --debug option now and so far have seen 14
> dns-timeouts in the past 40 minutes on mail2.

Keep an eye on your log volume free space if you're not rotating out
logs based on size.

Are the timeouts for the same zone(s)?  Test lookups against those zones
manually.  Is your upstream (or downstream) bandwidth usage near full
capacity?  To the two servers share the same DNS setup?  Is there
something else running on the new server that is driving the load
average up (a common cause of the "child processing timeout" message)?

A little more work... review the debug output for a bunch of messages
(you'll have to separate each message's debug info from the combined
debug log).  What parts of the scanning process are taking the most
amount of time?

Daryl