Posted to dev@spamassassin.apache.org by Theo Van Dinter <fe...@apache.org> on 2006/09/03 05:58:33 UTC

Re: Weekly run ...

My weekly run is still going (19 hours now -- MUCH longer than it used to
take).  I started watching the processes, and I'm fairly positive we have a
major memory leak somewhere.

My processes are limited to 200M for data, and I watched processes start
from ~90M and grow to 200M, when one gets killed and the other child
procs are restarted:
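(As an aside, a per-process cap like the 200M data limit described above is typically imposed with the shell's ulimit builtin before launching the run; a sketch assuming bash, with 200 MB expressed as 204800 KB:)

```shell
# Cap the data segment at ~200 MB (204800 KB) for this shell and
# its children, then check the limit now in force.
ulimit -d 204800
ulimit -d            # prints: 204800
# ./mass-check ...   # the run itself would be launched here
```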

29593 corpus    28  10  199m 166m 1964 R 12.0  8.3   5:59.69 mass-check

So something happened in August (probably between the 11th and the 19th,
since the 19th was the first time I saw the problem).  I haven't dug
around to debug it yet.  <sigh>


On Mon, Aug 21, 2006 at 12:00:01AM -0400, Theo Van Dinter wrote:
> Did anyone else have similar issues on Saturday?  It's the first day I saw
> this:
> 
> On Sat, Aug 19, 2006 at 06:56:10PM -0400, Cron Daemon wrote:
> [...]
> > status: starting run stage                               now: 2006-08-19 05:12:52
> > Out of memory!
> > archive-iterator: readline failed, attempting to recover
> > Out of memory!
> > archive-iterator: readline failed, attempting to recover
> > Out of memory!
> > archive-iterator: readline failed, attempting to recover
> > status:  10% ham: 0      spam: 14802  date: 2006-08-06   now: 2006-08-19 06:28:20
> > Out of memory!
> > archive-iterator: readline failed, attempting to recover
> > Out of memory!
> > archive-iterator: readline failed, attempting to recover
> [...]
> > Out of memory!
> > *** glibc detected *** /usr/bin/perl: double free or corruption (out): 0x000000000a172af0 ***
> [...]
> > Out of memory!
> > *** glibc detected *** /usr/bin/perl: free(): invalid pointer: 0x0000003a5574be00 ***
> [...]
> > Attempt to free temp prematurely: SV 0x8c1f900, Perl interpreter: 0x505010 at /home/corpus/SA/spamassassin-corpora/masses/../lib/Mail/SpamAssassin/Timeout.pm line 142.
> > Attempt to free unreferenced scalar: SV 0x8c1f900, Perl interpreter: 0x505010 at /home/corpus/SA/spamassassin-corpora/masses/../lib/Mail/SpamAssassin/Timeout.pm line 163.
> > Out of memory!
> 
> -- 
> Randomly Generated Tagline:
> "When in doubt, parenthesize.  At the very least it will let some
>  poor schmuck bounce on the % key in vi."   - Larry Wall



-- 
Randomly Generated Tagline:
"Beaten paths are for beaten men."         - Unknown

Re: Weekly run ...

Posted by Theo Van Dinter <fe...@apache.org>.
On Sun, Sep 03, 2006 at 11:13:46PM +0100, Justin Mason wrote:
> > I do think that we still have leaks elsewhere since the process memory
> > continues to grow, but it's much lower than before.  I'll keep digging.
> 
> Please do...
> 
> Last time I looked, there were some tiny leaks over time in the perl
> interpreter, but nothing to be worried about (since we restart
> every N messages in spamd and mass-check).

Well, that's not a reason to not be worried about them. :)    Not everyone
restarts their processes, so it's a problem we should deal with if we can.

The more I look at things, the more I believe there's a major leak
related to the URIDNSBL plugin.  Unfortunately, I can no longer figure
out how that code works due to the complexity.  I've spent a good amount
of time looking at it and find that it's too confusing to follow.

In short, if I run just SURBL rules, I OOM.  If I run other network
tests (say, RCVD_IN_XBL), I don't.  If I don't run net rules at all, I
don't OOM.  Therefore it seems as if the plugin is the issue.  I noticed
that URIDNSBL is the only thing using the AsyncLoop code, so there could
be a problem in there instead/as well.

Justin, since you were the one who put in AsyncLoop, can you poke around and
figure out what's causing the issue?

-- 
Randomly Generated Tagline:
"Sorry, not tonight.  I have to floss my cat." - Random Turn-down line

Re: Weekly run ...

Posted by Theo Van Dinter <fe...@apache.org>.
On Sat, Sep 02, 2006 at 11:58:33PM -0400, Theo Van Dinter wrote:
> My weekly run is still going (19 hours now -- MUCH longer than it used to
> take).  I started watching the processes, and I'm fairly positive we have a
> major memory leak somewhere.
> 
> So something happened in August (probably between the 11th and the 19th,
> since the 19th was the first time I saw the problem).  I haven't dug
> around to debug it yet.  <sigh>

Did some more debugging.  The problem is r432244:

r432244 | jm | 2006-08-17 10:09:27 -0400 (Thu, 17 Aug 2006) | 1 line
optimisation: don't delete all keys in PerMsgStatus, just the ones that need it

432243 runs fine, but 432244 causes OOMs.  I poked around some more to
determine the other keys that need to be deleted, and was going to just
add them in, but I got to thinking: since any plugin can add data to
PMS, there can be any number of circular refs that we don't know about, so
deleting all of the keys is the only way we can be sure to clear them out.
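The circular-reference hazard described here can be shown with a toy Perl snippet (illustrative only; `Toy` is not SpamAssassin code). Perl uses reference counting, so an object that stores a closure capturing itself is never reclaimed at scope exit:

```perl
use strict;
use warnings;

package Toy;
sub new {
    my $self = bless {}, shift;
    # A plugin-style stash: a closure that captures $self,
    # creating the cycle  $self -> {cb} -> $self
    $self->{cb} = sub { $self };
    return $self;
}
sub DESTROY { print "reclaimed\n" }

package main;
{
    my $leaky = Toy->new;
}   # no "reclaimed" printed here: the cycle keeps the refcount above zero
print "scope exited\n";
# The object lingers until global destruction -- in a long-lived
# process that never exits, it simply leaks.
```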

FWIW: I found resolver and uribl_scanstate which needed to be removed, but
even then I still got OOMs; it just took longer.  I assume there are other
net-test-related things that I'm somehow not seeing via Data::Dumper.
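One tool that can surface refs that Data::Dumper doesn't make obvious is Devel::Cycle, a CPAN module (not part of core Perl, and it's an assumption here that it would have been usable on this setup). Its find_cycle() walks a structure and reports each reference loop it finds:

```perl
use strict;
use warnings;
use Devel::Cycle;          # CPAN module, not in core Perl

my $status = { stage => 'check' };
$status->{self} = $status;   # deliberate cycle for the demo
find_cycle($status);         # prints the $status -> {self} -> $status loop
```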

I'm going to commit a change that makes PMS::finish() do:

%{$self} = ();

which deletes all of the keys without requiring the foreach() loop that we
used previously, so I hope this solves JM's original issue of that being
inefficient.
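For readers less familiar with Perl: `%{$self} = ()` assigns the empty list to the dereferenced hash, dropping every key and value in one operation while leaving the (blessed) hash itself intact. A standalone sketch of how it differs from per-key deletion (toy data, not the actual PMS):

```perl
use strict;
use warnings;

my $self = { foo => 1, bar => 2 };
$self->{loop} = $self;        # the kind of cycle a plugin might create

# Per-key deletion only clears the keys you know about:
delete $self->{foo};

# Emptying the whole hash in place drops everything, including
# the self-reference, so the cycle is broken:
%{$self} = ();
print scalar(keys %$self), "\n";   # 0
```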

I do think that we still have leaks elsewhere since the process memory
continues to grow, but it's much lower than before.  I'll keep digging.

-- 
Randomly Generated Tagline:
It looked like something resembling white marble, which was
 probably what it was: something resembling white marble.
 		-- Douglas Adams, "The Hitchhiker's Guide to the Galaxy"