Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/10/19 21:40:53 UTC

[Bug 5141] New: ArchiveIterator::message_array() etc keep file list in memory

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5141

           Summary: ArchiveIterator::message_array() etc keep file list in
                    memory
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: major
          Priority: P5
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: felicity@apache.org


For the past 2 weeks my nightly runs have been running out of memory (OOM)
during the scan phase of mass-check.  After finally doing some debugging, it
turns out the problem is that I've accepted more spam than I used to, so storing
all the records in memory goes over the process's limit.

i.e.: During the scan phase, ArchiveIterator stores the spam/ham message
listings in two array references, $self->{s} and $self->{h}, then merges them
into a @messages array (so until the two reference vars are undef'ed, that's 2x
the memory usage), does more processing, then returns that array for the run
phase.

This isn't really an issue most of the time because spamassassin and sa-learn
typically don't process a huge number of messages.  mass-check, however, can
process several hundred thousand (or more) messages at a time, and keeping all
that information in memory can cause OOMs.

So I suggest two things:

- I'm going to commit a patch shortly which at least cuts the memory usage for
"mass-check -n" down a bit so that my nightly runs can actually run.

- We ought to use temp files for the ham/spam arrays, and then process out to a
third temp file.  That way, the memory use will be minimal, and mass-check can
stop doing the "fork a process for scanning" thing, and everything will be happier.

Note: this assumes that there's enough temp disk space to store the indexes, but
that's much more likely than having enough RAM IMO.
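
A rough sketch of the temp-file idea above, written in Python purely for
illustration (ArchiveIterator itself is Perl).  The function names, the
one-path-per-line file layout, and the simple interleaving order are all
assumptions made for the example, not what ArchiveIterator does.  The point is
only that the merge reads one line from each index at a time, so memory use
stays flat regardless of corpus size.

```python
import os
import tempfile


def write_index(paths, directory=None):
    """Write one message path per line to a temp file and return its name.

    Assumes paths never contain a newline (hypothetical file layout).
    """
    fd, name = tempfile.mkstemp(prefix="ai-index-", suffix=".txt", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return name


def merge_indexes(spam_index, ham_index, directory=None):
    """Interleave spam and ham entries into a third temp file, line by line.

    Each output line is tagged "s" or "h" so the run phase can tell the
    classes apart.  Only one line per input is held in memory at a time.
    """
    fd, merged = tempfile.mkstemp(prefix="ai-merged-", suffix=".txt", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out, \
            open(spam_index, encoding="utf-8") as spam, \
            open(ham_index, encoding="utf-8") as ham:
        while True:
            s = spam.readline()
            h = ham.readline()
            if not s and not h:
                break
            if s:
                out.write("s\t" + s)
            if h:
                out.write("h\t" + h)
    return merged
```

The run phase could then read the merged file sequentially instead of walking
an in-memory array.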



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.






------- Additional Comments From parkerm@pobox.com  2006-10-25 09:36 -------
If temp files are used there needs to be a mechanism so that we can specify
where those files are created/stored.  Also, some effort to make them unique per
instance would be good.
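
One possible shape for both requirements, sketched in Python for illustration
only.  The helper name and the MASSCHECK_TMPDIR environment variable are
hypothetical; nothing in mass-check defines them.  The directory comes from the
caller (or the environment), and uniqueness is left to mkstemp(), which creates
the file atomically with a random suffix, with the PID added to the prefix so
files are easy to attribute to an instance.

```python
import os
import tempfile


def open_index_file(tmpdir=None, label="spam"):
    """Create a uniquely named index file in a caller-chosen directory.

    Returns (file_object, path).  MASSCHECK_TMPDIR is a hypothetical
    environment variable used here only as an example of a fallback.
    """
    directory = (
        tmpdir
        or os.environ.get("MASSCHECK_TMPDIR")
        or tempfile.gettempdir()
    )
    # mkstemp() guarantees a fresh file; the PID in the prefix is only
    # there to make it obvious which instance owns the file.
    fd, path = tempfile.mkstemp(
        prefix=f"masscheck-{label}-{os.getpid()}-", dir=directory
    )
    return os.fdopen(fd, "w+", encoding="utf-8"), path
```

The caller is responsible for removing the file when the run finishes.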








------- Additional Comments From felicity@apache.org  2006-11-27 19:44 -------
(In reply to comment #1)
> temp files would be good.
> 
> alternatives:
> 
> - use RAM for the first N entries, then "page out" the remainder into temp files
> 
> - use a single delimiter-separated string in RAM; strings are much more
> RAM-efficient than perl hashes or arrays

yeah, I had a few thoughts about how to do it.  the useful thoughts involved:

- pass in function callbacks such that mass-check and spamassassin/sa-learn can
function differently.  for example, only mass-check cares about opt_n,
after/before, etc.

- if we're going to use temp files, we should generally be able to handle any
amount of input.  I'm worried about the performance penalty of doing everything
in temp files, so yeah, the idea would be to churn through 50-100k entries in
memory, then shove the rest out to a temp file.  that way a small mass-check
will still be all in memory, but larger ones will function appropriately.
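
A minimal sketch of the "in memory up to a limit, then spill" behaviour, in
Python for illustration only.  The class name, the default limit, and the
one-entry-per-line overflow format are assumptions for the example.  This
variant keeps the first N entries in memory and writes everything after that to
a temp file; flushing whole batches to disk would be an equally valid reading
of the idea.

```python
import os
import tempfile


class SpillingList:
    """Keep up to `limit` entries in memory; append the rest to a temp file.

    Assumes entries are strings that contain no newline.
    """

    def __init__(self, limit=50_000, directory=None):
        self.limit = limit
        self.directory = directory
        self.memory = []
        self.overflow = None  # temp file, created lazily on first spill
        self.count = 0

    def append(self, entry):
        if len(self.memory) < self.limit:
            self.memory.append(entry)
        else:
            if self.overflow is None:
                self.overflow = tempfile.TemporaryFile(
                    "w+", encoding="utf-8", dir=self.directory
                )
            # Always write at the end, even if a reader moved the position.
            self.overflow.seek(0, os.SEEK_END)
            self.overflow.write(entry + "\n")
        self.count += 1

    def __len__(self):
        return self.count

    def __iter__(self):
        # In-memory entries first, then the spilled ones in insertion order.
        yield from self.memory
        if self.overflow is not None:
            self.overflow.seek(0)
            for line in self.overflow:
                yield line.rstrip("\n")
```

With this shape a small run never touches the disk, since the overflow file is
only created once the limit is exceeded.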

What I haven't figured out yet is the algorithm by which to handle the multiple
message pools.  It's pretty straightforward I think, though head and tail seem
problematic.





felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.2.0                       |Future




------- Additional Comments From felicity@apache.org  2006-12-05 07:11 -------
This isn't really necessary for 3.2, just something to get done at some point,
so moving to future.

I did add in function callbacks to the scan phase of AI, so the coding should be
able to happen in mass-check only now. :)





felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.2.0











------- Additional Comments From jm@jmason.org  2006-10-25 09:22 -------
temp files would be good.

alternatives:

- use RAM for the first N entries, then "page out" the remainder into temp files

- use a single delimiter-separated string in RAM; strings are much more
RAM-efficient than perl hashes or arrays
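
To illustrate the packed-string alternative, here is a small Python sketch.  It
is an analogy only: the claim above is about Perl's per-element overhead, and
the numbers from this code describe CPython, not Perl.  The function names and
the newline delimiter are assumptions for the example.  The idea is the same in
both languages: one large string carries a single object header, while a list
pays a header per entry.

```python
import sys


def list_footprint(entries):
    """Approximate bytes used by a list of strings (container plus elements)."""
    return sys.getsizeof(entries) + sum(sys.getsizeof(e) for e in entries)


def packed_footprint(entries, delimiter="\n"):
    """Approximate bytes used by the same entries joined into one string."""
    return sys.getsizeof(delimiter.join(entries))


def iter_packed(packed, delimiter="\n"):
    """Yield entries from a packed string one at a time, without building a list."""
    start = 0
    while start <= len(packed):
        end = packed.find(delimiter, start)
        if end == -1:
            end = len(packed)
        yield packed[start:end]
        start = end + len(delimiter)
```

The trade-off is that random access and sorting become awkward, so this fits
best when the listing is only ever walked front to back.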


