You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2012/04/23 21:32:57 UTC

[Bug 6793] New: PATCH reduce sa-awl memory usage

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

             Bug #: 6793
           Summary: PATCH reduce sa-awl memory usage
           Product: Spamassassin
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Tools
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: vitalyb@telenet.dn.ua
    Classification: Unclassified


Created attachment 5050
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5050
patch

Hello,

current version of sa-awl loads full database key list to memory before showing
any
stats or performing maintenance. I believe it's obvious that this behavior is
undesirable and makes large databases impossible to handle.

The patch below improves sa-awl scaling and responsiveness by scanning database
row-by-row basis instead of loading all keys to memory first.

Tested cleaning db with over 8 million rows.

For a cached db with 850K rows memory usage lowers from 1G to 6M, execution
time
is around 12% slower (26 vs. 29 sec), though.

I'm not a perl expert, please review.

Thanks.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5053|0                           |1
        is obsolete|                            |

--- Comment #9 from Mark Martinec <Ma...@ijs.si> 2012-04-24 10:38:29 UTC ---
Created attachment 5054
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5054
oops

(the previous patch included an unrelated patch, sorry)

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #5 from Henrik Krohns <he...@hege.li> 2012-04-24 08:57:00 UTC ---
(In reply to comment #4)
> New patch has the same root issue as original sa-awl has - potentially stores a
> lot of keys in memory.

>From what I understand, the amount of deleted keys normally would be very small
compared to previous code which arrayed _all_ keys. So there really isn't an
issue here?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5055|0                           |1
        is obsolete|                            |

--- Comment #18 from Mark Martinec <Ma...@ijs.si> 2012-04-24 15:27:18 UTC ---
Created attachment 5056
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5056
how about this then

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #20 from Vitaly V. Bursov <vi...@telenet.dn.ua> 2012-04-24 16:37:39 UTC ---
Well, I haven't found any more issues.

Performance is surprisingly good too: processed DB with >8M entries (two
database keys per entry), 7.5M of which were deleted in less than 40 minutes.

BerkeleyDB 4.8 if it matters.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #14 from Mark Martinec <Ma...@ijs.si> 2012-04-24 14:38:18 UTC ---
> 5053 and 5054 are the same regarding sa-awl
> (the 5053 contains an additional unrelated patch, no harm)

My mistake again, sorry, will fix that...
Just use the 5053.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5050|0                           |1
        is obsolete|                            |

--- Comment #2 from Mark Martinec <Ma...@ijs.si> 2012-04-23 23:13:42 UTC ---
Created attachment 5051
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5051
the promised patch

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hege@hege.li

--- Comment #3 from Henrik Krohns <he...@hege.li> 2012-04-24 06:22:26 UTC ---
+1 for mark

The old way makes no sense even as an option.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #25 from Kevin A. McGrail <km...@pccc.com> ---
Thanks Vitaly and Mark!

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5054|0                           |1
        is obsolete|                            |

--- Comment #15 from Mark Martinec <Ma...@ijs.si> 2012-04-24 14:40:09 UTC ---
Created attachment 5055
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5055
oops2, hope this diff file is the correct one

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #11 from Vitaly V. Bursov <vi...@telenet.dn.ua> 2012-04-24 13:12:25 UTC ---
Tested. Progress nearly stops after first 10000 removed keys, I guess
@delete_keys contents should be pruned.

I hope I got the correct version of the patch, as 'oops' looks unrelated too.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #17 from Mark Martinec <Ma...@ijs.si> 2012-04-24 15:20:01 UTC ---
> after
>   delete $h{$_}  for @delete_keys;
> @delete_keys contents is not needed on next for(;;) loop iteration, something
> like
> 
> -  my(@delete_keys);
>    for (;;) {
> +    my(@delete_keys);
> 
> but I haven't tested it.

Indeed, good catch. Too much hurry is never a good thing.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #16 from Vitaly V. Bursov <vi...@telenet.dn.ua> 2012-04-24 14:58:44 UTC ---
(In reply to comment #13)
> Can you clarify what you mean?

after
  delete $h{$_}  for @delete_keys;
@delete_keys contents is not needed on next for(;;) loop iteration, something
like

-  my(@delete_keys);
   for (;;) {
+    my(@delete_keys);

but I haven't tested it.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #19 from Kevin A. McGrail <km...@pccc.com> 2012-04-24 15:37:23 UTC ---
(In reply to comment #18)
> Created attachment 5056 [details]
> how about this then

The 3rd one burned, fell over into the swamp but the 4th one stood. ;-)

The logic is sound and the scoping change on delete_keys makes sense to me. 
Again, just reviewed the logic and not trying the actual patch.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unspecified                 |3.3.2
   Target Milestone|Undefined                   |3.4.0

--- Comment #21 from Mark Martinec <Ma...@ijs.si> 2012-04-24 17:35:51 UTC ---
> Well, I haven't found any more issues. 
> Performance is surprisingly good too: processed DB with >8M entries (two
> database keys per entry), 7.5M of which were deleted in less than 40 minutes.
> BerkeleyDB 4.8 if it matters.

Very good, thanks for testing!

trunk (3.4):
  Bug 6793: reduce sa-awl memory usage
Sending sa-awl.raw
Committed revision 1329877.

Can we close this now?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #23 from Vitaly V. Bursov <vi...@telenet.dn.ua> 2012-04-24 18:03:23 UTC ---
(In reply to comment #22)

> A bit off topic, but have you ever tried DB_TXN_NOSYNC ?

No, tests were performed on hardware 2 disk SATA RAID with write caching
enabled, NOSYNC shouldn't make a big difference. Most likely it was
read-congested, not write.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #7 from Vitaly V. Bursov <vi...@telenet.dn.ua> 2012-04-24 09:22:12 UTC ---
(In reply to comment #5)
> From what I understand, the amount of deleted keys normally would be very small
> compared to previous code which arrayed _all_ keys. So there really isn't an
> issue here?

Yes, makes sense, I completely missed that.

So, to avoid a memory exhaustion in a rare case when a lot of keys should be
deleted either 3a from comment #4 will do or iteration can be simply aborted
when 10000-100000 keys are collected in @delete_keys, warning message printed
and non-zero exit code returned. Just in case.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #12 from Kevin A. McGrail <km...@pccc.com> 2012-04-24 13:19:50 UTC ---
(In reply to comment #11)
> Tested. Progress nearly stops after first 10000 removed keys, I guess
> @delete_keys contents should be pruned.
> 
> I hope I got the correct version of the patch, as 'oops' looks unrelated too.

It looks like I tested 5053 and 5054.  Mark, can you clarify?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #6 from Henrik Krohns <he...@hege.li> 2012-04-24 09:08:38 UTC ---
(In reply to comment #5)
> (In reply to comment #4)
> > New patch has the same root issue as original sa-awl has - potentially stores a
> > lot of keys in memory.
> 
> From what I understand, the amount of deleted keys normally would be very small
> compared to previous code which arrayed _all_ keys. So there really isn't an
> issue here?

Nothing against making it safe for the worst scenario though.. I would go for
KISS method and just restart when @deleted_keys is over 100k-1M keys. I doubt
anyone would encounter the limit in normal usage and it's pretty much 0-cpu
operation to check array size in the loop. If someone is deleting that much
keys, probably he's deleting most of them, thus restarting pretty much starts
where it was left at..

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5051|0                           |1
        is obsolete|                            |

--- Comment #8 from Mark Martinec <Ma...@ijs.si> 2012-04-24 10:22:27 UTC ---
Created attachment 5053
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5053
do it in chunks if huge

> Nothing against making it safe for the worst scenario though.. I would go
> for KISS method and just restart when @deleted_keys is over 100k-1M keys.

Fully agree. I was considering this, but the hour was late...

A new patch attached.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|3.3.2                       |unspecified
   Target Milestone|3.4.0                       |Undefined

--- Comment #22 from Henrik Krohns <he...@hege.li> 2012-04-24 17:39:07 UTC ---

First of all, +1 but didn't look so indepth this time either. Enough to me that
it's pretty well tested already. :-)


(In reply to comment #20)
> Well, I haven't found any more issues.
> 
> Performance is surprisingly good too: processed DB with >8M entries (two
> database keys per entry), 7.5M of which were deleted in less than 40 minutes.
> 
> BerkeleyDB 4.8 if it matters.

A bit off topic, but have you ever tried DB_TXN_NOSYNC ? Should add
considerable performance, and the "risks" are more than fine for this use..

You could try adding it in this line in BDB.pm:
my $flags = DB_INIT_LOCK|DB_INIT_LOG|DB_INIT_MPOOL|DB_INIT_TXN;

In fact, I think the flags should be user configurable to some degree..

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com

--- Comment #10 from Kevin A. McGrail <km...@pccc.com> 2012-04-24 12:48:28 UTC ---
+1 on patch oops but I only reviewed the logic.  And I'm definitely a slow and
steady wins the race over optimization with the potential memory usage being
discussed so no worries there.

Vitaly, can you test Mark's patch by any chance?

regards,
KAM

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #24 from Mark Martinec <Ma...@ijs.si> ---
Close?

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #13 from Mark Martinec <Ma...@ijs.si> 2012-04-24 14:33:47 UTC ---
> It looks like I tested 5053 and 5054.  Mark, can you clarify?

5053 and 5054 are the same regarding sa-awl
(the 5053 contains an additional unrelated patch, no harm)

> Tested. Progress nearly stops after first 10000 removed keys,
> I guess @delete_keys contents should be pruned.

Can you clarify what you mean?

> Progress nearly stops after first 10000 removed keys.

You are probably observing the deletion phase for the
first collected set of 10000 keys, there are no printouts
during this phase. Or is it something else?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|unspecified                 |3.3.2
   Target Milestone|Undefined                   |3.4.0

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Vitaly V. Bursov <vi...@telenet.dn.ua> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vitalyb@telenet.dn.ua

--- Comment #4 from Vitaly V. Bursov <vi...@telenet.dn.ua> 2012-04-24 07:47:11 UTC ---
New patch has the same root issue as original sa-awl has - potentially stores a
lot of keys in memory. I had to reduce a DB from 8M keys to 800K and proposed 
algorithm would keep around 3M in memory keys and if 800K keys eat 1G of RAM on
x86-64.... well, not much better than original.

Looks like there are few options then.

1. Create a new DB file and replace the old one with it. Probably it's hard to
do correctly if SA is running.

2. Leave missed or duplicate keys as is - probably duplicates are harmless and
missed keys will be handled on next runs. Ugly.

3. Modify the algorithm. Few options here as well:

a) check size of @delete_keys, stop iteration if over limit, delete keys, start
all over again (very slow);

b) if we got 'totscore' key, also get and check 'count' (strip off /totscore$/
part from the key name), if it's 'count' key proceed as usual. If keys should
be deleted delete the current one and remember another in @delete_keys. The
trick is that keys should not be deleted afterwards all at once but on every
iteration key name should be checked if it's in @delete_keys and if so, this
key must be deleted and removed from @delete_keys. The size of @delete_keys
should be checked also to keep it from consuming too much memory.

c) store keys that should be deleted in another on-disk DB.

Few more thoughts.

It's hard to predict how efficient method 3b going to be compared to 3a as keys
are randomly sorted (I think), probably it's best to have two implementations
and make it an option - the 'fast' one but not entirely correct for huge DBs
(like 2) and, default, correct one and not so fast (like 3b) for small-medium
sized DBs.

Hope this helps,
Thanks.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6793] PATCH reduce sa-awl memory usage

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

--- Comment #1 from Mark Martinec <Ma...@ijs.si> 2012-04-23 23:12:10 UTC ---
> current version of sa-awl loads full database key list to memory before
> showing any stats or performing maintenance. I believe it's obvious that
> this behavior is undesirable and makes large databases impossible to handle

I fully agree. Operations should not depend on being able to load a
complete database into memory if the size of a database is potentially
unlimited.

> The patch below improves sa-awl scaling and responsiveness by scanning
> database row-by-row basis instead of loading all keys to memory first.
> Tested cleaning db with over 8 million rows.
> For a cached db with 850K rows memory usage lowers from 1G to 6M,

Great, thanks!

> execution time is around 12% slower (26 vs. 29 sec), though.

Despite Kevin's concern on the ML, I don't think this is worth
introducing yet another option (and risking people using it
without fully understanding the implications).

One can make a program arbitrarily fast
if correct and reliable operation is not mandated  :)

> I'm not a perl expert, please review.

My only concern is deleting a key while the each() operator is active.

Please see my alternative patch.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.