Posted to dev@apr.apache.org by Neil Conway <nr...@cs.berkeley.edu> on 2009/09/14 21:55:24 UTC

apr_hash: Memory consumption

I notice that the implementation of expand_array() in
tables/apr_hash.c allocates a new bucket array, without making any
attempt to release the memory allocated for the previous bucket array.
That wastes memory: if the hash table grows exponentially, this
strategy wastes O(n) extra memory.

Is there a rationale for the current behavior? If not, perhaps the
easiest fix is to allocate the bucket array in a subpool, and then
create a new subpool and destroy the old one in expand_array(). If a
patch to do this would be accepted, let me know and I'll submit one.

Neil

Re: apr_hash: Memory consumption

Posted by Graham Leggett <mi...@sharp.fm>.
Hi,

> I notice that the implementation of expand_array() in
> tables/apr_hash.c allocates a new bucket array, without making any
> attempt to release the memory allocated for the previous bucket array.
> That wastes memory: if the hash table grows exponentially, this
> strategy wastes O(n) extra memory.
> 
> Is there a rationale for the current behavior? If not, perhaps the
> easiest fix is to allocate the bucket array in a subpool, and then
> create a new subpool and destroy the old one in expand_array(). If a
> patch to do this would be accepted, let me know and I'll submit one.

If the patch improves performance, then +1 to that.

Regards,
Graham
--

Re: apr_hash: Memory consumption

Posted by Ruediger Pluem <rp...@apache.org>.

On 09/22/2009 11:05 PM, Neil Conway wrote:
> On Tue, Sep 22, 2009 at 1:21 PM, Graham Leggett <mi...@sharp.fm> wrote:
>> Thanks for this, I've committed it to trunk, and backported it to 1.4
>> and 1.3.
> 
> Whoops -- the patch had a boneheaded typo in the changes to
> apr_hash_copy(). Attached is a fix for the bug.
> 
> Sorry, my mistake.

Thanks for noting and supplying a patch. Fixed:

http://svn.apache.org/viewvc?rev=817858&view=rev
http://svn.apache.org/viewvc?rev=817860&view=rev
http://svn.apache.org/viewvc?rev=817861&view=rev

Regards

Rüdiger


Re: apr_hash: Memory consumption

Posted by Neil Conway <nr...@eecs.berkeley.edu>.
On Tue, Sep 22, 2009 at 1:21 PM, Graham Leggett <mi...@sharp.fm> wrote:
> Thanks for this, I've committed it to trunk, and backported it to 1.4
> and 1.3.

Whoops -- the patch had a boneheaded typo in the changes to
apr_hash_copy(). Attached is a fix for the bug.

Sorry, my mistake.

Neil

Re: apr_hash: Memory consumption

Posted by Graham Leggett <mi...@sharp.fm>.
Neil Conway wrote:

> Fair enough. The one place this doesn't work is when expanding the
> bucket array in apr_hash_set() (since it returns void), but in that
> case we can just skip the expansion if apr_pool_create() fails,
> and retry it next time.
> 
> Attached is a revised patch -- thanks for the feedback.

Thanks for this, I've committed it to trunk, and backported it to 1.4
and 1.3.

Regards,
Graham
--

Re: apr_hash: Memory consumption

Posted by Neil Conway <nr...@eecs.berkeley.edu>.
On Sat, Sep 19, 2009 at 10:09 AM, Graham Leggett <mi...@sharp.fm> wrote:
> Hmmm. I think it's reasonable to return NULL in the failure cases, where
> NULL means "out of memory".

Fair enough. The one place this doesn't work is when expanding the
bucket array in apr_hash_set() (since it returns void), but in that
case we can just skip the expansion if apr_pool_create() fails,
and retry it next time.

Attached is a revised patch -- thanks for the feedback.

Neil

Re: apr_hash: Memory consumption

Posted by Graham Leggett <mi...@sharp.fm>.
Neil Conway wrote:

> Unfortunately, the APR hash table API doesn't provide any means to
> report errors, so we can't easily propagate apr_pool_create() failures
> back to the client program. Presuming changing the APR hash API isn't
> an option for 1.3.x / 1.4.x, is there a better solution than just
> ignoring apr_pool_create() failures?

Hmmm. I think it's reasonable to return NULL in the failure cases, where
NULL means "out of memory".

Regards,
Graham
--

Re: apr_hash: Memory consumption

Posted by Neil Conway <nr...@cs.berkeley.edu>.
On Mon, Sep 14, 2009 at 7:43 PM, Bing Swen <bs...@pku.edu.cn> wrote:
> We'd love to see such a patch. We've been hampered by this problem for years.

Attached is a patch that implements the scheme I suggested earlier
(allocating the bucket array in a subpool). The patch is against the
1.3 branch.

Unfortunately, the APR hash table API doesn't provide any means to
report errors, so we can't easily propagate apr_pool_create() failures
back to the client program. Presuming changing the APR hash API isn't
an option for 1.3.x / 1.4.x, is there a better solution than just
ignoring apr_pool_create() failures?

I've also attached a trivial benchmark program that performs 10
million insertions. Vanilla APR 1.3 branch (r813588):

./testhashperf  0.42s user 0.10s system 96% cpu 0.537 total
./testhashperf  0.42s user 0.09s system 97% cpu 0.526 total
./testhashperf  0.42s user 0.09s system 97% cpu 0.527 total
./testhashperf  0.42s user 0.09s system 97% cpu 0.526 total
./testhashperf  0.42s user 0.10s system 98% cpu 0.526 total
./testhashperf  0.42s user 0.10s system 98% cpu 0.527 total

With the patch applied:

./testhashperf  0.39s user 0.10s system 97% cpu 0.501 total
./testhashperf  0.39s user 0.10s system 98% cpu 0.501 total
./testhashperf  0.39s user 0.09s system 97% cpu 0.502 total
./testhashperf  0.39s user 0.10s system 97% cpu 0.501 total
./testhashperf  0.39s user 0.10s system 98% cpu 0.501 total
./testhashperf  0.39s user 0.10s system 97% cpu 0.501 total

(OSX 10.6, gcc 4.2.1, "-O2")

> The easiest remedy we've been using is to add a new "API" function like this:
>
> APR_DECLARE(apr_hash_t *) apr_hash_make_ex(apr_pool_t *pool, unsigned int init_size)

Yeah, I added something similar to my local copy in the meantime.
When the hash table size is predictable, this avoids the excess memory
consumption, but can also improve performance significantly: I've
found expand_array() to be quite expensive when a hash table grows
from empty to a few million entries. Such an API might be worth
considering for APR 1.4 or 2.0.

> Also, we'd like very much to see the possibility whether apr_hash_t could be extended to make use of the "move-to-front" technique that is presented by the following papers:
>
>        "In-memory hash tables for accumulating text vocabularies", J. Zobel, S. Heinz, and H.E. Williams, Information Processing Letters, 80(6):271-277, December 2001. (http://www.cs.rmit.edu.au/~jz/fulltext/ipl01.pdf)
>
>        "Cache-conscious collision resolution in string hash tables", N. Askitis and J. Zobel, Proceedings of the SPIRE String Processing and Information Retrieval Symposium, November 2005, pp. 91-102. LNCS 3772.  (http://www.cs.rmit.edu.au/~jz/fulltext/spire05.pdf)

Seems like it would be easy to do, but I wonder if it is worth
including in APR: this technique is only sensible when the hash
workload has significant temporal locality.

Neil

RE: apr_hash: Memory consumption

Posted by Bing Swen <bs...@pku.edu.cn>.
We'd love to see such a patch. We've been hampered by this problem for years: we use apr_hash_t to look up several in-memory lexicons, each with over half a million words, and the slot array grows step by step from its initial 2^n - 1 size (n = 4). Because we have to keep all of the hash lexicons in persistent pools, the abandoned arrays waste far too much memory. The easiest remedy we've been using is to add a new "API" function like this:

APR_DECLARE(apr_hash_t *) apr_hash_make_ex(apr_pool_t *pool, unsigned int init_size)
{
    apr_hash_t *ht;
    ht = apr_palloc(pool, sizeof(apr_hash_t));
    ht->pool = pool;
    ht->free = NULL;
    ht->count = 0;
    ht->max = init_size;  /* caller-supplied instead of the fixed INITIAL_MAX */
    ht->array = alloc_array(ht, ht->max);
    ht->hash_func = apr_hashfunc_default;
    return ht;
}

which resembles the other two APR functions:

	APR_DECLARE(apr_table_t *) apr_table_make(apr_pool_t *p, int nelts);
	APR_DECLARE(apr_array_header_t *) apr_array_make(apr_pool_t *p, int nelts, int elt_size);

Also, we'd very much like to see whether apr_hash_t could be extended to make use of the "move-to-front" technique presented in the following papers:

	"In-memory hash tables for accumulating text vocabularies", J. Zobel, S. Heinz, and H.E. Williams, Information Processing Letters, 80(6):271-277, December 2001. (http://www.cs.rmit.edu.au/~jz/fulltext/ipl01.pdf)
 
	"Cache-conscious collision resolution in string hash tables", N. Askitis and J. Zobel, Proceedings of the SPIRE String Processing and Information Retrieval Symposium, November 2005, pp. 91-102. LNCS 3772.  (http://www.cs.rmit.edu.au/~jz/fulltext/spire05.pdf)

Cheers,
Bing Swen



-----Original Message-----
From: Neil Conway [mailto:nrc@cs.berkeley.edu] 
Sent: Tuesday, September 15, 2009 3:55 AM
To: dev@apr.apache.org
Subject: apr_hash: Memory consumption

I notice that the implementation of expand_array() in
tables/apr_hash.c allocates a new bucket array, without making any
attempt to release the memory allocated for the previous bucket array.
That wastes memory: if the hash table grows exponentially, this
strategy wastes O(n) extra memory.

Is there a rationale for the current behavior? If not, perhaps the
easiest fix is to allocate the bucket array in a subpool, and then
create a new subpool and destroy the old one in expand_array(). If a
patch to do this would be accepted, let me know and I'll submit one.

Neil