You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Joydeep Sen Sarma <js...@facebook.com> on 2009/02/03 03:04:50 UTC

better caching for UDF states in map-side group bys

So I had this thought - why not use arrays of primitive types to store UDF state instead of objects.

The background is that if one stores int [] intarray - then java uses 4 bytes for each additional element in the array (verified). Instead if one stores an array of objects that store an int - then there seems to be about 16-20 bytes of extra overhead per object (not sure precisely - this is what it seems on my limited experiments).

So imagine that:
-          we maintained states for UDFs in primitive arrays (this is the UDFs responsibility)
-          we had a customized HashMap implementation that stored an index (int) as value for a key (keys are still objects - but values are just 4 byte ints)
o        looked at the jdk source - this seems straightforward
-          to update state - we give the index to the evaluator. The evaluator can then index into whatever arrays it maintains and do whatever it wants. If it allocates a new slot in the array - then it can return the allocated index back to the framework (to store against the key)

this way - at least we can get rid of the object overhead from the value part of the hashmap.

Somewhat hacky (getting Java to work like C) - but this can be made to work I think.

There is the issue of managing the free slots on the array (which are created on a flush) - but I think we can overlap a free list on top of the primitive array (say every free slot stores the index of the next free slot. When slots are freed - we can chain the new free slots into the existing head of the free list).

Thoughts?

RE: better caching for UDF states in map-side group bys

Posted by Joydeep Sen Sarma <js...@facebook.com>.
Currently the flush can only flush part of the hashmap (under memory pressure). This makes sense (especially combined with a lfu replacement strategy). So based on this - the free slot management would be necessary.

If we can make the serialization fast enough - key serialization could a big win ..

-----Original Message-----
From: Zheng Shao [mailto:zshao9@gmail.com] 
Sent: Monday, February 02, 2009 7:36 PM
To: hive-dev@hadoop.apache.org
Subject: Re: better caching for UDF states in map-side group bys

Yeah that will only work with a customized HashMap, but it's definitely
possible to do.
We might want to serialize the key to byte[] as well (using
TBinarySortableProtocol etc).

The free slot management does not seem to be a necessity, since when we
flush all slots will be free.

Zheng

On Mon, Feb 2, 2009 at 6:04 PM, Joydeep Sen Sarma <js...@facebook.com>wrote:

> So I had this thought - why not use arrays of primitive types to store UDF
> state instead of objects.
>
> The background is that if one stores int [] intarray - then java uses 4
> bytes for each additional element in the array (verified). Instead if one
> stores an array of objects that store an int - then there seems to be about
> 16-20 bytes of extra overhead per object (not sure precisely - this is what
> it seems on my limited experiments).
>
> So imagine that:
> -          we maintained states for UDFs in primitive arrays (this is the
> UDFs responsibility)
> -          we had a customized HashMap implementation that stored an index
> (int) as value for a key (keys are still objects - but values are just 4
> byte ints)
> o        looked at the jdk source - this seems straightforward
> -          to update state - we give the index to the evaluator. The
> evaluator can then index into whatever arrays it maintains and do whatever
> it wants. If it allocates a new slot in the array - then it can return the
> allocated index back to the framework (to store against the key)
>
> this way - at least we can get rid of the object overhead from the value
> part of the hashmap.
>
> Somewhat hacky (getting Java to work like C) - but this can be made to work
> I think.
>
> There is the issue of managing the free slots on the array (which are
> created on a flush) - but I think we can overlap a free list on top of the
> primitive array (say every free slot stores the index of the next free slot.
> When slots are freed - we can chain the new free slots into the existing
> head of the free list).
>
> Thoughts?
>



-- 
Yours,
Zheng

Re: better caching for UDF states in map-side group bys

Posted by Zheng Shao <zs...@gmail.com>.
Yeah that will only work with a customized HashMap, but it's definitely
possible to do.
We might want to serialize the key to byte[] as well (using
TBinarySortableProtocol etc).

The free slot management does not seem to be a necessity, since when we
flush all slots will be free.

Zheng

On Mon, Feb 2, 2009 at 6:04 PM, Joydeep Sen Sarma <js...@facebook.com>wrote:

> So I had this thought - why not use arrays of primitive types to store UDF
> state instead of objects.
>
> The background is that if one stores int [] intarray - then java uses 4
> bytes for each additional element in the array (verified). Instead if one
> stores an array of objects that store an int - then there seems to be about
> 16-20 bytes of extra overhead per object (not sure precisely - this is what
> it seems on my limited experiments).
>
> So imagine that:
> -          we maintained states for UDFs in primitive arrays (this is the
> UDFs responsibility)
> -          we had a customized HashMap implementation that stored an index
> (int) as value for a key (keys are still objects - but values are just 4
> byte ints)
> o        looked at the jdk source - this seems straightforward
> -          to update state - we give the index to the evaluator. The
> evaluator can then index into whatever arrays it maintains and do whatever
> it wants. If it allocates a new slot in the array - then it can return the
> allocated index back to the framework (to store against the key)
>
> this way - at least we can get rid of the object overhead from the value
> part of the hashmap.
>
> Somewhat hacky (getting Java to work like C) - but this can be made to work
> I think.
>
> There is the issue of managing the free slots on the array (which are
> created on a flush) - but I think we can overlap a free list on top of the
> primitive array (say every free slot stores the index of the next free slot.
> When slots are freed - we can chain the new free slots into the existing
> head of the free list).
>
> Thoughts?
>



-- 
Yours,
Zheng