Posted to user@hive.apache.org by Aaron McCurry <am...@gmail.com> on 2010/04/11 21:48:35 UTC

Cluster By Algorithm?

I have a search solution that is downstream of some Netezza data marts that
I'm replacing with a Hive solution.  We already partition the data for the
search solution 32 ways, and I would like to take advantage of the data
clustering in Hive (buckets) so that I don't have to do any post-processing.
Is there documentation that describes how the data is hashed or how it's
organized across the buckets?  Or could someone point me to a class that
implements it?  Thanks!

Aaron

Re: Cluster By Algorithm?

Posted by Aaron McCurry <am...@gmail.com>.

Thanks a lot!  I figured it was that simple.

Aaron

On Sun, Apr 11, 2010 at 5:16 PM, Zheng Shao <zs...@gmail.com> wrote:

> It's as simple as taking the hash code of the key modulo the number of
> reducers. To get started, try any of the .q files in the clientpositive
> directory.
>
> On the code side, HiveKey.java has the implementation.

Re: Cluster By Algorithm?

Posted by Zheng Shao <zs...@gmail.com>.

It's as simple as taking the hash code of the key modulo the number of
reducers. To get started, try any of the .q files in the clientpositive
directory.

On the code side, HiveKey.java has the implementation.
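
For illustration only, here is a minimal Java sketch of the "hash code modulo the number of reducers" idea described above. It is not taken from HiveKey.java: the class name BucketSketch and the method bucketFor are hypothetical, and using Java's own hashCode() is an assumption. The hash Hive computes for a given column type may differ, so check the source for your Hive version before relying on the bucket numbers this produces.

```java
// Hypothetical sketch: assign a key to one of numBuckets buckets by
// taking its hash code modulo the bucket count.
final class BucketSketch {

    // Returns a bucket index in the range [0, numBuckets).
    // Assumption: the key's Java hashCode() stands in for whatever
    // hash Hive actually computes for the clustering column.
    static int bucketFor(Object key, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = (key == null) ? 0 : key.hashCode();
        // Clear the sign bit so a negative hash code cannot
        // produce a negative bucket index.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 32; // matches the 32-way split mentioned in the question
        for (String key : new String[] {"alpha", "beta", "gamma"}) {
            System.out.println(key + " -> bucket " + bucketFor(key, numBuckets));
        }
    }
}
```

Masking the sign bit rather than calling Math.abs avoids the edge case where Math.abs(Integer.MIN_VALUE) is still negative.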