Posted to user@hive.apache.org by Echo Li <ec...@gmail.com> on 2013/02/21 01:19:43 UTC

bucketing on a column with millions of unique IDs

Hi guys,

I plan to bucket a table by "userid" as I'm going to do intense calculation
using "group by userid". There are about 110 million rows, with 7 million
unique userids, so my question is: what is a good number of buckets for this
scenario, and how do I determine the number of buckets?

Any input is appreciated :)

Echo
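
For reference, a minimal sketch of the kind of DDL being described here;
aside from "userid", the table and column names are hypothetical, and the
512-bucket count is only a placeholder pending the sizing advice in the
reply below:

    -- Hypothetical bucketed table; only "userid" comes from the thread.
    CREATE TABLE user_events (
      userid      BIGINT,
      event_time  STRING,
      payload     STRING
    )
    CLUSTERED BY (userid) INTO 512 BUCKETS;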

Re: bucketing on a column with millions of unique IDs

Posted by be...@yahoo.com.
Hi Li

The major consideration is the size of each bucket. One bucket corresponds to a file in HDFS, so you should ensure that every bucket is at least one HDFS block in size, or at worst that the majority of buckets are.

So you should derive the number of buckets from the total data size rather than from the number of rows/records.

Regards 
Bejoy KS

Sent from remote device, Please excuse typos
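
To make that sizing rule concrete, here is a hypothetical worked example;
the ~60 GB table size and 128 MB block size are assumptions for
illustration, not figures from the thread:

    60 GB / 128 MB per block ~= 480 buckets, rounded up to 512
    (a power of two is conventional for bucket counts, though not required)

    -- On Hive versions of this era, set this so INSERT writes honor
    -- the declared bucket count:
    SET hive.enforce.bucketing = true;

    -- Populate the bucketed table (table names are hypothetical):
    INSERT OVERWRITE TABLE user_events
    SELECT userid, event_time, payload
    FROM user_events_raw;

With ~512 buckets over 7 million unique userids, each bucket would hold
roughly 13-14 thousand distinct userids hashed together, and each bucket
file would sit comfortably above one HDFS block.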
