Posted to user@hive.apache.org by Xun TANG <ta...@gmail.com> on 2012/07/11 04:08:22 UTC

How to Decide How Many Buckets to Use?

Hi,

We use Hive to store user click logs and dig out interesting trends, doing
a lot of big table joins.
We mostly partition by date, and inside each partition data is bucketed by
some hashed-id.

We are wondering how many buckets to use.
If I understand correctly, Hive stores each bucket in one file. One file
could be broken down into several buckets based on bucket size.
Is this one of the criteria for determining bucket size, such that each file
fits in one block? What else should be considered?

Is there a rule of thumb for selecting how many buckets to use? Any
insights or comments are welcome!
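
The sizing criterion asked about above (one bucket file per block) can be
sketched as simple arithmetic. This is only an illustration of that one
criterion, not a recommendation from this thread. The function name, the
128 MiB default block size, and the optional rounding up to a power of two
are all assumptions made for the example; the rounding is a convention
assumed here so that bucket counts of different tables tend to be multiples
of one another. The sketch also assumes the hashed id spreads rows evenly
across buckets.

```python
def buckets_for_partition(partition_bytes, block_bytes=128 * 1024 * 1024,
                          power_of_two=True):
    """Estimate a bucket count so that each bucket file is at most about
    one block, assuming rows hash evenly across buckets.

    All names and defaults here are hypothetical; check the block size
    actually configured on your cluster.
    """
    if partition_bytes <= 0 or block_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: the smallest count whose bucket files
    # each fit within one block.
    n = -(-partition_bytes // block_bytes)
    if power_of_two:
        # Round up to the next power of two (assumed convention).
        n = 1 << (n - 1).bit_length()
    return n


if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3
    print(buckets_for_partition(ten_gib, power_of_two=False))
    print(buckets_for_partition(ten_gib))
```

For a 10 GiB daily partition and the assumed 128 MiB block size, the plain
ceiling gives 80 buckets and the power-of-two variant gives 128.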

Thanks,
Xun

Re: How to Decide How Many Buckets to Use?

Posted by Mark Grover <mg...@oanda.com>.
Hi Xun,
Here is an answer I had given earlier on the mailing list. See if that helps.

http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3C350967547.114894.1335230199385.JavaMail.root@sms-zimbra-message-store-03.sms.scalar.ca%3E

Mark
