You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by KayVajj <va...@gmail.com> on 2014/06/26 19:55:25 UTC

Determining the number of buckets in hive

Hi,

I know this question might have been beaten to death, but I could not find
an answer to a particular question. I'm using hive 0.10.x

I have a table partitioned on day and I would like to bucket the table on a
different column to avail of the SMB join optimization. I have seen an
earlier thread around determining the number of buckets.

http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3C350967547.114894.1335230199385.JavaMail.root@sms-zimbra-message-store-03.sms.scalar.ca%3E

It looks like for my purposes 4 or  8 buckets should suffice for my current
data load. But the volume is projected to increase significantly. This I
see is causing a problem.

What I see is hive in a multi stage query in the last stage creates as many
reducers as the number of buckets defined in the table spec. When the
volume increases it still  has to use this predefined number to distribute
all the data and it runs into an out of memory error.

Now I could conservatively increase the bucket size to 16 or even 32. But
for days when the volume is low, it starts creating too many small files.
Now I have conundrum of balancing the small files and join optimization.

I wanted to see if you guys have any suggestions around this.

I'm thinking if it is generating such small files, then may be SMB join
optimization is not necessary. Is that observation correct?

Would appreciate your inputs.

Thanks
K