Posted to user@hive.apache.org by "Natarajan, Prabakaran 1. (NSN - IN/Bangalore)" <pr...@nsn.com> on 2014/08/22 08:01:25 UTC

Ideal Bucket Size

Hi

How can I determine an ideal bucket size (number of buckets) for a Hive table?

Info:

1)      I have 2 billion rows in a Hive table, stored in ORC format.
2)      I want to bucket the table on column X.
3)      Column X has 100 million unique values.
4)      Reason for bucketing: I want an efficient distinct count on X, computed by my own UDAF. With the table bucketed on X, each value of X falls in exactly one bucket, so in the merge function I can just do count++ instead of merging the Set.
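
To illustrate point 4, here is a minimal Python sketch of the idea (this is not my actual UDAF). The function names are hypothetical, Python's hash-modulo is only a stand-in for Hive's bucketing hash, and the bucket count of 256 is an arbitrary candidate used for the arithmetic, not a recommendation.

```python
def bucket_of(value, num_buckets):
    # Assumption: stand-in for Hive's bucketing function (hash modulo bucket count).
    return hash(value) % num_buckets


def distinct_count_bucketed(values, num_buckets):
    # Each bucket keeps its own set of values, as a per-bucket partial aggregate would.
    buckets = [set() for _ in range(num_buckets)]
    for v in values:
        buckets[bucket_of(v, num_buckets)].add(v)
    # A given value always hashes to the same bucket, so the buckets are disjoint
    # and the "merge" step is just a sum of partial counts, with no set union needed.
    return sum(len(b) for b in buckets)


def per_bucket_estimate(total_rows, distinct_values, num_buckets):
    # Rough average rows and distinct values per bucket, assuming an even spread.
    return total_rows // num_buckets, distinct_values // num_buckets


if __name__ == "__main__":
    sample = [1, 2, 2, 3, 3, 3, 10, 10]
    print(distinct_count_bucketed(sample, 4))  # same as len(set(sample))
    # Numbers from this post, with a hypothetical candidate of 256 buckets.
    print(per_bucket_estimate(2_000_000_000, 100_000_000, 256))
```

The second function only shows how the per-bucket load changes with the bucket count; it assumes the values of X are spread evenly, which may not hold for skewed data.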

Thanks and Regards
Prabakaran.N  aka NP
Nokia Networks, Bangalore
When "I" is replaced by "We" - even Illness becomes "Wellness"