Posted to user@hive.apache.org by si...@bt.com on 2013/10/23 13:03:57 UTC

Hash partition question

Hi there,

I have created a table of numbers using CLUSTERED BY and am sampling it by bucket.

If I am selecting 10,000 candidates from ~125 million rows, how can I get a good random selection?

Should I create 12,500 buckets? Or should I create 100 buckets and then sample with TABLESAMPLE (... OUT OF 12500)?
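
To make the two options concrete, here is a rough sketch in HiveQL. The table names (numbers_12500, numbers_100) and the column name n are placeholders for illustration, not my real schema, and the row count per sample assumes the hash spreads values evenly across buckets.

```sql
-- Option 1: one bucket per sample.
-- 125,000,000 rows / 12,500 buckets is roughly 10,000 rows per bucket.
CREATE TABLE numbers_12500 (n BIGINT)
CLUSTERED BY (n) INTO 12500 BUCKETS;

SELECT * FROM numbers_12500
TABLESAMPLE (BUCKET 1 OUT OF 12500 ON n) s;

-- Option 2: fewer physical buckets, finer sampling at query time.
-- With 100 buckets, asking for 1 OUT OF 12500 reads part of one bucket.
CREATE TABLE numbers_100 (n BIGINT)
CLUSTERED BY (n) INTO 100 BUCKETS;

SELECT * FROM numbers_100
TABLESAMPLE (BUCKET 1 OUT OF 12500 ON n) s;
```

In both cases the sample is determined by the hash of n, so the same bucket number returns the same rows on every run.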

Simon