Posted to user@hive.apache.org by Jan Morlock <ja...@googlemail.com> on 2016/12/07 17:35:39 UTC

group by across multiple partitions of clustered table.

Hi,

At our company we use a Hive table that is both partitioned and
clustered, similar to the following snippet:

PARTITIONED BY (year INT, month INT, day INT, feed STRING)
CLUSTERED BY (key) INTO 1024 BUCKETS

Using this input table we regularly perform queries where we group by key
across multiple partitions.
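
As a rough illustration, such a query has the following shape. The table
name `events`, the partition filter and the aggregate are placeholders
chosen for this sketch, not our actual schema:

```sql
-- Hypothetical table name `events`; only the partition columns and the
-- bucketing column `key` are taken from the layout shown above.
SELECT key, COUNT(*) AS cnt
FROM events
WHERE year = 2016 AND month = 12   -- covers every day/feed partition of that month
GROUP BY key;
```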

Now, my questions are the following:

1. Does Hive take advantage of such a table layout so that the
group by operation is executed more efficiently (compared with a similar
table that is partitioned but not clustered)?
2. If yes, is this behaviour enabled by default, or do I have to
set certain configuration options?
3. Would it help to sort the buckets?

Our Hive version is 1.1.0.

Thank you very much in advance.

Cheers
Jan