Posted to user@hive.apache.org by KayVajj <va...@gmail.com> on 2014/01/27 05:03:21 UTC

Question regarding cluster by multiple columns

Hi,

I'm studying bucketed tables as an option for my storage. What would be a
use case where it is useful to cluster by multiple columns?

I'm trying to solve a problem of optimizing a join between two tables with
filtering.

Let's say Table A has columns (Id, country, ...) and Table B has columns
(Id, country, ...).

Note: A country could have multiple Ids.

Single column clustering

Suppose I cluster both tables by the Id column into 8 buckets.

Table A would have files FileA1, FileA2..FileA8

And similarly Table B would have FileB1..FileB8

In the case of a join on column Id, I would imagine FileA1 would be joined
with FileB1, FileA2 with FileB2, and so on. The filter on country is applied
within each join. This would avoid the need to compare FileA1 with files
other than FileB1, so I would expect a performance gain.
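
A minimal Python sketch of the bucket-to-bucket pairing described above. It is only an illustration: bucket_of is a simplified stand-in hash (non-negative integer key modulo the bucket count), not necessarily what Hive does internally, and the table rows, bucketize, and bucket_join are hypothetical names made up for this example.

```python
NUM_BUCKETS = 8

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Assumption: simplified stand-in for a bucketing hash on an integer key.
    return (key & 0x7FFFFFFF) % num_buckets

def bucketize(rows, num_buckets=NUM_BUCKETS):
    # Split (id, country) rows into one list per bucket, keyed on id only.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets

def bucket_join(buckets_a, buckets_b, country):
    # Join bucket i of A only with bucket i of B, filtering A on country.
    joined = []
    for file_a, file_b in zip(buckets_a, buckets_b):
        index_b = {}
        for id_b, country_b in file_b:
            index_b.setdefault(id_b, []).append(country_b)
        for id_a, country_a in file_a:
            if country_a != country:
                continue
            for country_b in index_b.get(id_a, []):
                joined.append((id_a, country_a, country_b))
    return joined

# Hypothetical sample data: (id, country)
table_a = [(1, "US"), (9, "US"), (2, "IN"), (17, "US")]
table_b = [(1, "US"), (9, "US"), (2, "IN"), (3, "US")]

result = bucket_join(bucketize(table_a), bucketize(table_b), "US")
print(result)
```

Because both tables are bucketed on the same key with the same bucket count, a given Id can only appear in the same-numbered file on each side, which is why the other seven files never need to be read for that pair.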


Multiple Column Clustering

How would clustering on two columns, Id and country, play out in this scenario?
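
One way to explore this question is a small Python sketch comparing bucket assignment on Id alone with bucket assignment on the pair (Id, country). The hash functions here are assumptions for illustration (a Java-style 31-multiplier string hash and a 31-multiplier combination of the column hashes); they are not confirmed to match Hive's exact implementation, and all function names are hypothetical.

```python
NUM_BUCKETS = 8

def string_hash(s):
    # Assumption: Java-style string hash, kept to 32 bits.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def bucket_by_id(row_id, num_buckets=NUM_BUCKETS):
    # Single-column bucketing: depends on the id only.
    return (row_id & 0x7FFFFFFF) % num_buckets

def bucket_by_id_and_country(row_id, country, num_buckets=NUM_BUCKETS):
    # Two-column bucketing: combine both column hashes, then take the modulo.
    combined = (31 * row_id + string_hash(country)) & 0xFFFFFFFF
    return (combined & 0x7FFFFFFF) % num_buckets

# Same id, different countries (hypothetical rows).
print(bucket_by_id(1), bucket_by_id(1))
print(bucket_by_id_and_country(1, "US"), bucket_by_id_and_country(1, "IN"))
```

Under these assumptions, rows sharing an Id but differing in country can land in different buckets once country is part of the bucketing key, so same-numbered files would only be guaranteed to line up for a join on both Id and country, not on Id alone.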


Your input is very much appreciated.

Thanks
Kishore