Posted to user@hive.apache.org by KayVajj <va...@gmail.com> on 2014/01/27 05:03:21 UTC
Question regarding cluster by multiple columns
Hi,
I'm studying bucketed tables as an option for my storage. What would be a
use case where it is useful to cluster by multiple columns?
I'm trying to solve the problem of optimizing a join between two tables with
filtering.
Let's say Table A has columns (id, country, ...) and Table B has columns
(id, country, ...).
Note: a country can have multiple ids.
Single column clustering
Suppose I cluster both tables by the id column into 8 buckets.
Table A would have files FileA1, FileA2, ..., FileA8,
and similarly Table B would have FileB1, ..., FileB8.
In the case of a join on the id column, I would imagine FileA1 would be
joined with FileB1, FileA2 with FileB2, and so on. The filter on country is
applied in each join. This would avoid the need to compare FileA1 with files
other than FileB1, which is where I see a performance gain.
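To make the idea above concrete, here is a minimal Python sketch of a
bucket-by-bucket join. It is only a toy model, not how Hive implements it:
the bucketing rule (id modulo the bucket count), the function names, and the
sample rows are all assumptions made up for illustration.

```python
from collections import defaultdict

NUM_BUCKETS = 8  # same bucket count for both tables, as in the post


def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    # Assumed toy rule: a stand-in for whatever hash the engine really uses.
    return row_id % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    # Group rows by bucket number; each group plays the role of one file.
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row["id"], num_buckets)].append(row)
    return buckets


def bucketed_join(a_rows, b_rows, country):
    # Join A and B on id, keeping only rows for the given country.
    # Bucket n of A is only ever compared with bucket n of B.
    a_buckets = bucketize(a_rows)
    b_buckets = bucketize(b_rows)
    result = []
    for n in range(NUM_BUCKETS):
        # Index the matching B bucket by id, applying the country filter.
        b_by_id = defaultdict(list)
        for b in b_buckets.get(n, []):
            if b["country"] == country:
                b_by_id[b["id"]].append(b)
        for a in a_buckets.get(n, []):
            if a["country"] != country:
                continue
            for _ in b_by_id.get(a["id"], []):
                result.append((a["id"], a["country"]))
    return result


# Hypothetical sample rows for Table A and Table B.
table_a = [
    {"id": 1, "country": "US"},
    {"id": 9, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 3, "country": "US"},
]
table_b = [
    {"id": 1, "country": "US"},
    {"id": 9, "country": "US"},
    {"id": 2, "country": "IN"},
    {"id": 4, "country": "US"},
]

us_matches = bucketed_join(table_a, table_b, "US")
```

In this toy model, ids 1 and 9 both land in bucket 1 of each table, so they
are matched without looking at any other bucket, while the country filter is
applied inside each per-bucket join.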
Multiple column clustering
How would clustering on two columns, id and country, play out in this
scenario?
Your input is very much appreciated.
Thanks
Kishore