You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Soupam Mandal <so...@gmail.com> on 2019/08/26 17:01:32 UTC

Hive Query Optimization

0

We have 7 tables and each table is partitioned by record_date.There is a
query which involves inner join with all these tables and join is based on
consumer_id. The join involves multiple partition join. Currently querying
1 week data takes very long time around 20-30 mins. We want to optimize
this query. The root cause of this slowness is the data shuffling and map
reduce.Each table contains around 8TB data on compression. Table is
compressed with ORC-Zlib.

We tried bucketing along with the partitioning. So we bucketed each
partition of the table into 100 buckets based on the consumer_id as
consumer_id is the joining key. When there is a query involving one
partition there is improvement in the query perf as it uses merge join. But
whenever there is a query involving multiple partitions the query plan
starts shuffling data and no improvements in query perf. We are using hive
tez engine and vectorization is enabled.Can anyone suggest what approach we
should follow when there is query with multiple tables join and multiple
partitions?