You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Rajbir singh <ra...@gmail.com> on 2019/12/02 23:24:27 UTC

Hive Query Performance Tuning

Hi All,

I have a hive query which does the aggregation of amounts by reading from
hive tables and loads the results to another hive table.
I am trying to fine tune the attached query. Read online and came up with
following. Any Ideas I would be really appreciate. Thank you


      1. Indexing:- We can create Index on the tables. (some folks says
index actually make it worse)
https://community.cloudera.com/t5/Support-Questions/Creating-Indexes-in-Hive/td-p/149601

      2. Execution :- Right now the hive queries run mapreduce engine. We
can set the execution engine to Tez for the improved performance (looks
like Cloudera doesn't support Tez )
https://community.cloudera.com/t5/Support-Questions/Tez-Engine-not-working-over-CDH-5-8-2/td-p/49477

      3. Bucketing: improves the join performance if the bucket key and
join keys are common. Bucketing in Hive distributes the data in different
buckets        based on the hash results on the bucket key. It also reduces
the I/O scans during the join process if the process is happening on the
same keys        (columns).
            SET hive.enforce.bucketing=true;
            SET hive.optimize.bucketmapjoin=true.

       4. Cost-Based Optimization in Hive (CBO)
           before submitting for final execution Hive optimizes each
Query’s logical and physical execution plan
           However, CBO, performs, further optimizations based on query
cost in a recent addition to Hive. That results in potentially different
decisions: how to order joins, which type of join to
  perform, the degree of parallelism and others.
           set hive.cbo.enable=true;
           set hive.compute.query.using.stats=true;
           set hive.stats.fetch.column.stats=true;
           set hive.stats.fetch.partition.stats=true;


       5. Vectorization In Hive
           To improve the performance of operations we can use Vectorized
query execution. It happens by performing them in batches of 1024 rows at
once instead of single row each time.It
significantly improves query execution time, and is easily enabled with two
parameters settings
           set hive.vectorized.execution = true
           set hive.vectorized.execution.enabled = true


Bucketing parameters already set to true
Cost-Based Optimization in Hive (CBO) parameters set to true
Vectorization parameters set to true


I am not sure what else I can do to make the query work faster

-- 
Regards,
Rajbir

Re: Hive Query Performance Tuning

Posted by Matthew Dixon <ma...@jagex.com>.
Hi Rajbir, some thoughts to consider,

I’m wondering what the row_number() functionality is doing.  Because the window frame has no ORDER BY clause the result may not be deterministic, is this the expected behaviour?  I ask because analytic functions can be expensive to compute so make sure you definitely need it.  Are you specifically trying to remove the first event from all entityID’s history (rank > 1 on lines 116 and 181) or are you trying to remove duplicates perhaps?  What is the cardinality of the entityID fields you’re using to calculate it?  If the cardinality is low relative to the overall row count this would definitely impact runtime since all data for a single entity must go to the same task.

I’m assuming you’ve got hive.exec.parallel set to something appropriate here?

You’re using the IN keyword to pass a list of entityIDs in the second and third queries in the UNION (lines 109 and 174).  You may try rewriting the query to JOIN to these since I’m not sure if the map-side join optimisation will be leveraged using IN-syntax.

As a general piece of advice I find that watching a Hive application unfold on Resource Manager and looking at if certain tasks or stages take longer can be a good way to understand what aspects of the query are most expensive to compute.

Matt


From: Rajbir singh <ra...@gmail.com>
Reply to: "user@hive.apache.org" <us...@hive.apache.org>
Date: Tuesday, 3 December 2019 at 09:25
To: "user@hive.apache.org" <us...@hive.apache.org>
Subject: Hive Query Performance Tuning

Hi All,

I have a hive query which does the aggregation of amounts by reading from hive tables and loads the results to another hive table.
I am trying to fine tune the attached query. Read online and came up with following. Any Ideas I would be really appreciate. Thank you


      1. Indexing:- We can create Index on the tables. (some folks says index actually make it worse) https://community.cloudera.com/t5/Support-Questions/Creating-Indexes-in-Hive/td-p/149601<https://community.cloudera.com/t5/Support-Questions/Creating-Indexes-in-Hive/td-p/149601>

      2. Execution :- Right now the hive queries run mapreduce engine. We can set the execution engine to Tez for the improved performance (looks like Cloudera doesn't support Tez ) https://community.cloudera.com/t5/Support-Questions/Tez-Engine-not-working-over-CDH-5-8-2/td-p/49477<https://community.cloudera.com/t5/Support-Questions/Tez-Engine-not-working-over-CDH-5-8-2/td-p/49477>

      3. Bucketing: improves the join performance if the bucket key and join keys are common. Bucketing in Hive distributes the data in different buckets        based on the hash results on the bucket key. It also reduces the I/O scans during the join process if the process is happening on the same keys        (columns).
            SET hive.enforce.bucketing=true;
            SET hive.optimize.bucketmapjoin=true.

       4. Cost-Based Optimization in Hive (CBO)
           before submitting for final execution Hive optimizes each Query’s logical and physical execution plan
           However, CBO, performs, further optimizations based on query cost in a recent addition to Hive. That results in potentially different decisions: how to order joins, which type of join to                         perform, the degree of parallelism and others.
           set hive.cbo.enable=true;
           set hive.compute.query.using.stats=true;
           set hive.stats.fetch.column.stats=true;
           set hive.stats.fetch.partition.stats=true;


       5. Vectorization In Hive
To improve the performance of operations we can use Vectorized query execution. It happens by performing them in batches of 1024 rows at once instead of single row each time.It<http://time.It> significantly improves query execution time, and is easily enabled with two parameters settings
           set hive.vectorized.execution = true
           set hive.vectorized.execution.enabled = true


Bucketing parameters already set to true
Cost-Based Optimization in Hive (CBO) parameters set to true
Vectorization parameters set to true


I am not sure what else I can do to make the query work faster

--
Regards,
Rajbir

Disclaimer

The sender does not guarantee that this message, including any attachment, is secure or virus free. Also, it is confidential and may be privileged or otherwise protected from disclosure. 
If you are not the intended recipient, do not disclose or copy it or its contents. Please telephone or email the sender and delete the message entirely from your system. 
No binding obligations or payment commitments are to be derived from the contents of this email unless and until a clear written agreement containing all the necessary terms and conditions is properly executed. 
Jagex Limited is a company registered in England & Wales with company number 03982706 and a registered office at 220 Science Park, Milton Road, Cambridge, CB4 0WA, UK.