You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Daniel Haviv <da...@veracity-group.com> on 2015/03/26 18:26:44 UTC
Understanding Hive's execution plan
Hi,
Can anyone direct me to a good explanation on understanding Hive's execution plan?
Thanks,
Daniel
Re: Understanding Hive's execution plan
Posted by saquib khan <sk...@gmail.com>.
Try this, it has a chapter on hive
http://hdl.handle.net/10919/5122 <http://hdl.handle.net/10919/51223>3
On Thursday, March 26, 2015, Daniel Haviv <da...@veracity-group.com>
wrote:
> Hi,
> Can anyone direct me to a good explanation on understanding Hive's
> execution plan?
>
> Thanks,
> Daniel
>
RE: Understanding Hive's execution plan
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Sure Daniel. Apologies.
Mich Talebzadeh
http://talebzadehmich.wordpress.com
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Daniel Haviv [mailto:daniel.haviv@veracity-group.com]
Sent: 27 March 2015 08:13
To: user@hive.apache.org
Subject: Re: Understanding Hive's execution plan
Hi Mich,
For the future, please refrain from hijacking threads and ask your questions in a separate one.
Thanks,
Daniel
On 27 במרץ 2015, at 00:44, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
I am very new to hive optimiser
Here I have a table with 4 million rows imported from Oracle via sqoop/hive. In this table object_id column is unique. Oracle table has primary key constraint on object_id column which is basically a unique B-tree index.
I do a very simple query to see how many unique values are for object_id in table. The answer is they are as many as number of rows.
So query like below in Oracle
SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM hddtester.tdash;
Should return 1. Now Oracle optimiser only needs to read the index key and work it out WITHOUT touching the underlying table and it does that
----------------------------------------------------------
Plan hash value: 1988751498
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 13952 (1)| 00:02:48 |
| 1 | SORT AGGREGATE | | 1 | 26 | | |
| 2 | VIEW | VW_DAG_0 | 4000K| 99M| 13952 (1)| 00:02:48 |
| 3 | SORT GROUP BY NOSORT| | 4000K| 22M| 13952 (1)| 00:02:48 |
| 4 | INDEX FULL SCAN | TDASH_PK | 4000K| 22M| 13952 (1)| 00:02:48 |
-----------------------------------------------------------------------------------
Here it is shown as Operation Id = 4 “INDEX FULL SCAN”. Please note that the table itself is not touched as expected
Now I have the same table “tdash” in Hive with a compact index on object_id. I have analysed stats for table with “analyze table tdash compute statistics”. Now I do explain as below
hive> explain SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM tdash;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: tdash
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: object_id (type: double)
outputColumnNames: object_id
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT object_id), count(object_id)
keys: object_id (type: double)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: double)
sort order: +
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col0:0._col0), count(VALUE._col1)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: (_col0 / _col1) (type: double)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.691 seconds, Fetched: 50 row(s)
Trying to understand above does keys: object_id (type: double) refers to use of index here? I dropped that index and the same plan was produced!
Thanks
Mich Talebzadeh
http://talebzadehmich.wordpress.com
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Daniel Haviv [mailto:daniel.haviv@veracity-group.com]
Sent: 26 March 2015 17:27
To: user@hive.apache.org
Subject: Understanding Hive's execution plan
Hi,
Can anyone direct me to a good explanation on understanding Hive's execution plan?
Thanks,
Daniel
Re: Understanding Hive's execution plan
Posted by Daniel Haviv <da...@veracity-group.com>.
Hi Mich,
For the future, please refrain from hijacking threads and ask your questions in a separate one.
Thanks,
Daniel
> On 27 במרץ 2015, at 00:44, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>
> I am very new to hive optimiser
>
> Here I have a table with 4 million rows imported from Oracle via sqoop/hive. In this table object_id column is unique. Oracle table has primary key constraint on object_id column which is basically a unique B-tree index.
>
> I do a very simple query to see how many unique values are for object_id in table. The answer is they are as many as number of rows.
>
> So query like below in Oracle
>
> SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM hddtester.tdash;
>
> Should return 1. Now Oracle optimiser only needs to read the index key and work it out WITHOUT touching the underlying table and it does that
>
> ----------------------------------------------------------
> Plan hash value: 1988751498
>
> -----------------------------------------------------------------------------------
> | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
> -----------------------------------------------------------------------------------
> | 0 | SELECT STATEMENT | | 1 | 26 | 13952 (1)| 00:02:48 |
> | 1 | SORT AGGREGATE | | 1 | 26 | | |
> | 2 | VIEW | VW_DAG_0 | 4000K| 99M| 13952 (1)| 00:02:48 |
> | 3 | SORT GROUP BY NOSORT| | 4000K| 22M| 13952 (1)| 00:02:48 |
> | 4 | INDEX FULL SCAN | TDASH_PK | 4000K| 22M| 13952 (1)| 00:02:48 |
> -----------------------------------------------------------------------------------
>
> Here it is shown as Operation Id = 4 “INDEX FULL SCAN”. Please note that the table itself is not touched as expected
>
> Now I have the same table “tdash” in Hive with a compact index on object_id. I have analysed stats for table with “analyze table tdash compute statistics”. Now I do explain as below
>
> hive> explain SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM tdash;
> OK
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Map Operator Tree:
> TableScan
> alias: tdash
> Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: object_id (type: double)
> outputColumnNames: object_id
> Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
> Group By Operator
> aggregations: count(DISTINCT object_id), count(object_id)
> keys: object_id (type: double)
> mode: hash
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type: double)
> sort order: +
> Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col2 (type: bigint)
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(DISTINCT KEY._col0:0._col0), count(VALUE._col1)
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: (_col0 / _col1) (type: double)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
>
> Time taken: 0.691 seconds, Fetched: 50 row(s)
>
> Trying to understand above does keys: object_id (type: double) refers to use of index here? I dropped that index and the same plan was produced!
>
> Thanks
>
>
> Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
>
> Publications due shortly:
> Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache
>
> NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
>
> From: Daniel Haviv [mailto:daniel.haviv@veracity-group.com]
> Sent: 26 March 2015 17:27
> To: user@hive.apache.org
> Subject: Understanding Hive's execution plan
>
> Hi,
> Can anyone direct me to a good explanation on understanding Hive's execution plan?
>
> Thanks,
> Daniel
RE: Understanding Hive's execution plan
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
I am very new to hive optimiser
Here I have a table with 4 million rows imported from Oracle via sqoop/hive. In this table object_id column is unique. Oracle table has primary key constraint on object_id column which is basically a unique B-tree index.
I do a very simple query to see how many unique values are for object_id in table. The answer is they are as many as number of rows.
So query like below in Oracle
SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM hddtester.tdash;
Should return 1. Now Oracle optimiser only needs to read the index key and work it out WITHOUT touching the underlying table and it does that
----------------------------------------------------------
Plan hash value: 1988751498
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 13952 (1)| 00:02:48 |
| 1 | SORT AGGREGATE | | 1 | 26 | | |
| 2 | VIEW | VW_DAG_0 | 4000K| 99M| 13952 (1)| 00:02:48 |
| 3 | SORT GROUP BY NOSORT| | 4000K| 22M| 13952 (1)| 00:02:48 |
| 4 | INDEX FULL SCAN | TDASH_PK | 4000K| 22M| 13952 (1)| 00:02:48 |
-----------------------------------------------------------------------------------
Here it is shown as Operation Id = 4 “INDEX FULL SCAN”. Please note that the table itself is not touched as expected
Now I have the same table “tdash” in Hive with a compact index on object_id. I have analysed stats for table with “analyze table tdash compute statistics”. Now I do explain as below
hive> explain SELECT (COUNT(DISTINCT(object_id))/COUNT(object_id)) FROM tdash;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: tdash
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: object_id (type: double)
outputColumnNames: object_id
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT object_id), count(object_id)
keys: object_id (type: double)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: double)
sort order: +
Statistics: Num rows: 4000000 Data size: 32564651117 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col0:0._col0), count(VALUE._col1)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: (_col0 / _col1) (type: double)
outputColumnNames: _col0
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.691 seconds, Fetched: 50 row(s)
Trying to understand above does keys: object_id (type: double) refers to use of index here? I dropped that index and the same plan was produced!
Thanks
Mich Talebzadeh
http://talebzadehmich.wordpress.com
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Daniel Haviv [mailto:daniel.haviv@veracity-group.com]
Sent: 26 March 2015 17:27
To: user@hive.apache.org
Subject: Understanding Hive's execution plan
Hi,
Can anyone direct me to a good explanation on understanding Hive's execution plan?
Thanks,
Daniel
Re: Understanding Hive's execution plan
Posted by Bala Krishna Gangisetty <ba...@altiscale.com>.
Hi Daniel - the below link can shed some light on it.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
--Bala G.
On Thu, Mar 26, 2015 at 10:26 AM, Daniel Haviv <
daniel.haviv@veracity-group.com> wrote:
> Hi,
> Can anyone direct me to a good explanation on understanding Hive's
> execution plan?
>
> Thanks,
> Daniel
>