Posted to user@hive.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2016/01/22 11:14:57 UTC

Query with Hive on Spark fails on partitioned table

Hi,

 

Hive version 1.2.1, Spark version 1.3.1

 

I have an ORC table called sales as follows:

 

+-----------------------------------------------------------------------------+--+
|                                createtab_stmt                               |
+-----------------------------------------------------------------------------+--+
| CREATE TABLE `sales`(                                                        |
|   `prod_id` bigint,                                                          |
|   `cust_id` bigint,                                                          |
|   `time_id` timestamp,                                                       |
|   `channel_id` bigint,                                                       |
|   `promo_id` bigint,                                                         |
|   `quantity_sold` decimal(10,0),                                             |
|   `amount_sold` decimal(10,0))                                               |
| PARTITIONED BY (                                                             |
|   `year` int,                                                                |
|   `month` int)                                                               |
| CLUSTERED BY (                                                               |
|   prod_id,                                                                   |
|   cust_id,                                                                   |
|   time_id,                                                                   |
|   channel_id,                                                                |
|   promo_id)                                                                  |
| INTO 256 BUCKETS                                                             |
| ROW FORMAT SERDE                                                             |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                |
| STORED AS INPUTFORMAT                                                        |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                          |
| OUTPUTFORMAT                                                                 |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                         |
| LOCATION                                                                     |
|   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'            |
| TBLPROPERTIES (                                                              |
|   'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',  |
|   'orc.bloom.filter.fpp'='0.05',                                             |
|   'orc.compress'='SNAPPY',                                                   |
|   'orc.create.index'='true',                                                 |
|   'orc.row.index.stride'='10000',                                            |
|   'orc.stripe.size'='268435456',                                             |
|   'transient_lastDdlTime'='1451814921')                                      |
+-----------------------------------------------------------------------------+--+
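
For reference, the table is partitioned by (year, month) and clustered into 256 buckets, with bloom filters on the five key columns. The partition layout can be listed with:

show partitions sales;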

 

Run a simple query: select count(1) from smallsales where channel_id = 3;

 

With MapReduce, the result comes back with no error:

 

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
No rows affected (0.001 seconds)
0: jdbc:hive2://rhes564:10010/default> select count(1) from smallsales where channel_id = 3;
INFO  : Starting Job = job_1453456459095_0001, Tracking URL = http://rhes564:8088/proxy/application_1453456459095_0001/
INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill job_1453456459095_0001
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2016-01-22 09:55:03,511 Stage-1 map = 0%,  reduce = 0%
INFO  : 2016-01-22 09:55:13,821 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.12 sec
INFO  : 2016-01-22 09:55:19,993 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.39 sec
INFO  : MapReduce Total cumulative CPU time: 6 seconds 390 msec
INFO  : Ended Job = job_1453456459095_0001
+----------+--+
|   _c0    |
+----------+--+
| 2939942  |
+----------+--+

 

However, with Hive on Spark, the equivalent query against the partitioned ORC table sales fails:

 

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=spark;
No rows affected (0.003 seconds)
0: jdbc:hive2://rhes564:10010/default> select count(1) from sales where channel_id = 3;
INFO  :
Query Hive on Spark job[0] stages:
INFO  : 0
INFO  : 1
INFO  :
Status: Running (Hive on Spark job[0])
INFO  : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO  : 2016-01-22 09:59:05,417 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:08,429 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:11,441 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:14,453 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:17,465 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:20,478 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:23,490 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:26,503 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:29,514 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:32,526 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:34,535 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:37,555 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:40,567 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:43,579 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:46,592 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:49,604 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:52,616 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:55,628 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 09:59:58,640 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:01,652 Stage-0_0: 0(+1,-1)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:02,656 Stage-0_0: 0(+1,-2)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:05,668 Stage-0_0: 0(+1,-2)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:08,681 Stage-0_0: 0(+1,-2)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:11,693 Stage-0_0: 0(+1,-2)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:14,705 Stage-0_0: 0(+1,-2)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:17,717 Stage-0_0: 0(+1,-2)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:20,730 Stage-0_0: 0(+1,-2)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:23,744 Stage-0_0: 0(+1,-3)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:26,758 Stage-0_0: 0(+1,-3)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:29,770 Stage-0_0: 0(+1,-3)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:32,782 Stage-0_0: 0(+1,-3)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:35,794 Stage-0_0: 0(+1,-3)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:38,807 Stage-0_0: 0(+1,-3)/1   Stage-1_0: 0/1
INFO  : 2016-01-22 10:00:41,819 Stage-0_0: 0(+1,-3)/1   Stage-1_0: 0/1
ERROR : Status: Failed
Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3)
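
Return code 3 from SparkTask does not say what actually went wrong; the real cause should be in the Spark executor logs (with a YARN master, yarn logs -applicationId <application id> pulls them back once the application ends). The task counters above show the one map task failing and being retried repeatedly, so one guess is that it runs out of memory scanning the 256-bucket ORC table. A minimal sketch of that test, assuming memory really is the issue (the sizes below are illustrative only, not recommendations):

set hive.execution.engine=spark;
set spark.executor.memory=4g;
set spark.driver.memory=2g;
select count(1) from sales where channel_id = 3;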

 

It looks like Hive on Spark is problematic with partitioned ORC tables, as I have the same data in a table defined as text, with the same number of rows but NOT partitioned, and there the query works (output below).
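
To narrow down whether the trigger is the partitioning, the bucketing, or ORC itself, a couple of tests worth trying (the table name sales_orc_flat and the partition values year = 1998, month = 1 are made up for illustration):

-- same data in ORC, but with no partitions or buckets
create table sales_orc_flat stored as orc as select * from sales;
select count(1) from sales_orc_flat where channel_id = 3;

-- a single explicit partition of the original table
select count(1) from sales where year = 1998 and month = 1 and channel_id = 3;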

 

0: jdbc:hive2://rhes564:10010/default> show create table smallsales;
+-------------------------------------------------------------------------+--+
|                             createtab_stmt                              |
+-------------------------------------------------------------------------+--+
| CREATE TABLE `smallsales`(                                              |
|   `prod_id` bigint,                                                     |
|   `cust_id` bigint,                                                     |
|   `time_id` timestamp,                                                  |
|   `channel_id` bigint,                                                  |
|   `promo_id` bigint,                                                    |
|   `quantity_sold` decimal(10,0),                                        |
|   `amount_sold` decimal(10,0))                                          |
| ROW FORMAT SERDE                                                        |
|   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'                  |
| STORED AS INPUTFORMAT                                                   |
|   'org.apache.hadoop.mapred.TextInputFormat'                            |
| OUTPUTFORMAT                                                            |
|   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'          |
| LOCATION                                                                |
|   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/smallsales'  |
| TBLPROPERTIES (                                                         |
|   'COLUMN_STATS_ACCURATE'='true',                                       |
|   'last_modified_by'='hduser',                                          |
|   'last_modified_time'='1451644705',                                    |
|   'numFiles'='1',                                                       |
|   'numRows'='5000000',                                                  |
|   'rawDataSize'='193437457',                                            |
|   'totalSize'='198437457',                                              |
|   'transient_lastDdlTime'='1451784743')                                 |
+-------------------------------------------------------------------------+--+

 

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=spark;
No rows affected (0.001 seconds)
0: jdbc:hive2://rhes564:10010/default> select count(1) from smallsales where channel_id = 3;
INFO  :
Query Hive on Spark job[1] stages:
INFO  : 2
INFO  : 3
INFO  :
Status: Running (Hive on Spark job[1])
INFO  : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO  : 2016-01-22 10:22:18,126 Stage-2_0: 0(+1)/1      Stage-3_0: 0/1
INFO  : 2016-01-22 10:22:21,140 Stage-2_0: 1/1 Finished Stage-3_0: 0(+1)/1
INFO  : 2016-01-22 10:22:22,143 Stage-2_0: 1/1 Finished Stage-3_0: 1/1 Finished
INFO  : Status: Finished successfully in 5.03 seconds
+----------+--+
|   _c0    |
+----------+--+
| 2939942  |
+----------+--+
1 row selected (5.16 seconds)
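
For completeness, comparing the plans each engine produces for the failing table might also show where they diverge; a minimal sketch:

set hive.execution.engine=mr;
explain select count(1) from sales where channel_id = 3;
set hive.execution.engine=spark;
explain select count(1) from sales where channel_id = 3;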

 

Dr Mich Talebzadeh

 

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4.

Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Peridale Technology Ltd, its subsidiaries nor their employees accept any responsibility.