Posted to user@hive.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2015/12/29 11:53:41 UTC
Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hi,
I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion
rows as follows:
show create table sales;
CREATE TABLE `sales`(
  `prod_id` bigint,
  `cust_id` bigint,
  `time_id` timestamp,
  `channel_id` bigint,
  `promo_id` bigint,
  `quantity_sold` decimal(10,0),
  `amount_sold` decimal(10,0))
CLUSTERED BY (
  prod_id,
  cust_id,
  time_id,
  channel_id,
  promo_id)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'last_modified_by'='hduser',
  'last_modified_time'='1451305626',
  'numFiles'='11',
  'numRows'='1000000000',
  'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID',
  'orc.bloom.filter.fpp'='0.05',
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.row.index.stride'='10000',
  'orc.stripe.size'='268435456',
  'rawDataSize'='296000000000',
  'totalSize'='2678882153',
  'transient_lastDdlTime'='1451305626')
I run the following query, which touches only the sales table, against Hive:
SELECT
      rs.Customer_ID
    , rs.Number_of_orders
    , rs.Total_customer_amount
    , rs.Average_order
    , rs.Standard_deviation
FROM
(
    SELECT cust_id AS Customer_ID,
           COUNT(amount_sold) AS Number_of_orders,
           SUM(amount_sold) AS Total_customer_amount,
           AVG(amount_sold) AS Average_order,
           stddev_samp(amount_sold) AS Standard_deviation
    FROM sales
    GROUP BY cust_id
    HAVING SUM(amount_sold) > 94000
    AND AVG(amount_sold) < stddev_samp(amount_sold)
) rs
ORDER BY
    -- Total_customer_amount DESC
    3 DESC
Hive comes back in 17 minutes with 5,948 rows
bl -f sales.hql > sales.log
Connecting to jdbc:hive2://rhes564:10010/default
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
No rows affected (0.097 seconds)
No rows affected (0.001 seconds)
No rows affected (0.001 seconds)
No rows affected (0.038 seconds)
INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
INFO :
Query Hive on Spark job[0] stages:
INFO : 0
INFO : 1
INFO : 2
INFO :
Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
INFO : Status: Finished successfully in 1036.00 seconds
5,948 rows selected (1074.817 seconds)
So Hive returns 5,948 rows in about 17 minutes. In contrast, IQ returns
5,947 rows in 23 seconds.
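One way to see where those 17 minutes go is to difference the timestamps in the progress log. The sketch below is only a rough reading of that log, not a profile: the four timestamps are copied from the log lines, the stage boundaries are taken to be the lines where each stage first shows as "Finished", and the Stage-0 figure is measured from the first progress line, so it includes a few seconds of waiting before any task is running.

```python
from datetime import datetime

# Timestamps copied from the beeline progress log in this post.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def ts(text):
    """Parse a log timestamp such as '2015-12-29 09:33:25,815'."""
    return datetime.strptime(text, FMT)


first_report = ts("2015-12-29 09:33:25,815")  # first progress line, nothing running yet
stage0_done = ts("2015-12-29 09:50:10,606")   # Stage-0_0: 11/11 Finished
stage1_done = ts("2015-12-29 09:50:37,768")   # Stage-1_0: 1009/1009 Finished
stage2_done = ts("2015-12-29 09:50:38,771")   # Stage-2_0: 1/1 Finished

stage0_s = (stage0_done - first_report).total_seconds()
stage1_s = (stage1_done - stage0_done).total_seconds()
stage2_s = (stage2_done - stage1_done).total_seconds()
logged_s = (stage2_done - first_report).total_seconds()

for name, secs in [("Stage-0", stage0_s), ("Stage-1", stage1_s), ("Stage-2", stage2_s)]:
    print(f"{name}: {secs:9.3f} s  ({secs / logged_s:6.1%} of the logged run)")
```

On these numbers almost all of the elapsed time is Stage-0, which has 11 tasks (the same count as numFiles=11) and never shows more than two of them running at once. That suggests, without proving it, that the scan stage and its degree of parallelism deserve a look before any single aggregate function does.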
Sybase IQ is a columnar database, so each column is stored as a fast
projection (FP) index by default. In addition, I have created LF (bitmap)
indexes on the dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID,
PROMO_ID). Of these, the query only touches CUST_ID.
My suspicion is that the standard deviation function stddev_samp() could be
the bottleneck. Is that likely?
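For what it is worth, a sample standard deviation does not need a second pass over the data: it can be derived from the same kind of running totals that COUNT, SUM and AVG already keep, and those totals can be merged across tasks. The sketch below illustrates that idea only; it is not Hive's actual implementation, the function names (partial, merge, finalize) are made up for the example, and the sum-of-squares formula is chosen for brevity rather than numerical robustness.

```python
import math
import statistics


def partial(values):
    """Per-chunk partial aggregate: (count, sum, sum of squares)."""
    n, s, ss = 0, 0, 0
    for v in values:
        n += 1
        s += v
        ss += v * v
    return n, s, ss


def merge(a, b):
    """Combine two partial aggregates, e.g. produced by different tasks."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def finalize(p):
    """Turn a merged partial into COUNT, SUM, AVG and sample std dev."""
    n, s, ss = p
    avg = s / n
    # Sample variance from the running sums; requires n > 1.
    var = (ss - s * s / n) / (n - 1)
    return n, s, avg, math.sqrt(var)


# Hypothetical amount_sold values for one customer, split over two chunks.
chunk_a = [10, 20]
chunk_b = [30, 40]
count, total, avg, sd = finalize(merge(partial(chunk_a), partial(chunk_b)))

# Cross-check against the standard library's two-pass sample std dev.
assert math.isclose(sd, statistics.stdev(chunk_a + chunk_b))
print(count, total, avg, round(sd, 4))
```

If an engine evaluates the aggregate along these lines, the extra cost over SUM and AVG is one more accumulator per group, so it would be surprising for the function alone to account for minutes of run time. Running the query with and without stddev_samp() is the direct way to check.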
Thanks
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly
http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
I agree, but Hive on Spark 1.3.1 is the only combination I have managed to get working. Still, it is twice as fast as Hive on MapReduce.
Just to clarify, my understanding is that the optimiser is provided by Hive and is the same for both execution engines. Is there anything specific that Spark 1.3.1 lacks compared to Spark 1.5.1 when executing the query?
Thanks
Mich Talebzadeh
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 31 December 2015 18:44
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
You are using an old version of Spark and it cannot leverage all optimizations of Hive, so I think that your conclusion cannot be as easy as you might think.
Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Jörn Franke <jo...@gmail.com>.
You are using an old version of Spark and it cannot leverage all optimizations of Hive, so I think that your conclusion cannot be as easy as you might think.
> On 31 Dec 2015, at 19:34, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>
> Ok guys.
> [...]
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Ok guys.
I have not yet succeeded in installing TEZ, so I have not been able to try the query on TEZ as well.
Just a reminder that the query used is a pretty common one: get the total amount sold for each calendar month from sales (1 billion rows) and times.
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
In total 48 rows are returned.
Now, having thought about it: granted, TEZ is going to be faster than MR, as it is basically MR with DAG thrown at it. On the other hand, Spark has both DAG and in-memory calculation.
The results are as follows:
Optimiser  Engine       Timing            Compression  Total table size
Hive       MapReduce    4673.035 seconds  Snappy       totalSize=2678882153 = 2.5GB
Hive       Spark 1.3.1  1578.817 seconds  Snappy
Columnar   Sybase IQ    30.000 seconds    Native       5GB
Spark clearly outperforms MapReduce by more than a factor of two (close to three on these timings), even taking into account the number of rows in the FACT table, and frankly I would not expect TEZ to beat Spark (to be seen). Having said that, Hive storage is twice as efficient (2.5GB against 5GB in IQ), but I am not sure what one can do to improve the performance. The table in Hive is stored as an ORC table, and it has crossed my mind that maybe we should think about storing every column of an ORC table as an index. That may improve the performance further.
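To put numbers on the comparison, the ratios can be worked out directly from the figures in the results table. This is plain arithmetic on the reported values, nothing more; the variable names are made up for the sketch, and the 5GB figure for IQ is taken as quoted.

```python
# Timings and sizes as reported in the results table of this post.
timings_s = {
    "Hive on MapReduce": 4673.035,
    "Hive on Spark 1.3.1": 1578.817,
    "Sybase IQ": 30.000,
}
hive_total_size_bytes = 2678882153  # totalSize from the Hive table properties
iq_size_gb = 5                      # size quoted for the Sybase IQ table

mr_vs_spark = timings_s["Hive on MapReduce"] / timings_s["Hive on Spark 1.3.1"]
spark_vs_iq = timings_s["Hive on Spark 1.3.1"] / timings_s["Sybase IQ"]
mr_vs_iq = timings_s["Hive on MapReduce"] / timings_s["Sybase IQ"]
hive_size_gib = hive_total_size_bytes / 2**30

print(f"MapReduce / Spark : {mr_vs_spark:6.2f}x")
print(f"Spark / Sybase IQ : {spark_vs_iq:6.2f}x")
print(f"MapReduce / IQ    : {mr_vs_iq:6.2f}x")
print(f"Hive ORC size     : {hive_size_gib:6.2f} GiB "
      f"(IQ is about {iq_size_gb / hive_size_gib:.1f} times larger)")
```

So on these figures Spark is roughly three times faster than MapReduce, while IQ is still about fifty times faster than Spark, and the ORC table is about half the size of the IQ table.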
HTH
Mich Talebzadeh
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 30 December 2015 19:27
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR in pretty much all cases. Also, with Hive, you'll want to make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Thanks again Jorn.
Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column. For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates FP (fast projection) indexes on every column by default.
Anyway I am trying to download and build TEZ. Do we know which version of TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
Thanks
Mich Talebzadeh
Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Marcin Tustin <mt...@handybook.com>.
Yes, that's why I haven't had to compile anything.
On Wed, Dec 30, 2015 at 4:16 PM, Jörn Franke <jo...@gmail.com> wrote:
> HDP should have TEZ on board already by default.
>
> On 30 Dec 2015, at 21:42, Marcin Tustin <mt...@handybook.com> wrote:
>
> I'm afraid I use the HDP distribution so I haven't yet had to compile
> anything. (Incidentally, this isn't a recommendation of HDP over anything
> else).
>
> On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh <mi...@peridale.co.uk>
> wrote:
>
>> Thanks Marcin
>>
>> Trying to build TEZ 0.7 in
>>
>> /usr/lib/apache-tez-0.7.0-src
>>
>> using
>>
>> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>>
>> with mvn version 3.2.5 (as opposed to 3.3), as I read that I can build it
>> OK with 3.2.5, but I get the same error as below
>>
>> mvn --version
>> Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
>> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>> Java version: 1.7.0_25, vendor: Oracle Corporation
>> Java home: /usr/java/jdk1.7.0_25/jre
>>
>> I get this error
>>
>> [INFO] tez-ui ............................................. FAILURE [ 0.411 s]
>> [DEBUG] -- end configuration --
>> [INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>> [INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
>>
>> [ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
>>
>> Any ideas? There is little information available on the net.
>>
>> Thanks
>>
>> Mich Talebzadeh
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
>> accept any responsibility.
>>
>>
>>
>> *From:* Marcin Tustin [mailto:mtustin@handybook.com]
>> *Sent:* 30 December 2015 19:27
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: Running the same query on 1 billion rows fact table in
>> Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>> I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is
>> much faster than MR in pretty much all cases. Also, with hive, you'll make
>> sure you've performed optimizations like aligning ORC stripe sizes with
>> HDFS block sizes, and concatenated your tables (not so much an optimization
>> as a must for avoiding the small files problem).
>>
>>
>>
>> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk>
>> wrote:
>>
>> Thanks again Jorn.
>>
>>
>>
>>
>>
>> Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I
>> have compression enabled. The FACT table in IQ (sales) has LF (read bitmap)
>> indexes on the time_id column. For the dimension table (times) I have
>> time_id defined as primary key. Also Sybase IQ creates FP (fast projection)
>> indexes on every column by default.
>>
>>
>>
>> Anyway I am trying to download and build TEZ. Do we know which version of
>> TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>> *Sybase ASE 15 Gold Medal Award 2008*
>>
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>>
>>
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>>
>> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
>> 15", ISBN 978-0-9563693-0-7*.
>>
>> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
>> 978-0-9759693-0-4*
>>
>> *Publications due shortly:*
>>
>> *Complex Event Processing in Heterogeneous Environments*, ISBN:
>> 978-0-9563693-3-8
>>
>> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
>> one out shortly
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> *From:* Jörn Franke [mailto:jornfranke@gmail.com]
>> *Sent:* 30 December 2015 16:29
>>
>>
>> *To:* user@hive.apache.org
>> *Subject:* Re: Running the same query on 1 billion rows fact table in
>> Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>>
>> Hmm, I think the execution engine TEZ has (currently) the most
>> optimizations on Hive. What about your hardware - is it the same? Do you
>> also have compression on Sybase?
>>
>> Alternatively you need to wait for Hive for interactive analytics (tez
>> 0.8 + llap).
>>
>>
>> On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>>
>> Hi Jorn,
>>
>>
>>
>> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have
>> not tried it on TEZ. I tried the query on the MR engine and it did not fare
>> better. I also ran it without the STDDEV function and found that the
>> function did not slow it down.
>>
>>
>>
>> I tried a simple query, as follows, built on the sales FACT table (1e9 rows)
>> and the dimension table times (1826 rows)
>>
>>
>>
>> --
>>
>> -- Get the total amount sold for each calendar month
>>
>> --
>>
>> *SELECT t.calendar_month_desc, SUM(s.amount_sold)*
>>
>> *FROM sales s, times t WHERE s.time_id = t.time_id*
>>
>> *GROUP BY t.calendar_month_desc;*
>>
>>
>>
>> Now Sybase IQ comes back in around 30 seconds.
>>
>>
>>
>> Started query at Dec 30 2015 08:14:33:399AM
>>
>> (48 rows affected)
>>
>> Finished query at Dec 30 2015 08:15:04:640AM
>>
>>
>>
>> Whereas Hive with the following setting and running the same query
>>
>>
>>
>> set
>> hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>
>> set hive.optimize.bucketmapjoin=true;
>>
>> set hive.optimize.bucketmapjoin.sortedmerge=true;
>>
>>
>>
>> Comes back in
>>
>>
>>
>> 48 rows selected (1514.687 seconds)
>>
>>
>>
>> I don’t know what else can be done. Obviously this is all schema on read
>> so I am not sure I can change bucketing on FACT table based on one query
>> alone!
>>
>>
>>
>>
>>
>>
>>
>> +--------------------------------------------------------------------+--+
>>
>> | createtab_stmt |
>>
>> +--------------------------------------------------------------------+--+
>>
>> | CREATE TABLE `times`( |
>>
>> | `time_id` timestamp, |
>>
>> | `day_name` varchar(9), |
>>
>> | `day_number_in_week` int, |
>>
>> | `day_number_in_month` int, |
>>
>> | `calendar_week_number` int, |
>>
>> | `fiscal_week_number` int, |
>>
>> | `week_ending_day` timestamp, |
>>
>> | `week_ending_day_id` bigint, |
>>
>> | `calendar_month_number` int, |
>>
>> | `fiscal_month_number` int, |
>>
>> | `calendar_month_desc` varchar(8), |
>>
>> ----------
>>
>> | `days_in_fis_year` bigint, |
>>
>> | `end_of_cal_year` timestamp, |
>>
>> | `end_of_fis_year` timestamp) |
>>
>> | CLUSTERED BY ( |
>>
>> | time_id) |
>>
>> | INTO 256 BUCKETS |
>>
>> | ROW FORMAT SERDE |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>>
>> | STORED AS INPUTFORMAT |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>>
>> | OUTPUTFORMAT |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>>
>> | LOCATION |
>>
>> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
>>
>> | TBLPROPERTIES ( |
>>
>> | 'COLUMN_STATS_ACCURATE'='true', |
>>
>> | 'numFiles'='1', |
>>
>> | 'numRows'='1826', |
>>
>> | 'orc.bloom.filter.columns'='TIME_ID', |
>>
>> | 'orc.bloom.filter.fpp'='0.05', |
>>
>> | 'orc.compress'='SNAPPY', |
>>
>> | 'orc.create.index'='true', |
>>
>> | 'orc.row.index.stride'='10000', |
>>
>> | 'orc.stripe.size'='268435456', |
>>
>> | 'rawDataSize'='0', |
>>
>> | 'totalSize'='11155', |
>>
>> | 'transient_lastDdlTime'='1451429900') |
>>
>>
>>
>> ;
>>
>>
>>
>>
>>
>> *From:* Jörn Franke [mailto:jornfranke@gmail.com <jo...@gmail.com>]
>> *Sent:* 30 December 2015 08:28
>> *To:* user@hive.apache.org
>> *Subject:* Re: Running the same query on 1 billion rows fact table in
>> Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>> Have you tried it with Hive on TEZ? It contains (currently) more
>> optimizations than Hive on Spark.
>>
>> I assume you use the latest Hive version.
>>
>> Additionally you may want to think about calculating statistics
>> (depending on your configuration you need to trigger it) - I am not sure if
>> Spark can use them.
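>>
>> A minimal sketch of triggering the statistics manually, assuming the
>> `sales` table from this thread. Whether Hive on Spark makes use of them
>> is left open, as noted above. The column list is only an illustration,
>> picking the two columns the query touches.
>>
>> ```sql
>> -- Table-level statistics (row count, number of files, sizes).
>> ANALYZE TABLE sales COMPUTE STATISTICS;
>>
>> -- Column-level statistics for the columns used by the query.
>> ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount_sold;
>> ```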
>>
>> I am not sure if bloom filters on the columns you mention make sense. You
>> may also want to increase stride size (depending on your data).
>>
>> Currently you bucket by a lot of fields, which may not make sense. You
>> also may want to sort the data by customer Id in the table.
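>>
>> A minimal sketch of what bucketing and sorting on the customer id alone
>> could look like. The table name `sales_by_cust` is hypothetical, the
>> bucket count is simply carried over from the existing table, and the two
>> enforce settings are assumed to be needed on Hive 1.x.
>>
>> ```sql
>> -- Hypothetical copy of `sales`, bucketed and sorted on cust_id only.
>> CREATE TABLE sales_by_cust (
>>   prod_id       bigint,
>>   cust_id       bigint,
>>   time_id       timestamp,
>>   channel_id    bigint,
>>   promo_id      bigint,
>>   quantity_sold decimal(10,0),
>>   amount_sold   decimal(10,0))
>> CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 256 BUCKETS
>> STORED AS ORC
>> TBLPROPERTIES ('orc.compress'='SNAPPY');
>>
>> -- Make the insert honour the declared bucketing and sort order.
>> SET hive.enforce.bucketing=true;
>> SET hive.enforce.sorting=true;
>>
>> -- Reload the data from the existing table.
>> INSERT OVERWRITE TABLE sales_by_cust
>> SELECT prod_id, cust_id, time_id, channel_id, promo_id,
>>        quantity_sold, amount_sold
>> FROM sales;
>> ```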
>>
>> You also seem to have a lot of reducers, which you may want to decrease.
>>
>>
>>
>> Have you tried without "having stddev_samp" ? Is the query exactly the
>> same as in Sybase?
>>
>>
>> On 29 Dec 2015, at 11:53, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>>
>> Hi,
>>
>>
>>
>> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1
>> billion rows as follows:
>>
>>
>>
>> show create table sales;
>>
>>
>> +-------------------------------------------------------------------------------+--+
>> | createtab_stmt |
>> +-------------------------------------------------------------------------------+--+
>> | CREATE TABLE `sales`( |
>> | `prod_id` bigint, |
>> | `cust_id` bigint, |
>> | `time_id` timestamp, |
>> | `channel_id` bigint, |
>> | `promo_id` bigint, |
>> | `quantity_sold` decimal(10,0), |
>> | `amount_sold` decimal(10,0)) |
>> | CLUSTERED BY ( |
>> | prod_id, |
>> | cust_id, |
>> | time_id, |
>> | channel_id, |
>> | promo_id) |
>> | INTO 256 BUCKETS |
>> | ROW FORMAT SERDE |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>> | STORED AS INPUTFORMAT |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>> | OUTPUTFORMAT |
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>> | LOCATION |
>> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
>> | TBLPROPERTIES ( |
>> | 'COLUMN_STATS_ACCURATE'='true', |
>> | 'last_modified_by'='hduser', |
>> | 'last_modified_time'='1451305626', |
>> | 'numFiles'='11', |
>> | 'numRows'='1000000000', |
>> | 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
>> | 'orc.bloom.filter.fpp'='0.05', |
>> | 'orc.compress'='SNAPPY', |
>> | 'orc.create.index'='true', |
>> | 'orc.row.index.stride'='10000', |
>> | 'orc.stripe.size'='268435456', |
>> | 'rawDataSize'='296000000000', |
>> | 'totalSize'='2678882153', |
>> | 'transient_lastDdlTime'='1451305626') |
>> +-------------------------------------------------------------------------------+--+
>>
>>
>>
>> I use the following query to run against sales table only against Hive
>>
>>
>>
>> SELECT
>>
>> rs.Customer_ID
>>
>> , rs.Number_of_orders
>>
>> , rs.Total_customer_amount
>>
>> , rs.Average_order
>>
>> , rs.Standard_deviation
>>
>> FROM
>>
>> (
>>
>> SELECT cust_id AS Customer_ID,
>>
>> COUNT(amount_sold) AS Number_of_orders,
>>
>> SUM(amount_sold) AS Total_customer_amount,
>>
>> AVG(amount_sold) AS Average_order,
>>
>> stddev_samp(amount_sold) AS Standard_deviation
>>
>> FROM sales
>>
>> GROUP BY cust_id
>>
>> HAVING SUM(amount_sold) > 94000
>>
>> AND AVG(amount_sold) < stddev_samp(amount_sold)
>>
>> ) rs
>>
>> ORDER BY
>>
>> -- Total_customer_amount DESC
>>
>> 3 DESC
>>
>>
>>
>> Hive comes back in 17 minutes with 5,948 rows
>>
>>
>>
>> bl -f sales.hql > sales.log
>>
>> Connecting to jdbc:hive2://rhes564:10010/default
>>
>> Connected to: Apache Hive (version 1.2.1)
>>
>> Driver: Hive JDBC (version 1.2.1)
>>
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>>
>> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
>>
>> No rows affected (0.097 seconds)
>>
>> No rows affected (0.001 seconds)
>>
>> No rows affected (0.001 seconds)
>>
>> No rows affected (0.038 seconds)
>>
>> INFO : Warning: Using constant number 3 in order by. If you try to use
>> position alias when hive.groupby.orderby.position.alias is false, the
>> position alias will be ignored.
>>
>> INFO :
>>
>> Query Hive on Spark job[0] stages:
>>
>> INFO : 0
>>
>> INFO : 1
>>
>> INFO : 2
>>
>> INFO :
>>
>> Status: Running (Hive on Spark job[0])
>>
>> INFO : Job Progress Format
>>
>> CurrentTime StageId_StageAttemptId:
>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>> [StageCost]
>>
>> INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009
>> Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009
>> Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>>
>>
>> INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0:
>> 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished
>> Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished
>> Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished
>> Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished
>> Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished
>> Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished
>> Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished
>> Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished
>> Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished
>> Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished
>> Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished
>> Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished
>> Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished
>> Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished
>> Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished
>> Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished
>> Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished
>> Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished
>> Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished
>> Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished
>> Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished
>> Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished
>> Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished
>> Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished
>> Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished
>> Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished
>> Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished
>> Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished
>> Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
>>
>> INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished
>> Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
>>
>> INFO : Status: Finished successfully in 1036.00 seconds
>>
>> *5,948 rows selected (1074.817 seconds)*
>>
>>
>>
>> So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows
>> in 23 seconds
>>
>>
>>
>> Sybase IQ is a columnar database so each column is created as a fast
>> projection index by default. In addition I have created LF (bitmap) indexes
>> on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now
>> the query only touches CUST_ID.
>>
>>
>>
>> My suspicion is that it is the Standard Deviation function stddev_samp() that
>> could be the bottleneck?
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>>
>>
>>
>>
>> Want to work at Handy? Check out our culture deck and open roles
>> <http://www.handy.com/careers>
>>
>> Latest news <http://www.handy.com/press> at Handy
>>
>> Handy just raised $50m
>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led
>> by Fidelity
Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Jörn Franke <jo...@gmail.com>.
HDP should have TEZ already on board by default.
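
A minimal sketch of switching a Hive session over to Tez, assuming Tez is already installed and configured on the cluster (as it would be on HDP):

```sql
-- Use Tez instead of MapReduce or Spark for this session.
SET hive.execution.engine=tez;

-- Print the current value to confirm the engine in use.
SET hive.execution.engine;
```
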
> On 30 Dec 2015, at 21:42, Marcin Tustin <mt...@handybook.com> wrote:
>
> I'm afraid I use the HDP distribution so I haven't yet had to compile anything. (Incidentally, this isn't a recommendation of HDP over anything else).
>
>> On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>> Thanks Marcin
>>
>>
>>
>> Trying to build TEZ 0.7 in
>>
>>
>>
>> /usr/lib/apache-tez-0.7.0-src
>>
>>
>>
>> using
>>
>>
>>
>> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>>
>>
>>
>> with mvn version 3.2.5 (as opposed to 3.3), as I read that I can build it OK with 3.2.5, but I get the same error as below
>>
>>
>>
>> mvn --version
>>
>> Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
>>
>> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>>
>> Java version: 1.7.0_25, vendor: Oracle Corporation
>>
>> Java home: /usr/java/jdk1.7.0_25/jre
>>
>>
>>
>> I get this error
>>
>>
>>
>> [INFO] tez-ui ............................................. FAILURE [ 0.411 s]
>>
>> [DEBUG] -- end configuration --
>>
>> [INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>>
>> [INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
>>
>>
>>
>>
>>
>> [ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
>>
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
>>
>>
>>
>>
>>
>> Any ideas? There is little info available on the net.
>>
>>
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>> From: Marcin Tustin [mailto:mtustin@handybook.com]
>> Sent: 30 December 2015 19:27
>>
>>
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
>>
>>
>> I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR in pretty much all cases. Also, with hive, you'll want to make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
>>
>>
>>
>> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>>
>> Thanks again Jorn.
>>
>>
>>
>>
>>
>> Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column. For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates FP (fast projection) indexes on every column by default.
>>
>>
>>
>> Anyway I am trying to download and build TEZ. Do we know which version of TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
>> From: Jörn Franke [mailto:jornfranke@gmail.com]
>> Sent: 30 December 2015 16:29
>>
>>
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>>
>> Hmm, I think the execution engine TEZ has (currently) the most optimizations on Hive. What about your hardware - is it the same? Do you also have compression on Sybase?
>>
>> Alternatively you need to wait for Hive for interactive analytics (tez 0.8 + llap).
>>
>>
>> On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>>
>> Hi Jorn,
>>
>>
>>
>> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on TEZ. I tried the query on the MR engine and it did not fare better. I also ran it without the STDDEV function and found that the function did not slow it down.
>>
>>
>>
>> I tried a simple query, as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows)
>>
>>
>>
>> --
>>
>> -- Get the total amount sold for each calendar month
>>
>> --
>>
>> SELECT t.calendar_month_desc, SUM(s.amount_sold)
>>
>> FROM sales s, times t WHERE s.time_id = t.time_id
>>
>> GROUP BY t.calendar_month_desc;
>>
>>
>>
>> Now Sybase IQ comes back in around 30 seconds.
>>
>>
>>
>> Started query at Dec 30 2015 08:14:33:399AM
>>
>> (48 rows affected)
>>
>> Finished query at Dec 30 2015 08:15:04:640AM
>>
>>
>>
>> Whereas Hive with the following setting and running the same query
>>
>>
>>
>> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>>
>> set hive.optimize.bucketmapjoin=true;
>>
>> set hive.optimize.bucketmapjoin.sortedmerge=true;
>>
>>
>>
>> Comes back in
>>
>>
>>
>> 48 rows selected (1514.687 seconds)
>>
>>
>>
>> I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
>>
>>
>>
>>
>>
>>
>>
>> +--------------------------------------------------------------------+--+
>>
>> | createtab_stmt |
>>
>> +--------------------------------------------------------------------+--+
>>
>> | CREATE TABLE `times`( |
>>
>> | `time_id` timestamp, |
>>
>> | `day_name` varchar(9), |
>>
>> | `day_number_in_week` int, |
>>
>> | `day_number_in_month` int, |
>>
>> | `calendar_week_number` int, |
>>
>> | `fiscal_week_number` int, |
>>
>> | `week_ending_day` timestamp, |
>>
>> | `week_ending_day_id` bigint, |
>>
>> | `calendar_month_number` int, |
>>
>> | `fiscal_month_number` int, |
>>
>> | `calendar_month_desc` varchar(8), |
>>
>> ----------
>>
>> | `days_in_fis_year` bigint, |
>>
>> | `end_of_cal_year` timestamp, |
>>
>> | `end_of_fis_year` timestamp) |
>>
>> | CLUSTERED BY ( |
>>
>> | time_id) |
>>
>> | INTO 256 BUCKETS |
>>
>> | ROW FORMAT SERDE |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>>
>> | STORED AS INPUTFORMAT |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>>
>> | OUTPUTFORMAT |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>>
>> | LOCATION |
>>
>> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
>>
>> | TBLPROPERTIES ( |
>>
>> | 'COLUMN_STATS_ACCURATE'='true', |
>>
>> | 'numFiles'='1', |
>>
>> | 'numRows'='1826', |
>>
>> | 'orc.bloom.filter.columns'='TIME_ID', |
>>
>> | 'orc.bloom.filter.fpp'='0.05', |
>>
>> | 'orc.compress'='SNAPPY', |
>>
>> | 'orc.create.index'='true', |
>>
>> | 'orc.row.index.stride'='10000', |
>>
>> | 'orc.stripe.size'='268435456', |
>>
>> | 'rawDataSize'='0', |
>>
>> | 'totalSize'='11155', |
>>
>> | 'transient_lastDdlTime'='1451429900') |
>>
>>
>>
>> ;
>>
>>
>>
>>
>>
>> From: Jörn Franke [mailto:jornfranke@gmail.com]
>> Sent: 30 December 2015 08:28
>> To: user@hive.apache.org
>> Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
>>
>>
>>
>> Have you tried it with Hive on TEZ? It contains (currently) more optimizations than Hive on Spark.
>>
>> I assume you use the latest Hive version.
>>
>> Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
>>
>> I am not sure if bloom filters on the columns you mention make sense. You may also want to increase stride size (depending on your data).
>>
>> Currently you bucket by a lot of fields, which may not make sense. You also may want to sort the data by customer Id in the table.
>>
>> You also seem to have a lot of reducers, which you may want to decrease.
>>
>>
>>
>> Have you tried without "having stddev_samp" ? Is the query exactly the same as in Sybase?
>>
>>
>> On 29 Dec 2015, at 11:53, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>>
>> Hi,
>>
>>
>>
>> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion rows as follows:
>>
>>
>>
>> show create table sales;
>>
>> +-------------------------------------------------------------------------------+--+
>>
>> | createtab_stmt |
>>
>> +-------------------------------------------------------------------------------+--+
>>
>> | CREATE TABLE `sales`( |
>>
>> | `prod_id` bigint, |
>>
>> | `cust_id` bigint, |
>>
>> | `time_id` timestamp, |
>>
>> | `channel_id` bigint, |
>>
>> | `promo_id` bigint, |
>>
>> | `quantity_sold` decimal(10,0), |
>>
>> | `amount_sold` decimal(10,0)) |
>>
>> | CLUSTERED BY ( |
>>
>> | prod_id, |
>>
>> | cust_id, |
>>
>> | time_id, |
>>
>> | channel_id, |
>>
>> | promo_id) |
>>
>> | INTO 256 BUCKETS |
>>
>> | ROW FORMAT SERDE |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>>
>> | STORED AS INPUTFORMAT |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>>
>> | OUTPUTFORMAT |
>>
>> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>>
>> | LOCATION |
>>
>> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
>>
>> | TBLPROPERTIES ( |
>>
>> | 'COLUMN_STATS_ACCURATE'='true', |
>>
>> | 'last_modified_by'='hduser', |
>>
>> | 'last_modified_time'='1451305626', |
>>
>> | 'numFiles'='11', |
>>
>> | 'numRows'='1000000000', |
>>
>> | 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
>>
>> | 'orc.bloom.filter.fpp'='0.05', |
>>
>> | 'orc.compress'='SNAPPY', |
>>
>> | 'orc.create.index'='true', |
>>
>> | 'orc.row.index.stride'='10000', |
>>
>> | 'orc.stripe.size'='268435456', |
>>
>> | 'rawDataSize'='296000000000', |
>>
>> | 'totalSize'='2678882153', |
>>
>> | 'transient_lastDdlTime'='1451305626') |
>>
>> +-------------------------------------------------------------------------------+--+
>>
>>
>>
>> I use the following query to run against sales table only against Hive
>>
>>
>>
>> SELECT
>>
>> rs.Customer_ID
>>
>> , rs.Number_of_orders
>>
>> , rs.Total_customer_amount
>>
>> , rs.Average_order
>>
>> , rs.Standard_deviation
>>
>> FROM
>>
>> (
>>
>> SELECT cust_id AS Customer_ID,
>>
>> COUNT(amount_sold) AS Number_of_orders,
>>
>> SUM(amount_sold) AS Total_customer_amount,
>>
>> AVG(amount_sold) AS Average_order,
>>
>> stddev_samp(amount_sold) AS Standard_deviation
>>
>> FROM sales
>>
>> GROUP BY cust_id
>>
>> HAVING SUM(amount_sold) > 94000
>>
>> AND AVG(amount_sold) < stddev_samp(amount_sold)
>>
>> ) rs
>>
>> ORDER BY
>>
>> -- Total_customer_amount DESC
>>
>> 3 DESC
>>
>>
>>
>> Hive comes back in 17 minutes with 5,948 rows
>>
>>
>>
>> bl -f sales.hql > sales.log
>>
>> Connecting to jdbc:hive2://rhes564:10010/default
>>
>> Connected to: Apache Hive (version 1.2.1)
>>
>> Driver: Hive JDBC (version 1.2.1)
>>
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>>
>> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
>>
>> No rows affected (0.097 seconds)
>>
>> No rows affected (0.001 seconds)
>>
>> No rows affected (0.001 seconds)
>>
>> No rows affected (0.038 seconds)
>>
>> INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
>>
>> INFO :
>>
>> Query Hive on Spark job[0] stages:
>>
>> INFO : 0
>>
>> INFO : 1
>>
>> INFO : 2
>>
>> INFO :
>>
>> Status: Running (Hive on Spark job[0])
>>
>> INFO : Job Progress Format
>>
>> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
>>
>> INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>>
>>
>> INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
>>
>> INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
>>
>> INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
>>
>> INFO : Status: Finished successfully in 1036.00 seconds
>>
>> 5,948 rows selected (1074.817 seconds)
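
A rough way to see where the 1036 seconds went is to difference the timestamps at which each stage is first reported as Finished in the progress log above. The sketch below does that in Python. The dictionary keys and function names are illustrative choices, not anything from Hive or Spark, and the first progress line is used as a stand-in for the job start.

```python
from datetime import datetime

# Four timestamps copied from the progress log above; each marks a stage boundary.
LOG_MARKS = {
    "first_progress_line": "2015-12-29 09:33:25,815",  # Stage-0_0: 0/11
    "stage0_finished": "2015-12-29 09:50:10,606",      # Stage-0_0: 11/11 Finished
    "stage1_finished": "2015-12-29 09:50:37,768",      # Stage-1_0: 1009/1009 Finished
    "stage2_finished": "2015-12-29 09:50:38,771",      # Stage-2_0: 1/1 Finished
}


def parse_log_time(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS,mmm' timestamp as printed in the log."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def stage_durations(marks):
    """Return seconds spent between consecutive stage boundaries."""
    t = {name: parse_log_time(value) for name, value in marks.items()}
    return {
        "stage0": (t["stage0_finished"] - t["first_progress_line"]).total_seconds(),
        "stage1": (t["stage1_finished"] - t["stage0_finished"]).total_seconds(),
        "stage2": (t["stage2_finished"] - t["stage1_finished"]).total_seconds(),
        "total": (t["stage2_finished"] - t["first_progress_line"]).total_seconds(),
    }


if __name__ == "__main__":
    durations = stage_durations(LOG_MARKS)
    for name in ("stage0", "stage1", "stage2"):
        share = 100.0 * durations[name] / durations["total"]
        print(f"{name}: {durations[name]:.3f} s ({share:.1f}% of elapsed time)")
```

On these four timestamps the first stage (11 tasks, with no more than 2 shown running at once in the lines quoted) accounts for roughly 97% of the elapsed time, so the 1009-task stage and the final single-task stage are not where the time goes.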
>>
>>
>>
>> So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 seconds
>>
>>
>>
>> Sybase IQ is a columnar database so each column is created as a fast projection index by default. In addition I have created LF (bitmap) indexes on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now the query only touches CUST_ID.
>>
>>
>>
>> My suspicion is that it is the Standard Deviation function stddev_samp() that could be the bottleneck?
>>
>>
>>
>> Thanks
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>>
Re: Running the same query on 1 billion rows fact table in Hive on
Spark compared to Sybase IQ columnar database
Posted by Marcin Tustin <mt...@handybook.com>.
I'm afraid I use the HDP distribution so I haven't yet had to compile
anything. (Incidentally, this isn't a recommendation of HDP over anything
else).
On Wed, Dec 30, 2015 at 3:33 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:
> Thanks Marcin
>
>
>
> Trying to build TEZ 0.7 in
>
>
>
> /usr/lib/apache-tez-0.7.0-src
>
>
>
> using
>
>
>
> mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
>
>
>
> with mvn version 3.2.5 (as opposed to 3.3), as I read that it builds OK
> with 3.2.5, but it fails with the same error, as below
>
>
>
> mvn --version
>
> Apache Maven *3.2.5* (12a6b3acb947671f09b81f49094c53f426d8cea1;
> 2014-12-14T17:29:23+00:00)
>
> Maven home: /usr/local/apache-maven/apache-maven-3.2.5
>
> Java version: 1.7.0_25, vendor: Oracle Corporation
>
> Java home: /usr/java/jdk1.7.0_25/jre
>
>
>
> *I get this error*
>
>
>
> [INFO] tez-ui ............................................. FAILURE [ 0.411 s]
>
> [DEBUG] -- end configuration --
>
> [INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
>
> [INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
>
> [ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
>
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
>
>
>
>
>
> Any ideas? There is little information available on the net.
>
>
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
>
>
>
> *From:* Marcin Tustin [mailto:mtustin@handybook.com]
> *Sent:* 30 December 2015 19:27
>
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
> I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is
> much faster than MR in pretty much all cases. Also, with Hive, you'll want
> to make sure you've performed optimizations like aligning ORC stripe sizes
> with HDFS block sizes, and concatenated your tables (not so much an
> optimization as a must for avoiding the small files problem).
>
>
>
> On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk>
> wrote:
>
> Thanks again Jorn.
>
>
>
>
>
> Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I
> have compression enabled. The FACT table in IQ (sales) has LF (read bitmap)
> indexes on the time_id column. For the dimension table (times) I have
> time_id defined as primary key. Also Sybase IQ creates FP (fast projection)
> indexes on every column by default.
>
>
>
> Anyway I am trying to download and build TEZ. Do we know which version of
> TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
>
>
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com]
> *Sent:* 30 December 2015 16:29
>
>
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
>
> Hmm, I think the TEZ execution engine (currently) has the most
> optimizations for Hive. What about your hardware - is it the same? Do you
> also have compression on Sybase?
>
> Alternatively, you need to wait for Hive for interactive analytics (TEZ 0.8
> + LLAP).
>
>
> On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>
> Hi Jorn,
>
>
>
> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not
> tried it on TEZ. I tried the query on the MR engine and it did not fare
> better. I also ran it without the STDDEV function and found that the
> function did not slow it down.
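
The finding above fits with how a sample standard deviation can be computed: it needs no extra pass over the data, because it can be carried along with the count, sum and mean in the same scan. The sketch below illustrates this with Welford's update in Python. It is not Hive's implementation of stddev_samp(); the function names and the choice to return None for single-row groups are assumptions made for the example.

```python
import math
from collections import defaultdict


def aggregate(rows):
    """One pass over (cust_id, amount_sold) pairs.

    Keeps count, running total, running mean and the sum of squared
    deviations (m2) per customer, updated with Welford's method.
    """
    state = defaultdict(lambda: [0, 0.0, 0.0, 0.0])  # n, total, mean, m2
    for cust_id, amount in rows:
        s = state[cust_id]
        s[0] += 1
        s[1] += amount
        delta = amount - s[2]
        s[2] += delta / s[0]
        s[3] += delta * (amount - s[2])

    result = {}
    for cust_id, (n, total, mean, m2) in state.items():
        # Sample standard deviation divides by n - 1; undefined for n == 1.
        stddev = math.sqrt(m2 / (n - 1)) if n > 1 else None
        result[cust_id] = (n, total, mean, stddev)
    return result


def having(result, min_total=94000):
    """Apply the query's HAVING filter and order by total amount, descending."""
    kept = [
        (cust_id, n, total, mean, stddev)
        for cust_id, (n, total, mean, stddev) in result.items()
        if stddev is not None and total > min_total and mean < stddev
    ]
    return sorted(kept, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    sample = [(1, 10.0), (1, 20.0), (1, 30.0),
              (7, 1.0), (7, 1.0), (7, 1.0), (7, 100000.0)]
    for row in having(aggregate(sample)):
        print(row)
```

Each input row costs a handful of arithmetic operations regardless of which aggregates are requested, which is one plausible reason why dropping the function made no measurable difference.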
>
>
>
> I tried a simple query, as follows, built on the sales FACT table (1e9 rows)
> and the dimension table times (1826 rows)
>
>
>
> --
>
> -- Get the total amount sold for each calendar month
>
> --
>
> SELECT t.calendar_month_desc, SUM(s.amount_sold)
>
> FROM sales s, times t WHERE s.time_id = t.time_id
>
> GROUP BY t.calendar_month_desc;
>
>
>
> Now Sybase IQ comes back in around 30 seconds.
>
>
>
> Started query at Dec 30 2015 08:14:33:399AM
>
> (48 rows affected)
>
> Finished query at Dec 30 2015 08:15:04:640AM
>
>
>
> Whereas Hive with the following setting and running the same query
>
>
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>
> set hive.optimize.bucketmapjoin=true;
>
> set hive.optimize.bucketmapjoin.sortedmerge=true;
>
>
>
> Comes back in
>
>
>
> 48 rows selected (1514.687 seconds)
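
To put the two comparisons in this thread on the same footing, the sketch below derives the Sybase IQ elapsed time from its start and finish timestamps and divides the Hive timings by the IQ timings. All figures are copied from the messages above; the variable and function names are illustrative only, and the 23-second figure for the first query is the rounded value quoted in the original post.

```python
from datetime import datetime

# Timestamp layout printed by the IQ session, e.g. "Dec 30 2015 08:14:33:399AM".
IQ_FORMAT = "%b %d %Y %I:%M:%S:%f%p"


def iq_elapsed_seconds(started, finished):
    """Seconds between the 'Started query' and 'Finished query' lines."""
    start = datetime.strptime(started, IQ_FORMAT)
    end = datetime.strptime(finished, IQ_FORMAT)
    return (end - start).total_seconds()


# Join query (sales x times), timings from this message.
join_iq_seconds = iq_elapsed_seconds(
    "Dec 30 2015 08:14:33:399AM", "Dec 30 2015 08:15:04:640AM"
)
join_hive_seconds = 1514.687
join_ratio = join_hive_seconds / join_iq_seconds

# Single-table aggregate query, timings from the original post.
agg_iq_seconds = 23.0
agg_hive_seconds = 1074.817
agg_ratio = agg_hive_seconds / agg_iq_seconds

if __name__ == "__main__":
    print(f"join query: IQ {join_iq_seconds:.3f} s, Hive/IQ ratio {join_ratio:.1f}")
    print(f"aggregate query: IQ {agg_iq_seconds:.0f} s, Hive/IQ ratio {agg_ratio:.1f}")
```

Both ratios come out in the high forties, which suggests the gap is not specific to the join or to any one aggregate function.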
>
>
>
> I don’t know what else can be done. Obviously this is all schema on read
> so I am not sure I can change bucketing on FACT table based on one query
> alone!
>
>
>
>
>
>
>
> +--------------------------------------------------------------------+--+
>
> | createtab_stmt |
>
> +--------------------------------------------------------------------+--+
>
> | CREATE TABLE `times`( |
>
> | `time_id` timestamp, |
>
> | `day_name` varchar(9), |
>
> | `day_number_in_week` int, |
>
> | `day_number_in_month` int, |
>
> | `calendar_week_number` int, |
>
> | `fiscal_week_number` int, |
>
> | `week_ending_day` timestamp, |
>
> | `week_ending_day_id` bigint, |
>
> | `calendar_month_number` int, |
>
> | `fiscal_month_number` int, |
>
> | `calendar_month_desc` varchar(8), |
>
> ----------
>
> | `days_in_fis_year` bigint, |
>
> | `end_of_cal_year` timestamp, |
>
> | `end_of_fis_year` timestamp) |
>
> | CLUSTERED BY ( |
>
> | time_id) |
>
> | INTO 256 BUCKETS |
>
> | ROW FORMAT SERDE |
>
> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>
> | STORED AS INPUTFORMAT |
>
> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>
> | OUTPUTFORMAT |
>
> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>
> | LOCATION |
>
> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
>
> | TBLPROPERTIES ( |
>
> | 'COLUMN_STATS_ACCURATE'='true', |
>
> | 'numFiles'='1', |
>
> | 'numRows'='1826', |
>
> | 'orc.bloom.filter.columns'='TIME_ID', |
>
> | 'orc.bloom.filter.fpp'='0.05', |
>
> | 'orc.compress'='SNAPPY', |
>
> | 'orc.create.index'='true', |
>
> | 'orc.row.index.stride'='10000', |
>
> | 'orc.stripe.size'='268435456', |
>
> | 'rawDataSize'='0', |
>
> | 'totalSize'='11155', |
>
> | 'transient_lastDdlTime'='1451429900') |
>
>
>
> ;
>
>
>
>
>
>
>
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com <jo...@gmail.com>]
> *Sent:* 30 December 2015 08:28
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
> Have you tried it with Hive on TEZ? It contains (currently) more
> optimizations than Hive on Spark.
>
> I assume you use the latest Hive version.
>
> Additionally you may want to think about calculating statistics (depending
> on your configuration you need to trigger it) - I am not sure if Spark can
> use them.
>
> I am not sure if bloom filters on the columns you mention make sense. You
> may also want to increase stride size (depending on your data).
>
> Currently you bucket by a lot of fields, which may not make sense. You
> also may want to sort the data by customer Id in the table.
>
> You also seem to have a lot of reducers, which you may want to decrease.
>
>
>
> Have you tried without "having stddev_samp" ? Is the query exactly the
> same as in Sybase?
>
>
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks it is on my to do list.
Regards,
Dr Mich Talebzadeh
From: Lohith Samaga M [mailto:Lohith.Samaga@mphasis.com]
Sent: 05 January 2016 10:52
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hi Mich,
Were you able to run SQLs using the web UI?
More details on Drill shell can be found here <https://drill.apache.org/docs/configuring-the-drill-shell/> .
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Lohith Samaga M [mailto:Lohith.Samaga@mphasis.com]
Sent: Friday, January 01, 2016 10.32
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hi Mich,
I believe you have configured Drill using its web UI to access Hive.
I used Drill web UI to run SQLs. I have not used command line.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 16.53
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks for the tip
Installed Apache drill and need to access hive :)
hduser@rhes564::/usr/lib/apache-drill-1.4.0> bin/drill-embedded
/work/tmp/libnetty-transport-native-epoll2451215308710744204.so: /lib64/libc.so.6: version `GLIBC_2.10' not found (required by /work/tmp/libnetty-transport-native-epoll2451215308710744204.so)
apache drill 1.4.0
"what ever the mind of man can conceive and believe, drill can query"
0: jdbc:drill:zk=local>
Trying to connect to Hive
Usage: connect <url> <username> <password> [driver]
0: jdbc:drill:zk=local> connect //rhes564:10010/default hduser xxxx org.apache.hive.jdbc.HiveDriver ;
No current connection
Any ideas?
Thanks
Mich Talebzadeh
From: Lohith Samaga M [mailto:Lohith.Samaga@mphasis.com]
Sent: 31 December 2015 10:47
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hi Mich,
Did you try the Apache Drill SQL engine?
It can access Hive data.
I have seen it run much faster, but I have not tested it at your scale.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 02.04
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks Marcin
Trying to build TEZ 0.7 in
/usr/lib/apache-tez-0.7.0-src
using
mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
with Maven 3.2.5 (as opposed to 3.3), as I read that TEZ builds OK with 3.2.5. I still get the same error, shown below.
mvn --version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
Maven home: /usr/local/apache-maven/apache-maven-3.2.5
Java version: 1.7.0_25, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_25/jre
I get this error
[INFO] tez-ui ............................................. FAILURE [ 0.411 s]
[DEBUG] -- end configuration --
[INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
[INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
Any ideas? There is little information available on the net.
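A note on the error code in the log above: by common shell convention, an exit status above 128 means the process was terminated by a signal whose number is the status minus 128. That would be consistent with the "Aborted" line printed by with_new_path.sh. A minimal Python sketch of the decoding follows. The helper name decode_exit_status is made up for illustration (it is not part of Maven, npm or the frontend plugin), and the signal numbering assumes a Linux host.

```python
import signal


def decode_exit_status(status):
    """Return the signal name implied by a shell exit status, or None.

    Assumes the common shell convention: status = 128 + signal number.
    """
    if status <= 128:
        return None  # an ordinary exit code, not a signal
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


# 134 is the error code reported for 'npm install --color=false'
print(decode_exit_status(134))
```

On Linux, 134 decodes to SIGABRT, so the node/npm process aborted itself rather than returning an ordinary npm error. The sketch says nothing about why it aborted.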
Thanks
Mich Talebzadeh
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 30 December 2015 19:27
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR in pretty much all cases. Also, with Hive, you'll want to make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and that you've concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
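To make the stripe/block alignment point concrete, here is a small Python sketch of the arithmetic. The stripe size, total size and file count are the values from the sales TBLPROPERTIES quoted in this thread. The HDFS block size is not stated anywhere in the thread, so 128 MiB (the usual Hadoop 2.x default for dfs.blocksize) is only an assumption, and the function name blocks_per_stripe is hypothetical.

```python
MIB = 1024 * 1024

ORC_STRIPE_SIZE = 268_435_456   # 'orc.stripe.size' from the sales table
TOTAL_SIZE = 2_678_882_153      # 'totalSize' from the sales table
NUM_FILES = 11                  # 'numFiles' from the sales table
ASSUMED_BLOCK_SIZE = 128 * MIB  # assumption: not given in the thread


def blocks_per_stripe(stripe_size, block_size):
    """How many HDFS blocks one full ORC stripe would span."""
    return stripe_size / block_size


stripe_mib = ORC_STRIPE_SIZE / MIB
avg_file_mib = TOTAL_SIZE / NUM_FILES / MIB

print(f"stripe size: {stripe_mib:.0f} MiB")
print(f"blocks per stripe: {blocks_per_stripe(ORC_STRIPE_SIZE, ASSUMED_BLOCK_SIZE):.1f}")
print(f"average file size: {avg_file_mib:.1f} MiB")
```

The value to compare against is whatever dfs.blocksize is actually set to on the cluster. If it is smaller than the stripe size, a single full stripe spans more than one block.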
On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:
Thanks again Jorn.
Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column. For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates FP (fast projection) indexes on every column by default.
Anyway, I am trying to download and build TEZ. Does anyone know which version of TEZ works with Hive 1.2.1, please? 0.8 seems to be in alpha.
Thanks
Mich Talebzadeh
From: Jörn Franke [mailto:jornfranke@gmail.com <ma...@gmail.com> ]
Sent: 30 December 2015 16:29
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hmm, I think the TEZ execution engine (currently) has the most optimizations for Hive. What about your hardware - is it the same? Do you also have compression on Sybase?
Alternatively, you need to wait for Hive for interactive analytics (TEZ 0.8 + LLAP).
On 30 Dec 2015, at 13:47, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:
Hi Jorn,
Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on TEZ. I tried the query on the MR engine and it did not fare better. I also ran it without the STDDEV function and found that the function did not slow it down.
I tried a simple query as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows):
--
-- Get the total amount sold for each calendar month
--
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Now Sybase IQ comes back in around 30 seconds.
Started query at Dec 30 2015 08:14:33:399AM
(48 rows affected)
Finished query at Dec 30 2015 08:15:04:640AM
Whereas Hive with the following setting and running the same query
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
Comes back in
48 rows selected (1514.687 seconds)
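For reference, the two timings above can be put side by side with a few lines of Python. This is only a sketch of the arithmetic: the timestamps and the 1514.687 seconds are copied from the output quoted above, and the function name iq_elapsed_seconds is made up for illustration.

```python
from datetime import datetime

# Format of the Sybase IQ timestamps, e.g. "Dec 30 2015 08:14:33:399AM"
IQ_FORMAT = "%b %d %Y %I:%M:%S:%f%p"


def iq_elapsed_seconds(started, finished):
    """Elapsed wall-clock seconds between two IQ-style timestamps."""
    start = datetime.strptime(started, IQ_FORMAT)
    end = datetime.strptime(finished, IQ_FORMAT)
    return (end - start).total_seconds()


iq_seconds = iq_elapsed_seconds("Dec 30 2015 08:14:33:399AM",
                                "Dec 30 2015 08:15:04:640AM")
hive_seconds = 1514.687  # "48 rows selected (1514.687 seconds)"
ratio = hive_seconds / iq_seconds

print(f"Sybase IQ: {iq_seconds:.3f} s")
print(f"Hive:      {hive_seconds:.3f} s")
print(f"Hive / IQ: {ratio:.1f}x")
```

So for this join query Hive takes a bit under 50 times as long as IQ on the same host.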
I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
+--------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------+--+
| CREATE TABLE `times`( |
| `time_id` timestamp, |
| `day_name` varchar(9), |
| `day_number_in_week` int, |
| `day_number_in_month` int, |
| `calendar_week_number` int, |
| `fiscal_week_number` int, |
| `week_ending_day` timestamp, |
| `week_ending_day_id` bigint, |
| `calendar_month_number` int, |
| `fiscal_month_number` int, |
| `calendar_month_desc` varchar(8), |
----------
| `days_in_fis_year` bigint, |
| `end_of_cal_year` timestamp, |
| `end_of_fis_year` timestamp) |
| CLUSTERED BY ( |
| time_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1826', |
| 'orc.bloom.filter.columns'='TIME_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='0', |
| 'totalSize'='11155', |
| 'transient_lastDdlTime'='1451429900') |
;
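A possible reason the bucket map join settings above have little effect: as far as I know, Hive's bucket map join needs both tables bucketed on the join column. Here times is CLUSTERED BY (time_id) alone, while sales is CLUSTERED BY five columns, so a sales row's bucket depends on all five values and not on time_id only. The Python sketch below illustrates that with a toy hash. The function toy_bucket is NOT Hive's real bucketing hash, the two rows are invented, and integers stand in for the timestamp values.

```python
NUM_BUCKETS = 256  # both tables are declared INTO 256 BUCKETS


def toy_bucket(key_values, num_buckets=NUM_BUCKETS):
    """Toy stand-in for a bucketing hash; NOT Hive's real hash function."""
    h = 0
    for value in key_values:
        h = (31 * h + value) & 0x7FFFFFFF
    return h % num_buckets


# Two hypothetical sales rows that share the same time_id.
# Column order: prod_id, cust_id, time_id, channel_id, promo_id
row_a = (1, 10, 20151230, 3, 5)
row_b = (2, 11, 20151230, 3, 5)

# Bucketing on all five columns, as the sales table is declared
composite_a = toy_bucket(row_a)
composite_b = toy_bucket(row_b)

# Bucketing on time_id only, as the times table is declared
by_time_a = toy_bucket([row_a[2]])
by_time_b = toy_bucket([row_b[2]])

print("bucketed on all five columns:", composite_a, composite_b)
print("bucketed on time_id only:    ", by_time_a, by_time_b)
```

With the composite key, rows that share a time_id can land in different buckets, so a given bucket of sales does not line up with one bucket of times. Bucketing on time_id alone keeps them together.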
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 08:28
To: user@hive.apache.org <ma...@hive.apache.org>
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Have you tried it with Hive on TEZ? It (currently) contains more optimizations than Hive on Spark.
I assume you use the latest Hive version.
Additionally, you may want to think about calculating statistics (depending on your configuration you may need to trigger this) - I am not sure if Spark can use them.
I am not sure if bloom filters on the columns you mention make sense. You may also want to increase the stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You may also want to sort the data by customer ID in the table.
You also seem to have a lot of reducers, which you may want to decrease.
Have you tried without "having stddev_samp"? Is the query exactly the same as in Sybase?
On 29 Dec 2015, at 11:53, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:
Hi,
I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion rows as follows:
show create table sales;
+-------------------------------------------------------------------------------+--+
| createtab_stmt |
+-------------------------------------------------------------------------------+--+
| CREATE TABLE `sales`( |
| `prod_id` bigint, |
| `cust_id` bigint, |
| `time_id` timestamp, |
| `channel_id` bigint, |
| `promo_id` bigint, |
| `quantity_sold` decimal(10,0), |
| `amount_sold` decimal(10,0)) |
| CLUSTERED BY ( |
| prod_id, |
| cust_id, |
| time_id, |
| channel_id, |
| promo_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hduser', |
| 'last_modified_time'='1451305626', |
| 'numFiles'='11', |
| 'numRows'='1000000000', |
| 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='296000000000', |
| 'totalSize'='2678882153', |
| 'transient_lastDdlTime'='1451305626') |
+-------------------------------------------------------------------------------+--+
I run the following query in Hive, against the sales table only:
SELECT
rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
FROM
(
SELECT cust_id AS Customer_ID,
COUNT(amount_sold) AS Number_of_orders,
SUM(amount_sold) AS Total_customer_amount,
AVG(amount_sold) AS Average_order,
stddev_samp(amount_sold) AS Standard_deviation
FROM sales
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < stddev_samp(amount_sold)
) rs
ORDER BY
-- Total_customer_amount DESC
3 DESC
Hive comes back in 17 minutes with 5,948 rows
bl -f sales.hql > sales.log
Connecting to jdbc:hive2://rhes564:10010/default
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
No rows affected (0.097 seconds)
No rows affected (0.001 seconds)
No rows affected (0.001 seconds)
No rows affected (0.038 seconds)
INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
INFO :
Query Hive on Spark job[0] stages:
INFO : 0
INFO : 1
INFO : 2
INFO :
Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
INFO : Status: Finished successfully in 1036.00 seconds
5,948 rows selected (1074.817 seconds)
So it returns 5948 rows in 17 minutes. In contrast, IQ returns 5947 rows in 23 seconds.
Sybase IQ is a columnar database so each column is created as a fast projection index by default. In addition I have created LF (bitmap) indexes on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now the query only touches CUST_ID.
My suspicion is that the standard deviation function stddev_samp() could be the bottleneck.
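One way to test that suspicion is to look at where the time goes in the progress log above. The Python sketch below takes four timestamps copied from that log (the first progress line and the lines where each stage is first reported as Finished) and computes the time per stage. The helper name seconds_between is made up for illustration, and the first progress line is used as an approximation of the Stage-0 start.

```python
from datetime import datetime

# Format of the Hive on Spark progress timestamps, e.g. "2015-12-29 09:33:25,815"
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Elapsed seconds between two progress-log timestamps."""
    t0 = datetime.strptime(start, LOG_FORMAT)
    t1 = datetime.strptime(end, LOG_FORMAT)
    return (t1 - t0).total_seconds()


first_progress_line = "2015-12-29 09:33:25,815"  # Stage-0_0: 0/11
stage0_finished = "2015-12-29 09:50:10,606"      # Stage-0_0: 11/11 Finished
stage1_finished = "2015-12-29 09:50:37,768"      # Stage-1_0: 1009/1009 Finished
stage2_finished = "2015-12-29 09:50:38,771"      # Stage-2_0: 1/1 Finished

stage0_seconds = seconds_between(first_progress_line, stage0_finished)
stage1_seconds = seconds_between(stage0_finished, stage1_finished)
stage2_seconds = seconds_between(stage1_finished, stage2_finished)
total_seconds = stage0_seconds + stage1_seconds + stage2_seconds
stage0_share = stage0_seconds / total_seconds

print(f"Stage-0: {stage0_seconds:.3f} s")
print(f"Stage-1: {stage1_seconds:.3f} s")
print(f"Stage-2: {stage2_seconds:.3f} s")
print(f"Stage-0 share of logged time: {stage0_share:.1%}")
```

Nearly all of the logged time is spent in Stage-0, which has only 11 tasks and, going by the (+2) field, runs about two of them at a time. The 1009 tasks of Stage-1 finish in under half a minute. This does not by itself rule out stddev_samp(), since part of the aggregation may happen in Stage-0, but it does suggest looking at the first stage and its parallelism.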
Thanks
Mich Talebzadeh
Information transmitted by this e-mail is proprietary to Mphasis, its associated companies and/ or its customers and is intended
for use only by the individual or entity to which it is addressed, and may contain information that is privileged, confidential or
exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded
to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly
prohibited. In such cases, please notify us immediately at mailmaster@mphasis.com <ma...@mphasis.com> and delete this mail from your records.
RE: Running the same query on 1 billion rows fact table in Hive on
Spark compared to Sybase IQ columnar database
Posted by Lohith Samaga M <Lo...@mphasis.com>.
Hi Mich,
Were you able to run SQL queries using the web UI?
More details on the Drill shell can be found here <https://drill.apache.org/docs/configuring-the-drill-shell/>.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Lohith Samaga M [mailto:Lohith.Samaga@mphasis.com]
Sent: Friday, January 01, 2016 10.32
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hi Mich,
I believe you have configured Drill to access Hive using its web UI.
I used the Drill web UI to run SQL queries. I have not used the command line.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 16.53
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks for the tip
Installed Apache drill and need to access hive ☺
hduser@rhes564::/usr/lib/apache-drill-1.4.0> bin/drill-embedded
/work/tmp/libnetty-transport-native-epoll2451215308710744204.so: /lib64/libc.so.6: version `GLIBC_2.10' not found (required by /work/tmp/libnetty-transport-native-epoll2451215308710744204.so)
apache drill 1.4.0
"what ever the mind of man can conceive and believe, drill can query"
0: jdbc:drill:zk=local>
Trying to connect to Hive
Usage: connect <url> <username> <password> [driver]
0: jdbc:drill:zk=local> connect //rhes564:10010/default hduser xxxx org.apache.hive.jdbc.HiveDriver ;
No current connection
Any ideas?
Thanks
Mich Talebzadeh
From: Lohith Samaga M [mailto:Lohith.Samaga@mphasis.com]
Sent: 31 December 2015 10:47
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
HI Mich,
Did you try the Apache Drill SQL engine?
It can access HIVE data.
I have seen it to be much faster. But, I have not tested at your scale.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 02.04
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks Marcin
Trying to build TEZ 0.7 in
/usr/lib/apache-tez-0.7.0-src
using
mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
with mvn version 3.2.5 (as opposed to 3.3) as I read that I can build it OK with 3.2.5 following the same error ass below
mvn --version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
Maven home: /usr/local/apache-maven/apache-maven-3.2.5
Java version: 1.7.0_25, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_25/jre
I get this error
[INFO] tez-ui ............................................. FAILURE [ 0.411 s]
[
DEBUG] -- end configuration --
[INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
[INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
any ideas as there is little info available in net.
Thanks
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 30 December 2015 19:27
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
I'm using TEZ 0.7.0.2.3 with hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR in pretty much all cases. Also, with hive, you'll make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk>> wrote:
Thanks again Jorn.
Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column. For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates FP (fast projection) indexes on every column by default.
Anyway I am trying to download and build TEZ. Do we know which version of TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
Thanks
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com<http://talebzadehmich.wordpress.com/>
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Jörn Franke [mailto:jornfranke@gmail.com<ma...@gmail.com>]
Sent: 30 December 2015 16:29
To: user@hive.apache.org<ma...@hive.apache.org>
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hmm i think the execution Engine TEZ has (currently) the most optimizations on Hive. What about your hardware - is it the same? Do you have also compression on Sybase?
Alternatively you need to wait for Hive for interactive analytics (tez 0.8 + llap).
On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk>> wrote:
Hi Jorn,
Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on TEZ. I tried the query on MR engine and it did nor fair better. I also ran it without SDDDEV function and found out that the function did not slow it down.
I tried a simple query as follows builr in sales FACT table 1e9 rows and dimension table times (1826 rows)
--
-- Get the total amount sold for each calendar month
--
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Now Sybase IQ comes back in around 30 seconds.
Started query at Dec 30 2015 08:14:33:399AM
(48 rows affected)
Finished query at Dec 30 2015 08:15:04:640AM
Whereas Hive with the following setting and running the same query
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
Comes back in
48 rows selected (1514.687 seconds)
I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
+--------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------+--+
| CREATE TABLE `times`( |
| `time_id` timestamp, |
| `day_name` varchar(9), |
| `day_number_in_week` int, |
| `day_number_in_month` int, |
| `calendar_week_number` int, |
| `fiscal_week_number` int, |
| `week_ending_day` timestamp, |
| `week_ending_day_id` bigint, |
| `calendar_month_number` int, |
| `fiscal_month_number` int, |
| `calendar_month_desc` varchar(8), |
----------
| `days_in_fis_year` bigint, |
| `end_of_cal_year` timestamp, |
| `end_of_fis_year` timestamp) |
| CLUSTERED BY ( |
| time_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1826', |
| 'orc.bloom.filter.columns'='TIME_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='0', |
| 'totalSize'='11155', |
| 'transient_lastDdlTime'='1451429900') |
;
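A possible alternative to the bucket map join settings used above, offered as a sketch rather than a tested fix: since times has only 1826 rows (about 11 KB on disk per the totalSize above), Hive can usually broadcast it to every task as an ordinary map join. As far as I understand, a bucket map join needs both tables bucketed on the join key alone, and sales is bucketed on five columns, so it may not qualify. The size threshold below is an illustrative value, not a recommendation.

```sql
-- Let Hive convert joins against small tables into map-side (broadcast) joins.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Threshold in bytes for the small side; illustrative value only.
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Same query as in the thread, written with explicit JOIN syntax.
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s
JOIN times t ON s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
```

Running EXPLAIN on the query before and after should show whether a Map Join Operator replaces the shuffle join.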
http://talebzadehmich.wordpress.com/
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 08:28
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Have you tried it with Hive on TEZ? It (currently) contains more optimizations than Hive on Spark.
I assume you use the latest Hive version.
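As a minimal sketch of that suggestion, and assuming Tez is already installed and configured for this Hive instance (which the thread has not confirmed), the engine can be switched per session and the same query re-run for comparison:

```sql
-- Assumption: Tez libraries are installed and visible to HiveServer2.
-- Valid values in Hive 1.2.x are mr, tez and spark.
set hive.execution.engine=tez;

-- Print the current value to confirm the switch took effect.
set hive.execution.engine;
```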
Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
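A sketch of what triggering statistics could look like for the sales table. The column list is an assumption based on the columns the query actually touches, and whether Hive on Spark makes use of these statistics is left open, as noted above.

```sql
-- Basic table-level statistics (row count, number of files, sizes).
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Column-level statistics for the columns used by the query.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount_sold;

-- Let the optimizer consult the gathered statistics.
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
```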
I am not sure if bloom filters on the columns you mention make sense. You may also want to increase stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You also may want to sort the data by customer Id in the table.
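One way this could be tried, as a sketch only: build a copy of the table bucketed and sorted on cust_id alone. The table name sales_by_cust is hypothetical, the bucket count is simply carried over from the existing DDL, and nothing here has been measured against the 1 billion row table.

```sql
-- Hypothetical table name; columns copied from the sales DDL in this thread.
CREATE TABLE sales_by_cust (
  prod_id        bigint,
  cust_id        bigint,
  time_id        timestamp,
  channel_id     bigint,
  promo_id       bigint,
  quantity_sold  decimal(10,0),
  amount_sold    decimal(10,0)
)
CLUSTERED BY (cust_id) SORTED BY (cust_id ASC) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- In Hive 1.x these must be on for inserts to honour the bucketing and sort spec.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

INSERT OVERWRITE TABLE sales_by_cust
SELECT prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold
FROM sales;
```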
You also seem to have a lot of reducers, which you may want to decrease.
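For reference, the job progress in the original post shows 1009 tasks in Stage-1, which, as far as I am aware, matches the default cap hive.exec.reducers.max in Hive 1.2. A sketch of how the reducer count could be lowered; the numbers are illustrative assumptions, not tuned values:

```sql
-- Upper bound on the number of reducers Hive will plan.
set hive.exec.reducers.max=64;

-- Input bytes handled per reducer; a larger value means fewer reducers (1 GB here).
set hive.exec.reducers.bytes.per.reducer=1073741824;

-- Alternatively, fix the reducer count outright instead of letting Hive estimate it.
-- set mapreduce.job.reduces=32;
```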
Have you tried without "having stddev_samp" ? Is the query exactly the same as in Sybase?
On 29 Dec 2015, at 11:53, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Hi,
I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion rows as follows:
show create table sales;
+-------------------------------------------------------------------------------+--+
| createtab_stmt |
+-------------------------------------------------------------------------------+--+
| CREATE TABLE `sales`( |
| `prod_id` bigint, |
| `cust_id` bigint, |
| `time_id` timestamp, |
| `channel_id` bigint, |
| `promo_id` bigint, |
| `quantity_sold` decimal(10,0), |
| `amount_sold` decimal(10,0)) |
| CLUSTERED BY ( |
| prod_id, |
| cust_id, |
| time_id, |
| channel_id, |
| promo_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hduser', |
| 'last_modified_time'='1451305626', |
| 'numFiles'='11', |
| 'numRows'='1000000000', |
| 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='296000000000', |
| 'totalSize'='2678882153', |
| 'transient_lastDdlTime'='1451305626') |
+-------------------------------------------------------------------------------+--+
I run the following query against the sales table only, in Hive:
SELECT
rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
FROM
(
SELECT cust_id AS Customer_ID,
COUNT(amount_sold) AS Number_of_orders,
SUM(amount_sold) AS Total_customer_amount,
AVG(amount_sold) AS Average_order,
stddev_samp(amount_sold) AS Standard_deviation
FROM sales
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < stddev_samp(amount_sold)
) rs
ORDER BY
-- Total_customer_amount DESC
3 DESC
Hive comes back in 17 minutes with 5,948 rows
bl -f sales.hql > sales.log
Connecting to jdbc:hive2://rhes564:10010/default
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
No rows affected (0.097 seconds)
No rows affected (0.001 seconds)
No rows affected (0.001 seconds)
No rows affected (0.038 seconds)
INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
INFO :
Query Hive on Spark job[0] stages:
INFO : 0
INFO : 1
INFO : 2
INFO :
Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
INFO : Status: Finished successfully in 1036.00 seconds
5,948 rows selected (1074.817 seconds)
So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 seconds
Sybase IQ is a columnar database so each column is created as a fast projection index by default. In addition I have created LF (bitmap) indexes on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now the query only touches CUST_ID.
My suspicion is that it is the Standard Deviation function stddev_samp() that could be the bottleneck?
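A side note on the warning in the log above: it says that the position alias is ignored when hive.groupby.orderby.position.alias is false, so ORDER BY 3 may be treated as a constant rather than as the third column. A sketch of the same query ordered by the column alias, which does not depend on that setting:

```sql
SELECT
  rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
FROM
(
  SELECT cust_id AS Customer_ID,
         COUNT(amount_sold) AS Number_of_orders,
         SUM(amount_sold) AS Total_customer_amount,
         AVG(amount_sold) AS Average_order,
         stddev_samp(amount_sold) AS Standard_deviation
  FROM sales
  GROUP BY cust_id
  HAVING SUM(amount_sold) > 94000
  AND AVG(amount_sold) < stddev_samp(amount_sold)
) rs
ORDER BY rs.Total_customer_amount DESC;

-- Alternatively, keep ORDER BY 3 and enable positional aliases for the session:
-- set hive.groupby.orderby.position.alias=true;
```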
Thanks
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com/
Information transmitted by this e-mail is proprietary to Mphasis, its associated companies and/ or its customers and is intended
for use only by the individual or entity to which it is addressed, and may contain information that is privileged, confidential or
exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded
to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly
prohibited. In such cases, please notify us immediately at mailmaster@mphasis.com and delete this mail from your records.
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Lohith Samaga M <Lo...@mphasis.com>.
Hi Mich,
I believe you have configured Drill using its web UI to access Hive.
I used Drill web UI to run SQLs. I have not used command line.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 16.53
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks for the tip
Installed Apache drill and need to access hive ☺
hduser@rhes564::/usr/lib/apache-drill-1.4.0> bin/drill-embedded
/work/tmp/libnetty-transport-native-epoll2451215308710744204.so: /lib64/libc.so.6: version `GLIBC_2.10' not found (required by /work/tmp/libnetty-transport-native-epoll2451215308710744204.so)
apache drill 1.4.0
"what ever the mind of man can conceive and believe, drill can query"
0: jdbc:drill:zk=local>
Trying to connect to Hive
Usage: connect <url> <username> <password> [driver]
0: jdbc:drill:zk=local> connect //rhes564:10010/default hduser xxxx org.apache.hive.jdbc.HiveDriver ;
No current connection
Any ideas?
Thanks
Mich Talebzadeh
From: Lohith Samaga M [mailto:Lohith.Samaga@mphasis.com]
Sent: 31 December 2015 10:47
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hi Mich,
Did you try the Apache Drill SQL engine?
It can access HIVE data.
I have seen it to be much faster. But, I have not tested at your scale.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 02.04
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks Marcin
Trying to build TEZ 0.7 in
/usr/lib/apache-tez-0.7.0-src
using
mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
with mvn version 3.2.5 (as opposed to 3.3), as I read that it builds OK with 3.2.5, but I still get the error shown below.
mvn --version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
Maven home: /usr/local/apache-maven/apache-maven-3.2.5
Java version: 1.7.0_25, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_25/jre
I get this error
[INFO] tez-ui ............................................. FAILURE [ 0.411 s]
[DEBUG] -- end configuration --
[INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
[INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
Any ideas? There is little information available on the net.
Thanks
Mich Talebzadeh
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 30 December 2015 19:27
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR in pretty much all cases. Also, with Hive, you'll want to make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Thanks again Jorn.
Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column. For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates FP (fast projection) indexes on every column by default.
Anyway I am trying to download and build TEZ. Do we know which version of TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
Thanks
Mich Talebzadeh
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 16:29
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hmm, I think the TEZ execution engine (currently) has the most optimizations in Hive. What about your hardware: is it the same? Do you also have compression on Sybase?
Alternatively, you need to wait for Hive for interactive analytics (Tez 0.8 + LLAP).
On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Hi Jorn,
Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on TEZ. I tried the query on the MR engine and it did not fare any better. I also ran it without the STDDEV function and found that the function did not slow it down.
I tried a simple query as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows):
--
-- Get the total amount sold for each calendar month
--
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Now Sybase IQ comes back in around 30 seconds.
Started query at Dec 30 2015 08:14:33:399AM
(48 rows affected)
Finished query at Dec 30 2015 08:15:04:640AM
Whereas Hive, with the following settings and running the same query,
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
comes back in
48 rows selected (1514.687 seconds)
I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
+--------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------+--+
| CREATE TABLE `times`( |
| `time_id` timestamp, |
| `day_name` varchar(9), |
| `day_number_in_week` int, |
| `day_number_in_month` int, |
| `calendar_week_number` int, |
| `fiscal_week_number` int, |
| `week_ending_day` timestamp, |
| `week_ending_day_id` bigint, |
| `calendar_month_number` int, |
| `fiscal_month_number` int, |
| `calendar_month_desc` varchar(8), |
----------
| `days_in_fis_year` bigint, |
| `end_of_cal_year` timestamp, |
| `end_of_fis_year` timestamp) |
| CLUSTERED BY ( |
| time_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1826', |
| 'orc.bloom.filter.columns'='TIME_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='0', |
| 'totalSize'='11155', |
| 'transient_lastDdlTime'='1451429900') |
;
http://talebzadehmich.wordpress.com/
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 08:28
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Have you tried it with Hive on TEZ? It (currently) contains more optimizations than Hive on Spark.
I assume you use the latest Hive version.
Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
I am not sure if bloom filters on the columns you mention make sense. You may also want to increase stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You also may want to sort the data by customer Id in the table.
You also seem to have a lot of reducers, which you may want to decrease.
Have you tried without "having stddev_samp" ? Is the query exactly the same as in Sybase?
On 29 Dec 2015, at 11:53, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Hi,
I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion rows as follows:
show create table sales;
+-------------------------------------------------------------------------------+--+
| createtab_stmt |
+-------------------------------------------------------------------------------+--+
| CREATE TABLE `sales`( |
| `prod_id` bigint, |
| `cust_id` bigint, |
| `time_id` timestamp, |
| `channel_id` bigint, |
| `promo_id` bigint, |
| `quantity_sold` decimal(10,0), |
| `amount_sold` decimal(10,0)) |
| CLUSTERED BY ( |
| prod_id, |
| cust_id, |
| time_id, |
| channel_id, |
| promo_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hduser', |
| 'last_modified_time'='1451305626', |
| 'numFiles'='11', |
| 'numRows'='1000000000', |
| 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='296000000000', |
| 'totalSize'='2678882153', |
| 'transient_lastDdlTime'='1451305626') |
+-------------------------------------------------------------------------------+--+
I run the following query against the sales table only, in Hive:
SELECT
rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
FROM
(
SELECT cust_id AS Customer_ID,
COUNT(amount_sold) AS Number_of_orders,
SUM(amount_sold) AS Total_customer_amount,
AVG(amount_sold) AS Average_order,
stddev_samp(amount_sold) AS Standard_deviation
FROM sales
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < stddev_samp(amount_sold)
) rs
ORDER BY
-- Total_customer_amount DESC
3 DESC
Hive comes back in 17 minutes with 5,948 rows
bl -f sales.hql > sales.log
Connecting to jdbc:hive2://rhes564:10010/default
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
No rows affected (0.097 seconds)
No rows affected (0.001 seconds)
No rows affected (0.001 seconds)
No rows affected (0.038 seconds)
INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
INFO :
Query Hive on Spark job[0] stages:
INFO : 0
INFO : 1
INFO : 2
INFO :
Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
INFO : Status: Finished successfully in 1036.00 seconds
5,948 rows selected (1074.817 seconds)
So Hive returns 5,948 rows in about 17 minutes. In contrast, IQ returns 5,947 rows in 23 seconds.
Sybase IQ is a columnar database, so each column is created as a fast projection index by default. In addition, I have created LF (bitmap) indexes on the dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Of these, the query only touches CUST_ID.
My suspicion is that the standard deviation function stddev_samp() could be the bottleneck.
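One way to check this suspicion is to time the same aggregation with the stddev_samp() terms removed. The following is only a sketch against the sales table described in this thread. Dropping the second HAVING condition changes the row count, so only the elapsed time is comparable, not the result set.

```sql
-- Same GROUP BY as the original query, minus stddev_samp().
-- Only the elapsed time is comparable with the original run:
-- the HAVING clause loses its second condition, so more rows can qualify.
SELECT cust_id            AS Customer_ID,
       COUNT(amount_sold) AS Number_of_orders,
       SUM(amount_sold)   AS Total_customer_amount,
       AVG(amount_sold)   AS Average_order
FROM sales
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
ORDER BY Total_customer_amount DESC;
```

In the progress lines shown above, Stage-0 (11 tasks, with at most two running at a time) does not finish until 09:50:10, and the remaining two stages complete within about 30 seconds. Most of the 1036 seconds therefore appears to be spent in the first stage, which reads the table with little parallelism.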
Thanks
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
Information transmitted by this e-mail is proprietary to Mphasis, its associated companies and/ or its customers and is intended
for use only by the individual or entity to which it is addressed, and may contain information that is privileged, confidential or
exempt from disclosure under applicable law. If you are not the intended recipient or it appears that this mail has been forwarded
to you without proper authority, you are notified that any use or dissemination of this information in any manner is strictly
prohibited. In such cases, please notify us immediately at mailmaster@mphasis.com and delete this mail from your records.
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks for the tip.
I have installed Apache Drill and now need to access Hive :)
hduser@rhes564::/usr/lib/apache-drill-1.4.0> bin/drill-embedded
/work/tmp/libnetty-transport-native-epoll2451215308710744204.so: /lib64/libc.so.6: version `GLIBC_2.10' not found (required by /work/tmp/libnetty-transport-native-epoll2451215308710744204.so)
apache drill 1.4.0
"what ever the mind of man can conceive and believe, drill can query"
0: jdbc:drill:zk=local>
Trying to connect to Hive
Usage: connect <url> <username> <password> [driver]
0: jdbc:drill:zk=local> connect //rhes564:10010/default hduser xxxx org.apache.hive.jdbc.HiveDriver ;
No current connection
Any ideas?
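For reference, Drill normally reaches Hive through its Hive storage plugin, which talks to the Hive metastore, rather than through a HiveServer2 JDBC connection. The sketch below assumes a storage plugin named hive has already been enabled and pointed at the metastore. The plugin name is an assumption, and the database name is inferred from the sales table LOCATION quoted in this thread.

```sql
-- Run from the Drill shell once the (assumed) hive storage plugin is enabled
SHOW DATABASES;          -- Hive databases should be listed as hive.<database>
USE hive.oraclehadoop;   -- database name inferred from .../oraclehadoop.db/sales
SHOW TABLES;
SELECT COUNT(*) FROM sales;
```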
Thanks
Mich Talebzadeh
From: Lohith Samaga M [mailto:Lohith.Samaga@mphasis.com]
Sent: 31 December 2015 10:47
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hi Mich,
Did you try the Apache Drill SQL engine?
It can access Hive data.
I have seen it to be much faster, but I have not tested it at your scale.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 02.04
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks Marcin.
I am trying to build Tez 0.7 in
/usr/lib/apache-tez-0.7.0-src
using
mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
with mvn version 3.2.5 (as opposed to 3.3), as I read that it builds OK with 3.2.5, but I get the same error, shown below.
mvn --version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
Maven home: /usr/local/apache-maven/apache-maven-3.2.5
Java version: 1.7.0_25, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_25/jre
I get this error
[INFO] tez-ui ............................................. FAILURE [ 0.411 s]
[DEBUG] -- end configuration --
[INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
[INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
Any ideas? There is little information available on the net.
Thanks
Mich Talebzadeh
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 30 December 2015 19:27
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
I'm using Tez 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that Tez is much faster than MR in pretty much all cases. Also, with Hive, you'll want to make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
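As a rough sketch, these steps could look like the following in HiveQL. It assumes Tez is installed and configured for Hive. The table name sales_flat is hypothetical and stands for a non-bucketed ORC table, because Hive may refuse CONCATENATE on a bucketed table such as sales as defined in this thread.

```sql
-- Use Tez for this session (assumes Tez is installed and configured for Hive)
SET hive.execution.engine=tez;

-- Print the HDFS block size visible to the session,
-- to compare with the table's orc.stripe.size
SET dfs.blocksize;

-- Merge the small ORC files of a table at stripe level.
-- sales_flat is a hypothetical, non-bucketed ORC table.
ALTER TABLE sales_flat CONCATENATE;
```

The sales table in this thread already uses orc.stripe.size=268435456 (256 MB) and reports only 11 files, so the small files problem may not be the main issue in this particular case.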
On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mich@peridale.co.uk> wrote:
Thanks again Jorn.
Both Hive and Sybase IQ are running on the same host. Yes, for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read: bitmap) indexes on the time_id column. For the dimension table (times), I have time_id defined as the primary key. Sybase IQ also creates FP (fast projection) indexes on every column by default.
Anyway, I am trying to download and build Tez. Do we know which version of Tez works with Hive 1.2.1, please? 0.8 seems to be in alpha.
Thanks
Mich Talebzadeh
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 16:29
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hmm, I think the Tez execution engine (currently) has the most optimizations for Hive. What about your hardware: is it the same? Do you also have compression on Sybase?
Alternatively, you need to wait for Hive for interactive analytics (Tez 0.8 + LLAP).
On 30 Dec 2015, at 13:47, Mich Talebzadeh <mich@peridale.co.uk> wrote:
Hi Jorn,
Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on Tez. I tried the query on the MR engine and it did not fare better. I also ran it without the STDDEV function and found that the function did not slow it down.
I tried a simple query, as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows):
--
-- Get the total amount sold for each calendar month
--
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Now Sybase IQ comes back in around 30 seconds.
Started query at Dec 30 2015 08:14:33:399AM
(48 rows affected)
Finished query at Dec 30 2015 08:15:04:640AM
Whereas Hive, with the following settings and running the same query,
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
comes back in
48 rows selected (1514.687 seconds)
I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
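Because times is tiny (1826 rows, about 11 KB on disk), one option that might help is to let Hive broadcast it as a map-side join instead of relying on bucket joins. The sales table is clustered on five columns rather than on time_id alone, so the bucket map join settings above may not take effect. A sketch under those assumptions, with an illustrative size threshold:

```sql
-- Let Hive convert joins against small tables into map joins
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
-- Threshold in bytes for the small side; 10000000 is an illustrative value
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- Same query written with an explicit JOIN
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s
JOIN times t ON s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
```

Prefixing the query with EXPLAIN shows whether a Map Join Operator was actually chosen for the plan.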
+--------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------+--+
| CREATE TABLE `times`( |
| `time_id` timestamp, |
| `day_name` varchar(9), |
| `day_number_in_week` int, |
| `day_number_in_month` int, |
| `calendar_week_number` int, |
| `fiscal_week_number` int, |
| `week_ending_day` timestamp, |
| `week_ending_day_id` bigint, |
| `calendar_month_number` int, |
| `fiscal_month_number` int, |
| `calendar_month_desc` varchar(8), |
----------
| `days_in_fis_year` bigint, |
| `end_of_cal_year` timestamp, |
| `end_of_fis_year` timestamp) |
| CLUSTERED BY ( |
| time_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1826', |
| 'orc.bloom.filter.columns'='TIME_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='0', |
| 'totalSize'='11155', |
| 'transient_lastDdlTime'='1451429900') |
;
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 08:28
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Have you tried it with Hive on Tez? It (currently) contains more optimizations than Hive on Spark.
I assume you use the latest Hive version.
Additionally, you may want to think about calculating statistics (depending on your configuration, you need to trigger this yourself). I am not sure if Spark can use them.
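A minimal sketch of triggering statistics collection by hand, assuming the sales table from this thread (the column list is simply the columns the query uses). The table properties already show COLUMN_STATS_ACCURATE=true, so basic statistics may be present already.

```sql
-- Basic table statistics (row count, raw data size, ...)
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Column statistics for the columns used by the query
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount_sold;

-- Allow the optimizer to use the collected statistics
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;
```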
I am not sure the bloom filters on the columns you mention make sense. You may also want to increase the stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You may also want to sort the data by customer ID in the table.
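To illustrate, a copy of the table bucketed and sorted on the customer key alone might look like the following. The table name sales_by_cust is hypothetical, and the bucket count is simply carried over from the original DDL rather than tuned.

```sql
-- Needed in Hive 1.x so that inserts honour the bucketing and sort order
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- sales_by_cust is a hypothetical name for the re-organised copy
CREATE TABLE sales_by_cust (
  prod_id       bigint,
  cust_id       bigint,
  time_id       timestamp,
  channel_id    bigint,
  promo_id      bigint,
  quantity_sold decimal(10,0),
  amount_sold   decimal(10,0))
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Rewrite the data into the new layout
INSERT OVERWRITE TABLE sales_by_cust
SELECT prod_id, cust_id, time_id, channel_id, promo_id,
       quantity_sold, amount_sold
FROM sales;
```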
You also seem to have a lot of reducers, which you may want to decrease.
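The 1009 tasks in Stage-1 look like Hive's default ceiling on reducers (hive.exec.reducers.max) rather than a deliberate choice. A sketch of capping it for the session follows. The numbers are illustrative, not tuned values.

```sql
-- Upper bound on the number of reducers Hive will plan
SET hive.exec.reducers.max=64;
-- Input bytes handled per reducer when Hive estimates the count (256 MB here)
SET hive.exec.reducers.bytes.per.reducer=268435456;
-- Alternatively, force an exact reducer count
-- SET mapreduce.job.reduces=32;
```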
Have you tried without "having stddev_samp"? Is the query exactly the same as in Sybase?
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Lohith Samaga M <Lo...@mphasis.com>.
Hi Mich,
Did you try the Apache Drill SQL engine?
It can access Hive data.
I have seen it to be much faster, but I have not tested it at your scale.
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Mich Talebzadeh [mailto:mich@peridale.co.uk]
Sent: Thursday, December 31, 2015 02.04
To: user@hive.apache.org
Subject: RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Thanks Marcin.
I am trying to build Tez 0.7 in
/usr/lib/apache-tez-0.7.0-src
using
mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
with mvn version 3.2.5 (as opposed to 3.3), as I read that it builds OK with 3.2.5, but I get the same error, shown below.
mvn --version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
Maven home: /usr/local/apache-maven/apache-maven-3.2.5
Java version: 1.7.0_25, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_25/jre
I get this error
[INFO] tez-ui ............................................. FAILURE [ 0.411 s]
[DEBUG] -- end configuration --
[INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
[INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
Any ideas? There is little information available on the net.
Thanks
Mich Talebzadeh
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 30 December 2015 19:27
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
I'm using Tez 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that Tez is much faster than MR in pretty much all cases. Also, with Hive, you'll want to make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Thanks again Jorn.
Both Hive and Sybase IQ are running on the same host. Yes, for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read: bitmap) indexes on the time_id column. For the dimension table (times), I have time_id defined as the primary key. Sybase IQ also creates FP (fast projection) indexes on every column by default.
Anyway, I am trying to download and build Tez. Do we know which version of Tez works with Hive 1.2.1, please? 0.8 seems to be in alpha.
Thanks
Mich Talebzadeh
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 16:29
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hmm, I think the TEZ execution engine (currently) has the most optimizations for Hive. What about your hardware - is it the same? Do you also have compression on Sybase?
Alternatively, you need to wait for Hive for interactive analytics (Tez 0.8 + LLAP).
On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
Hi Jorn,
Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on TEZ. I tried the query on the MR engine and it did not fare better. I also ran it without the STDDEV function and found that the function did not slow it down.
I tried a simple query as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows):
--
-- Get the total amount sold for each calendar month
--
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Now Sybase IQ comes back in around 30 seconds.
Started query at Dec 30 2015 08:14:33:399AM
(48 rows affected)
Finished query at Dec 30 2015 08:15:04:640AM
Whereas Hive, running the same query with the following settings:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
comes back in:
48 rows selected (1514.687 seconds)
I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
+--------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------+--+
| CREATE TABLE `times`( |
| `time_id` timestamp, |
| `day_name` varchar(9), |
| `day_number_in_week` int, |
| `day_number_in_month` int, |
| `calendar_week_number` int, |
| `fiscal_week_number` int, |
| `week_ending_day` timestamp, |
| `week_ending_day_id` bigint, |
| `calendar_month_number` int, |
| `fiscal_month_number` int, |
| `calendar_month_desc` varchar(8), |
----------
| `days_in_fis_year` bigint, |
| `end_of_cal_year` timestamp, |
| `end_of_fis_year` timestamp) |
| CLUSTERED BY ( |
| time_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1826', |
| 'orc.bloom.filter.columns'='TIME_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='0', |
| 'totalSize'='11155', |
| 'transient_lastDdlTime'='1451429900') |
;
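One thing that may be worth checking here, offered as a sketch rather than a confirmed fix: the times table is tiny (1826 rows, totalSize 11155 bytes), so an ordinary map join could be enough without relying on the bucket settings. Bucket map joins normally need both tables bucketed on the join key, whereas sales is bucketed on five columns; the sorted-merge variant additionally expects sorted buckets, and neither DDL in this thread has a SORTED BY clause. The snippet uses only the two tables from this thread and asks Hive for the plan, so it is cheap to run. Whether the plan actually shows a map join depends on the execution engine and the memory-related settings in use.

```sql
-- Allow Hive to convert the join to a map join when the small side fits in memory.
set hive.auto.convert.join=true;

-- Same monthly-totals query as earlier in this message, written with explicit
-- JOIN syntax. EXPLAIN prints the plan without running the query.
EXPLAIN
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s
JOIN times t ON s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
```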
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 08:28
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Have you tried it with Hive on TEZ? It contains (currently) more optimizations than Hive on Spark.
I assume you use the latest Hive version.
Additionally, you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
I am not sure if bloom filters on the columns you mention make sense. You may also want to increase the stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You may also want to sort the data by customer ID in the table.
You also seem to have a lot of reducers, which you may want to decrease.
Have you tried without "having stddev_samp"? Is the query exactly the same as in Sybase?
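The statistics and bucketing suggestions above could be tried along these lines. This is a sketch under stated assumptions, not a tested recipe: sales_by_cust is a hypothetical table name, the bucket count of 256 is simply copied from the existing DDL, and the two hive.enforce.* settings apply to Hive 1.x as used in this thread.

```sql
-- Gather table-level statistics, then column statistics for the columns
-- the customer query touches.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount_sold;

-- Hypothetical copy of the fact table, bucketed and sorted on cust_id only.
CREATE TABLE sales_by_cust (
  prod_id       bigint,
  cust_id       bigint,
  time_id       timestamp,
  channel_id    bigint,
  promo_id      bigint,
  quantity_sold decimal(10,0),
  amount_sold   decimal(10,0))
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- In Hive 1.x these make the insert honour the bucketing and sort order.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

-- Column order in sales matches the new table, so SELECT * lines up.
INSERT OVERWRITE TABLE sales_by_cust
SELECT * FROM sales;
```

Rewriting a billion-row table is not free, so the copy only pays off if queries grouped or filtered by cust_id are common.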
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks Marcin
Trying to build TEZ 0.7 in
/usr/lib/apache-tez-0.7.0-src
using
mvn -X clean package -DskipTests=true -Dmaven.javadoc.skip=true
with mvn version 3.2.5 (as opposed to 3.3), as I read that I can build it OK with 3.2.5. I still get the same error, as below.
mvn --version
Apache Maven 3.2.5 (12a6b3acb947671f09b81f49094c53f426d8cea1; 2014-12-14T17:29:23+00:00)
Maven home: /usr/local/apache-maven/apache-maven-3.2.5
Java version: 1.7.0_25, vendor: Oracle Corporation
Java home: /usr/java/jdk1.7.0_25/jre
I get this error
[INFO] tez-ui ............................................. FAILURE [ 0.411 s]
[DEBUG] -- end configuration --
[INFO] Running 'npm install --color=false' in /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp
[INFO] /usr/lib/apache-tez-0.7.0-src/tez-ui/src/main/webapp/node/with_new_path.sh: line 3: 23781 Aborted "$@"
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task: 'npm install --color=false' failed. (error code 134) -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.16:npm (npm install) on project tez-ui: Failed to run task
Any ideas? There is little info available on the net.
Thanks
Mich Talebzadeh
From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 30 December 2015 19:27
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is much faster than MR in pretty much all cases. Also, with Hive, you'll want to make sure you've performed optimizations like aligning ORC stripe sizes with HDFS block sizes, and concatenated your tables (not so much an optimization as a must for avoiding the small files problem).
On 29 Dec 2015, at 11:53, Mich Talebzadeh <mich@peridale.co.uk <ma...@peridale.co.uk> > wrote:
Hi,
I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion rows as follows:
show create table sales;
+-------------------------------------------------------------------------------+--+
| createtab_stmt |
+-------------------------------------------------------------------------------+--+
| CREATE TABLE `sales`( |
| `prod_id` bigint, |
| `cust_id` bigint, |
| `time_id` timestamp, |
| `channel_id` bigint, |
| `promo_id` bigint, |
| `quantity_sold` decimal(10,0), |
| `amount_sold` decimal(10,0)) |
| CLUSTERED BY ( |
| prod_id, |
| cust_id, |
| time_id, |
| channel_id, |
| promo_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hduser', |
| 'last_modified_time'='1451305626', |
| 'numFiles'='11', |
| 'numRows'='1000000000', |
| 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='296000000000', |
| 'totalSize'='2678882153 <tel:2678882153> ', |
| 'transient_lastDdlTime'='1451305626') |
+-------------------------------------------------------------------------------+--+
I use the following query to run against sales table only against Hive
SELECT
rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
FROM
(
SELECT cust_id AS Customer_ID,
COUNT(amount_sold) AS Number_of_orders,
SUM(amount_sold) AS Total_customer_amount,
AVG(amount_sold) AS Average_order,
stddev_samp(amount_sold) AS Standard_deviation
FROM sales
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < stddev_samp(amount_sold)
) rs
ORDER BY
-- Total_customer_amount DESC
3 DESC
Hive comes back in 17 minutes with 5,948 rows
bl -f sales.hql > sales.log
Connecting to jdbc:hive2://rhes564:10010/default
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
No rows affected (0.097 seconds)
No rows affected (0.001 seconds)
No rows affected (0.001 seconds)
No rows affected (0.038 seconds)
INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
INFO :
Query Hive on Spark job[0] stages:
INFO : 0
INFO : 1
INFO : 2
INFO :
Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
INFO : Status: Finished successfully in 1036.00 seconds
5,948 rows selected (1074.817 seconds)
So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 seconds
Sybase IQ is a columnar database so each column is created as a fast projection index by default. In addition I have created LF (bitmap) indexes on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now the query only touches CUST_ID.
My suspicion is that it is the Standard Deviation function stddev_samp() that could be the bottleneck?
Thanks
Mich Talebzadeh
Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com
NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Jörn Franke <jo...@gmail.com>.
Do both Sybase and Hive run on the same host? Maybe this is the reason.
I think with older versions of Spark, such as the one you have, predicate pushdown (for ORC and/or Parquet) does not work. This is another huge performance penalty. Additionally, many optimizations probably do not work. I think in most cases Hive on Tez makes the most sense; it should definitely be an option alongside Spark as an execution engine.
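As a rough illustration of the settings involved, the lines below switch the session to Tez and enable the standard Hive predicate pushdown properties. This is a sketch under assumptions: Tez must already be installed and configured for this Hive build, and nothing in this thread confirms how far these flags take effect under Hive on Spark 1.3.1. Note also that neither query discussed here has a WHERE filter on sales (the first filters in HAVING, after aggregation), so pushdown alone may not change much for them.

```sql
-- Assumes Tez is installed and configured for this Hive installation.
set hive.execution.engine=tez;
-- Predicate pushdown in the query plan.
set hive.optimize.ppd=true;
-- Lets the ORC reader use its row-group indexes when evaluating filters.
set hive.optimize.index.filter=true;
```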
That being said, there are a lot of other tuning possibilities that probably go too far for a mailing list exchange. For example, I saw performance improvements in certain scenarios when using JDK 8, and there are many other tuning options. You can also think about using Ignite, etc.
It is a complex ecosystem where you can do a lot ...
Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Marcin Tustin <mt...@handybook.com>.
I'm using TEZ 0.7.0.2.3 with Hive 1.2.1.2.3. I can confirm that TEZ is much
faster than MR in pretty much all cases. Also, with Hive, you'll want to make
sure you've performed optimizations like aligning ORC stripe sizes with HDFS
block sizes, and concatenated your tables (not so much an optimization as a
must for avoiding the small files problem).
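A minimal sketch of the two checks mentioned here, with caveats. The posted DDL for sales records a stripe size of 268435456 bytes (256 MB) and only 11 files, so the small files problem may not apply to it as it stands. As far as I know, Hive also refuses CONCATENATE on bucketed tables, so the second statement uses a hypothetical non-bucketed table name (sales_flat) rather than sales itself.

```sql
-- Show the ORC stripe size recorded for the table, to compare with the HDFS block size.
SHOW TBLPROPERTIES sales('orc.stripe.size');

-- Merge small ORC files into larger ones (ORC tables, Hive 0.14 and later).
-- sales_flat is a hypothetical, non-bucketed table used only for illustration.
ALTER TABLE sales_flat CONCATENATE;
```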
On Wed, Dec 30, 2015 at 2:19 PM, Mich Talebzadeh <mi...@peridale.co.uk>
wrote:
> Thanks again Jorn.
>
>
>
>
>
> Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I
> have compression enabled. The FACT table in IQ (sales) has LF (read bitmap)
> indexes on the time_id column. For the dimension table (times) I have
> time_id defined as primary key. Also Sybase IQ creates FP (fast projection)
> indexes on every column by default.
>
>
>
> Anyway I am trying to download and build TEZ. Do we know which version of
> TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
>
>
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com]
> *Sent:* 30 December 2015 16:29
>
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
>
> Hmm, I think the TEZ execution engine has (currently) the most
> optimizations on Hive. What about your hardware: is it the same? Do you
> also have compression on Sybase?
>
> Alternatively, you need to wait for Hive for interactive analytics (Tez 0.8
> + LLAP).
>
>
> On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>
> Hi Jorn,
>
>
>
> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not
> tried it on TEZ. I tried the query on the MR engine and it did not fare
> better. I also ran it without the STDDEV function and found that the
> function did not slow it down.
>
>
>
> I tried a simple query as follows, built on the sales FACT table (1e9 rows)
> and the dimension table times (1826 rows):
>
>
>
> --
>
> -- Get the total amount sold for each calendar month
>
> --
>
> SELECT t.calendar_month_desc, SUM(s.amount_sold)
>
> FROM sales s, times t WHERE s.time_id = t.time_id
>
> GROUP BY t.calendar_month_desc;
>
>
>
> Now Sybase IQ comes back in around 30 seconds.
>
>
>
> Started query at Dec 30 2015 08:14:33:399AM
>
> (48 rows affected)
>
> Finished query at Dec 30 2015 08:15:04:640AM
>
>
>
> Whereas Hive with the following setting and running the same query
>
>
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>
> set hive.optimize.bucketmapjoin=true;
>
> set hive.optimize.bucketmapjoin.sortedmerge=true;
>
>
>
> Comes back in
>
>
>
> 48 rows selected (1514.687 seconds)
>
>
>
> I don’t know what else can be done. Obviously this is all schema on read
> so I am not sure I can change bucketing on FACT table based on one query
> alone!
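A hedged aside on the join settings quoted here: bucket map join and sort-merge bucket join generally require both tables to be bucketed (and, for sort-merge, sorted) on the join key. The posted DDL buckets sales on five columns and declares no SORTED BY on either table, so those settings may not take effect. Since times is tiny (1826 rows, totalSize 11155 bytes), an ordinary map-side join is the more obvious candidate. The sketch below uses standard Hive properties; whether it helps on this cluster is an assumption, not something measured in this thread.

```sql
-- Let Hive convert the join to a map-side (broadcast) join when one side is small.
set hive.auto.convert.join=true;
-- Size threshold in bytes for the small side; 25000000 is the documented default.
set hive.mapjoin.smalltable.filesize=25000000;

-- Same query as in the message, written with explicit JOIN syntax.
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s JOIN times t ON s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
```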
>
>
>
>
>
>
>
> +--------------------------------------------------------------------+--+
>
> | createtab_stmt |
>
> +--------------------------------------------------------------------+--+
>
> | CREATE TABLE `times`( |
>
> | `time_id` timestamp, |
>
> | `day_name` varchar(9), |
>
> | `day_number_in_week` int, |
>
> | `day_number_in_month` int, |
>
> | `calendar_week_number` int, |
>
> | `fiscal_week_number` int, |
>
> | `week_ending_day` timestamp, |
>
> | `week_ending_day_id` bigint, |
>
> | `calendar_month_number` int, |
>
> | `fiscal_month_number` int, |
>
> | `calendar_month_desc` varchar(8), |
>
> ----------
>
> | `days_in_fis_year` bigint, |
>
> | `end_of_cal_year` timestamp, |
>
> | `end_of_fis_year` timestamp) |
>
> | CLUSTERED BY ( |
>
> | time_id) |
>
> | INTO 256 BUCKETS |
>
> | ROW FORMAT SERDE |
>
> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
>
> | STORED AS INPUTFORMAT |
>
> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>
> | OUTPUTFORMAT |
>
> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
>
> | LOCATION |
>
> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
>
> | TBLPROPERTIES ( |
>
> | 'COLUMN_STATS_ACCURATE'='true', |
>
> | 'numFiles'='1', |
>
> | 'numRows'='1826', |
>
> | 'orc.bloom.filter.columns'='TIME_ID', |
>
> | 'orc.bloom.filter.fpp'='0.05', |
>
> | 'orc.compress'='SNAPPY', |
>
> | 'orc.create.index'='true', |
>
> | 'orc.row.index.stride'='10000', |
>
> | 'orc.stripe.size'='268435456', |
>
> | 'rawDataSize'='0', |
>
> | 'totalSize'='11155', |
>
> | 'transient_lastDdlTime'='1451429900') |
>
>
>
> ;
>
>
>
>
>
>
>
>
> *From:* Jörn Franke [mailto:jornfranke@gmail.com <jo...@gmail.com>]
> *Sent:* 30 December 2015 08:28
> *To:* user@hive.apache.org
> *Subject:* Re: Running the same query on 1 billion rows fact table in
> Hive on Spark compared to Sybase IQ columnar database
>
>
>
> Have you tried it with Hive on TEZ? It contains (currently) more
> optimizations than Hive on Spark.
>
> I assume you use the latest Hive version.
>
> Additionally you may want to think about calculating statistics (depending
> on your configuration you need to trigger it) - I am not sure if Spark can
> use them.
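For reference, statistics can be triggered manually with ANALYZE TABLE. This is a minimal sketch using the table and column names from the posted DDL; the DDL already shows COLUMN_STATS_ACCURATE=true, so basic statistics may be present, and whether Hive on Spark uses them is left open here, as in the message.

```sql
-- Table-level statistics (row count, total size).
ANALYZE TABLE sales COMPUTE STATISTICS;
-- Column-level statistics for the columns the first query groups and aggregates on.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount_sold;
-- Inspect what the metastore now records for the table.
DESCRIBE FORMATTED sales;
```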
>
> I am not sure if bloom filters on the columns you mention make sense. You
> may also want to increase stride size (depending on your data).
>
> Currently you bucket by a lot of fields, which may not make sense. You
> also may want to sort the data by customer Id in the table.
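To make the bucketing suggestion concrete, here is a sketch of a copy of the table bucketed and sorted on cust_id only. The table name sales_by_cust is hypothetical, the bucket count of 256 is simply carried over from the posted DDL, and the column list is copied from it; none of this has been benchmarked in this thread.

```sql
-- Hypothetical table: same columns as sales, bucketed and sorted by cust_id only.
CREATE TABLE sales_by_cust (
  prod_id       bigint,
  cust_id       bigint,
  time_id       timestamp,
  channel_id    bigint,
  promo_id      bigint,
  quantity_sold decimal(10,0),
  amount_sold   decimal(10,0))
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- In Hive 1.x these make inserts honour the bucketing and sort definitions.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

INSERT OVERWRITE TABLE sales_by_cust SELECT * FROM sales;
```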
>
> You also seem to have a lot of reducers, which you may want to decrease.
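The log shows 1009 tasks in Stage-1, which, as far as I know, coincides with the default value of hive.exec.reducers.max in recent Hive releases, so the cap may simply have been reached. A sketch of how the reducer count could be inspected and lowered follows; the numeric values are illustrative assumptions, not recommendations, and Stage-1 takes under 30 seconds in the posted log, so this is unlikely to be the main cost.

```sql
-- Print the current values (a set command without a value shows the setting).
set hive.exec.reducers.bytes.per.reducer;
set hive.exec.reducers.max;

-- Illustrative values only: more bytes per reducer and a lower cap give fewer reduce tasks.
set hive.exec.reducers.bytes.per.reducer=1073741824;
set hive.exec.reducers.max=256;
```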
>
>
>
> Have you tried without "having stddev_samp" ? Is the query exactly the
> same as in Sybase?
>
>
> On 29 Dec 2015, at 11:53, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>
> Hi,
>
>
>
> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1
> billion rows as follows:
>
>
>
> show create table sales;
>
>
> +-------------------------------------------------------------------------------+--+
> | createtab_stmt |
> +-------------------------------------------------------------------------------+--+
> | CREATE TABLE `sales`( |
> | `prod_id` bigint, |
> | `cust_id` bigint, |
> | `time_id` timestamp, |
> | `channel_id` bigint, |
> | `promo_id` bigint, |
> | `quantity_sold` decimal(10,0), |
> | `amount_sold` decimal(10,0)) |
> | CLUSTERED BY ( |
> | prod_id, |
> | cust_id, |
> | time_id, |
> | channel_id, |
> | promo_id) |
> | INTO 256 BUCKETS |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
> | LOCATION |
> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
> | TBLPROPERTIES ( |
> | 'COLUMN_STATS_ACCURATE'='true', |
> | 'last_modified_by'='hduser', |
> | 'last_modified_time'='1451305626', |
> | 'numFiles'='11', |
> | 'numRows'='1000000000', |
> | 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
> | 'orc.bloom.filter.fpp'='0.05', |
> | 'orc.compress'='SNAPPY', |
> | 'orc.create.index'='true', |
> | 'orc.row.index.stride'='10000', |
> | 'orc.stripe.size'='268435456', |
> | 'rawDataSize'='296000000000', |
> | 'totalSize'='2678882153', |
> | 'transient_lastDdlTime'='1451305626') |
> +-------------------------------------------------------------------------------+--+
>
>
>
> I use the following query to run against sales table only against Hive
>
>
>
> SELECT
>
> rs.Customer_ID
>
> , rs.Number_of_orders
>
> , rs.Total_customer_amount
>
> , rs.Average_order
>
> , rs.Standard_deviation
>
> FROM
>
> (
>
> SELECT cust_id AS Customer_ID,
>
> COUNT(amount_sold) AS Number_of_orders,
>
> SUM(amount_sold) AS Total_customer_amount,
>
> AVG(amount_sold) AS Average_order,
>
> stddev_samp(amount_sold) AS Standard_deviation
>
> FROM sales
>
> GROUP BY cust_id
>
> HAVING SUM(amount_sold) > 94000
>
> AND AVG(amount_sold) < stddev_samp(amount_sold)
>
> ) rs
>
> ORDER BY
>
> -- Total_customer_amount DESC
>
> 3 DESC
>
>
>
> Hive comes back in 17 minutes with 5,948 rows
>
>
>
> bl -f sales.hql > sales.log
>
> Connecting to jdbc:hive2://rhes564:10010/default
>
> Connected to: Apache Hive (version 1.2.1)
>
> Driver: Hive JDBC (version 1.2.1)
>
> Transaction isolation: TRANSACTION_REPEATABLE_READ
>
> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
>
> No rows affected (0.097 seconds)
>
> No rows affected (0.001 seconds)
>
> No rows affected (0.001 seconds)
>
> No rows affected (0.038 seconds)
>
> INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
> INFO :
> Query Hive on Spark job[0] stages:
> INFO : 0
> INFO : 1
> INFO : 2
> INFO :
> Status: Running (Hive on Spark job[0])
> INFO : Job Progress Format
> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
> INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0:
> 709(+3)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0:
> 754(+3)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0:
> 797(+2)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0:
> 844(+2)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0:
> 888(+2)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0:
> 934(+2)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0:
> 981(+2)/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0:
> 1009/1009 Finished Stage-2_0: 0(+1)/1
>
> INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0:
> 1009/1009 Finished Stage-2_0: 1/1 Finished
>
> INFO : Status: Finished successfully in 1036.00 seconds
>
> *5,948 rows selected (1074.817 seconds)*
>
>
>
> So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in
> 23 seconds
>
>
>
> Sybase IQ is a columnar database so each column is created as a fast
> projection index by default. In addition I have created LF (bitmap) indexes
> on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now
> the query only touches CUST_ID.
>
>
>
> My suspicion is that it is the Standard Deviation function stddev_samp() that
> could be the bottleneck?
>
>
>
> Thanks
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept any responsibility.
>
>
>
>
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks again Jorn.
Both Hive and Sybase IQ are running on the same host. Yes for Sybase IQ I have compression enabled. The FACT table in IQ (sales) has LF (read bitmap) indexes on the time_id column. For the dimension table (times) I have time_id defined as primary key. Also Sybase IQ creates FP (fast projection) indexes on every column by default.
Anyway I am trying to download and build TEZ. Do we know which version of TEZ works with Hive 1.2.1 please? 0.8 seems to be in alpha
Thanks
Mich Talebzadeh
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 16:29
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Hmm, I think the TEZ execution engine currently has the most optimizations for Hive. What about your hardware: is it the same? Do you also have compression enabled on Sybase?
Alternatively you need to wait for Hive for interactive analytics (tez 0.8 + llap).
On 30 Dec 2015, at 13:47, Mich Talebzadeh <mich@peridale.co.uk> wrote:
Hi Jorn,
Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on TEZ. I tried the query on the MR engine and it did not fare better. I also ran it without the STDDEV function and found that the function did not slow it down.
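One way to see why dropping the function may make little difference: a sample standard deviation can be accumulated in a single pass alongside COUNT and SUM, and partial results from separate tasks can be merged afterwards. The sketch below is illustrative Python using Welford-style running statistics. It is not Hive's implementation, and `update`, `merge`, `accumulate` and `stddev_samp` are hypothetical helper names.

```python
import math

def update(state, x):
    """Fold one value into a running (count, mean, M2) state."""
    n, mean, m2 = state
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)  # M2 is the sum of squared deviations from the mean
    return (n, mean, m2)

def merge(a, b):
    """Combine two partial states, e.g. from two tasks that saw different rows."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

def accumulate(values):
    """Run update() over an iterable, starting from the empty state."""
    state = (0, 0.0, 0.0)
    for v in values:
        state = update(state, v)
    return state

def stddev_samp(state):
    """Sample standard deviation; None when fewer than two values were seen
    (a choice made for this sketch)."""
    n, _, m2 = state
    return math.sqrt(m2 / (n - 1)) if n > 1 else None

if __name__ == "__main__":
    left = accumulate([10, 20])    # rows seen by one task
    right = accumulate([30, 40])   # rows seen by another task
    print(stddev_samp(merge(left, right)))
```

Each row costs a constant amount of extra arithmetic, which is consistent with the observation that the aggregate itself was not the slow part.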
I tried a simple query as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows):
--
-- Get the total amount sold for each calendar month
--
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Now Sybase IQ comes back in around 30 seconds.
Started query at Dec 30 2015 08:14:33:399AM
(48 rows affected)
Finished query at Dec 30 2015 08:15:04:640AM
Whereas Hive with the following setting and running the same query
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
Comes back in
48 rows selected (1514.687 seconds)
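For reference, the two timings quoted above can be put side by side. This is a small arithmetic sketch in Python; `parse_iq_timestamp` is a hypothetical helper, and the format string is an assumption inferred from the two IQ timestamps shown (milliseconds after the last colon).

```python
from datetime import datetime

def parse_iq_timestamp(text):
    """Parse a timestamp such as 'Dec 30 2015 08:14:33:399AM'.
    Assumes an English locale for the month name and AM/PM marker."""
    return datetime.strptime(text, "%b %d %Y %I:%M:%S:%f%p")

# Values copied from the IQ output and the beeline summary line.
started = parse_iq_timestamp("Dec 30 2015 08:14:33:399AM")
finished = parse_iq_timestamp("Dec 30 2015 08:15:04:640AM")
iq_seconds = (finished - started).total_seconds()

hive_seconds = 1514.687
ratio = hive_seconds / iq_seconds

if __name__ == "__main__":
    print(f"IQ: {iq_seconds:.3f} s, Hive: {hive_seconds:.3f} s, ratio: {ratio:.1f}x")
```

On these figures IQ takes a little over 31 seconds and Hive is roughly 48 times slower for the same 48-row result.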
I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
+--------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------+--+
| CREATE TABLE `times`( |
| `time_id` timestamp, |
| `day_name` varchar(9), |
| `day_number_in_week` int, |
| `day_number_in_month` int, |
| `calendar_week_number` int, |
| `fiscal_week_number` int, |
| `week_ending_day` timestamp, |
| `week_ending_day_id` bigint, |
| `calendar_month_number` int, |
| `fiscal_month_number` int, |
| `calendar_month_desc` varchar(8), |
----------
| `days_in_fis_year` bigint, |
| `end_of_cal_year` timestamp, |
| `end_of_fis_year` timestamp) |
| CLUSTERED BY ( |
| time_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1826', |
| 'orc.bloom.filter.columns'='TIME_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='0', |
| 'totalSize'='11155', |
| 'transient_lastDdlTime'='1451429900') |
;
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 08:28
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Have you tried it with Hive on TEZ? It contains (currently) more optimizations than Hive on Spark.
I assume you use the latest Hive version.
Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
I am not sure if bloom filters on the columns you mention make sense. You may also want to increase stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You also may want to sort the data by customer Id in the table.
You also seem to have a lot of reducers, which you may want to decrease.
Have you tried without "having stddev_samp" ? Is the query exactly the same as in Sybase?
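On the reducer count: the 1009 tasks in Stage-1 of the log coincide with the documented default of hive.exec.reducers.max in Hive 0.14 and later. The Python sketch below shows a simplified size-based estimate (input bytes divided by hive.exec.reducers.bytes.per.reducer, capped at the maximum). Whether Hive on Spark in 1.2.1 computes it exactly this way, and from rawDataSize rather than the on-disk size, is an assumption here; `estimate_reducers` is a hypothetical helper, not a Hive API.

```python
import math

def estimate_reducers(input_bytes,
                      bytes_per_reducer=256_000_000,  # documented default, Hive 0.14+
                      max_reducers=1009):             # documented default, Hive 0.14+
    """Simplified estimate: one reducer per bytes_per_reducer of input, capped."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

# Sizes taken from the TBLPROPERTIES of the sales table.
from_raw_size = estimate_reducers(296_000_000_000)   # 'rawDataSize'
from_disk_size = estimate_reducers(2_678_882_153)    # 'totalSize'

if __name__ == "__main__":
    print(from_raw_size, from_disk_size)
```

Under these assumptions the uncompressed estimate hits the cap, while the compressed size would ask for only about a dozen reducers, which may be closer to what this query needs.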
On 29 Dec 2015, at 11:53, Mich Talebzadeh <mich@peridale.co.uk> wrote:
Hi,
I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion rows as follows:
show create table sales;
+-------------------------------------------------------------------------------+--+
| createtab_stmt |
+-------------------------------------------------------------------------------+--+
| CREATE TABLE `sales`( |
| `prod_id` bigint, |
| `cust_id` bigint, |
| `time_id` timestamp, |
| `channel_id` bigint, |
| `promo_id` bigint, |
| `quantity_sold` decimal(10,0), |
| `amount_sold` decimal(10,0)) |
| CLUSTERED BY ( |
| prod_id, |
| cust_id, |
| time_id, |
| channel_id, |
| promo_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hduser', |
| 'last_modified_time'='1451305626', |
| 'numFiles'='11', |
| 'numRows'='1000000000', |
| 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='296000000000', |
| 'totalSize'='2678882153', |
| 'transient_lastDdlTime'='1451305626') |
+-------------------------------------------------------------------------------+--+
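The table properties above allow a rough back-of-the-envelope check. The Python sketch below derives the compression ratio and an upper bound on scan throughput, using the 1036.00 seconds reported by the job log in this message. Note that rawDataSize is Hive's own estimate of the uncompressed size, so the ratio is indicative only, and the variable names are chosen for this illustration.

```python
# Figures copied from TBLPROPERTIES and from the job log.
RAW_DATA_SIZE = 296_000_000_000   # 'rawDataSize', estimated uncompressed bytes
TOTAL_SIZE = 2_678_882_153        # 'totalSize', bytes on HDFS (ORC + SNAPPY)
NUM_ROWS = 1_000_000_000          # 'numRows'
JOB_SECONDS = 1036.00             # "Finished successfully in 1036.00 seconds"

compression_ratio = RAW_DATA_SIZE / TOTAL_SIZE
bytes_per_row_on_disk = TOTAL_SIZE / NUM_ROWS

# Upper bound: assumes every byte of every file is read, although the
# query only needs the cust_id and amount_sold columns.
upper_bound_mb_per_s = TOTAL_SIZE / JOB_SECONDS / 1e6

if __name__ == "__main__":
    print(f"compression ratio  ~{compression_ratio:.1f}x")
    print(f"bytes/row on disk  ~{bytes_per_row_on_disk:.2f}")
    print(f"scan rate at most  ~{upper_bound_mb_per_s:.2f} MB/s")
```

Even if the whole 2.7 GB were read, the job would average under 3 MB/s, which suggests (but does not prove) that the volume of data on disk is not what limits the run.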
I run the following query against the sales table only, in Hive:
SELECT
rs.Customer_ID
, rs.Number_of_orders
, rs.Total_customer_amount
, rs.Average_order
, rs.Standard_deviation
FROM
(
SELECT cust_id AS Customer_ID,
COUNT(amount_sold) AS Number_of_orders,
SUM(amount_sold) AS Total_customer_amount,
AVG(amount_sold) AS Average_order,
stddev_samp(amount_sold) AS Standard_deviation
FROM sales
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
AND AVG(amount_sold) < stddev_samp(amount_sold)
) rs
ORDER BY
-- Total_customer_amount DESC
3 DESC
Hive comes back in 17 minutes with 5,948 rows
bl -f sales.hql > sales.log
Connecting to jdbc:hive2://rhes564:10010/default
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
No rows affected (0.097 seconds)
No rows affected (0.001 seconds)
No rows affected (0.001 seconds)
No rows affected (0.038 seconds)
INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
INFO :
Query Hive on Spark job[0] stages:
INFO : 0
INFO : 1
INFO : 2
INFO :
Status: Running (Hive on Spark job[0])
INFO : Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
INFO : Status: Finished successfully in 1036.00 seconds
5,948 rows selected (1074.817 seconds)
So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 seconds
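The progress lines above also show where the 17 minutes go. The Python sketch below parses four of those lines (copied verbatim) and reports when each stage first appears complete, measured from the first progress line. `stage_finish_times` and the regular expression are hypothetical helpers written for this log format only.

```python
import re
from datetime import datetime

# Four progress lines copied from the log, without the "INFO :" prefix.
LOG = """\
2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
"""

# Matches e.g. "Stage-1_0: 417(+3)/1009" -> ("Stage-1", "417", "1009")
STAGE = re.compile(r"(Stage-\d+)_\d+: (\d+)(?:\([^)]*\))?/(\d+)")

def stage_finish_times(log_text):
    """Map each stage to the seconds elapsed (since the first line)
    at which succeeded tasks first equal total tasks."""
    first = None
    finished = {}
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        if first is None:
            first = stamp
        for stage, done, total in STAGE.findall(line):
            if done == total and stage not in finished:
                finished[stage] = (stamp - first).total_seconds()
    return finished

if __name__ == "__main__":
    times = stage_finish_times(LOG)
    for stage in sorted(times):
        print(stage, f"{times[stage]:.3f} s")
```

On these lines Stage-0 (11 tasks, with at most two shown running at a time) completes about 1005 seconds in, while Stage-1 with its 1009 tasks takes roughly 27 seconds more. That points at the initial scan stage and its low parallelism rather than at the aggregation, though the log alone cannot say why each scan task is slow.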
Sybase IQ is a columnar database so each column is created as a fast projection index by default. In addition I have created LF (bitmap) indexes on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now the query only touches CUST_ID.
My suspicion is that it is the Standard Deviation function stddev_samp() that could be the bottleneck?
Thanks
Mich Talebzadeh
Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Jörn Franke <jo...@gmail.com>.
Hmm, I think the TEZ execution engine currently has the most optimizations for Hive. What about your hardware: is it the same? Do you also have compression enabled on Sybase?
Alternatively you need to wait for Hive for interactive analytics (tez 0.8 + llap).
> On 30 Dec 2015, at 13:47, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>
> Hi Jorn,
>
> Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on TEZ. I tried the query on the MR engine and it did not fare better. I also ran it without the STDDEV function and found that the function did not slow it down.
>
> I tried a simple query as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows):
>
> --
> -- Get the total amount sold for each calendar month
> --
> SELECT t.calendar_month_desc, SUM(s.amount_sold)
> FROM sales s, times t WHERE s.time_id = t.time_id
> GROUP BY t.calendar_month_desc;
>
> Now Sybase IQ comes back in around 30 seconds.
>
> Started query at Dec 30 2015 08:14:33:399AM
> (48 rows affected)
> Finished query at Dec 30 2015 08:15:04:640AM
>
> Whereas Hive with the following setting and running the same query
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
>
> Comes back in
>
> 48 rows selected (1514.687 seconds)
>
> I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
>
>
>
> +--------------------------------------------------------------------+--+
> | createtab_stmt |
> +--------------------------------------------------------------------+--+
> | CREATE TABLE `times`( |
> | `time_id` timestamp, |
> | `day_name` varchar(9), |
> | `day_number_in_week` int, |
> | `day_number_in_month` int, |
> | `calendar_week_number` int, |
> | `fiscal_week_number` int, |
> | `week_ending_day` timestamp, |
> | `week_ending_day_id` bigint, |
> | `calendar_month_number` int, |
> | `fiscal_month_number` int, |
> | `calendar_month_desc` varchar(8), |
> ----------
> | `days_in_fis_year` bigint, |
> | `end_of_cal_year` timestamp, |
> | `end_of_fis_year` timestamp) |
> | CLUSTERED BY ( |
> | time_id) |
> | INTO 256 BUCKETS |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
> | LOCATION |
> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
> | TBLPROPERTIES ( |
> | 'COLUMN_STATS_ACCURATE'='true', |
> | 'numFiles'='1', |
> | 'numRows'='1826', |
> | 'orc.bloom.filter.columns'='TIME_ID', |
> | 'orc.bloom.filter.fpp'='0.05', |
> | 'orc.compress'='SNAPPY', |
> | 'orc.create.index'='true', |
> | 'orc.row.index.stride'='10000', |
> | 'orc.stripe.size'='268435456', |
> | 'rawDataSize'='0', |
> | 'totalSize'='11155', |
> | 'transient_lastDdlTime'='1451429900') |
>
> ;
>
>
>
> From: Jörn Franke [mailto:jornfranke@gmail.com]
> Sent: 30 December 2015 08:28
> To: user@hive.apache.org
> Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
>
> Have you tried it with Hive on TEZ? It contains (currently) more optimizations than Hive on Spark.
> I assume you use the latest Hive version.
> Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
> I am not sure if bloom filters on the columns you mention make sense. You may also want to increase stride size (depending on your data).
> Currently you bucket by a lot of fields, which may not make sense. You also may want to sort the data by customer Id in the table.
> You also seem to have a lot of reducers, which you may want to decrease.
>
> Have you tried without "having stddev_samp" ? Is the query exactly the same as in Sybase?
>
> On 29 Dec 2015, at 11:53, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
>
> Hi,
>
> I have a fact table in Hive imported from Sybase IQ via SQOOP with 1 billion rows as follows:
>
> show create table sales;
> +-------------------------------------------------------------------------------+--+
> | createtab_stmt |
> +-------------------------------------------------------------------------------+--+
> | CREATE TABLE `sales`( |
> | `prod_id` bigint, |
> | `cust_id` bigint, |
> | `time_id` timestamp, |
> | `channel_id` bigint, |
> | `promo_id` bigint, |
> | `quantity_sold` decimal(10,0), |
> | `amount_sold` decimal(10,0)) |
> | CLUSTERED BY ( |
> | prod_id, |
> | cust_id, |
> | time_id, |
> | channel_id, |
> | promo_id) |
> | INTO 256 BUCKETS |
> | ROW FORMAT SERDE |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
> | STORED AS INPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
> | OUTPUTFORMAT |
> | 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
> | LOCATION |
> | 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales' |
> | TBLPROPERTIES ( |
> | 'COLUMN_STATS_ACCURATE'='true', |
> | 'last_modified_by'='hduser', |
> | 'last_modified_time'='1451305626', |
> | 'numFiles'='11', |
> | 'numRows'='1000000000', |
> | 'orc.bloom.filter.columns'='PROD_ID,CUST_ID,TIME_ID,CHANNEL_ID,PROMO_ID', |
> | 'orc.bloom.filter.fpp'='0.05', |
> | 'orc.compress'='SNAPPY', |
> | 'orc.create.index'='true', |
> | 'orc.row.index.stride'='10000', |
> | 'orc.stripe.size'='268435456', |
> | 'rawDataSize'='296000000000', |
> | 'totalSize'='2678882153', |
> | 'transient_lastDdlTime'='1451305626') |
> +-------------------------------------------------------------------------------+--+
>
> I run the following query against the sales table only, in Hive:
>
> SELECT
> rs.Customer_ID
> , rs.Number_of_orders
> , rs.Total_customer_amount
> , rs.Average_order
> , rs.Standard_deviation
> FROM
> (
> SELECT cust_id AS Customer_ID,
> COUNT(amount_sold) AS Number_of_orders,
> SUM(amount_sold) AS Total_customer_amount,
> AVG(amount_sold) AS Average_order,
> stddev_samp(amount_sold) AS Standard_deviation
> FROM sales
> GROUP BY cust_id
> HAVING SUM(amount_sold) > 94000
> AND AVG(amount_sold) < stddev_samp(amount_sold)
> ) rs
> ORDER BY
> -- Total_customer_amount DESC
> 3 DESC
>
> Hive comes back in 17 minutes with 5,948 rows
>
> bl -f sales.hql > sales.log
> Connecting to jdbc:hive2://rhes564:10010/default
> Connected to: Apache Hive (version 1.2.1)
> Driver: Hive JDBC (version 1.2.1)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Running init script /home/hduser/dba/bin/hive_on_spark_init.hql
> No rows affected (0.097 seconds)
> No rows affected (0.001 seconds)
> No rows affected (0.001 seconds)
> No rows affected (0.038 seconds)
> INFO : Warning: Using constant number 3 in order by. If you try to use position alias when hive.groupby.orderby.position.alias is false, the position alias will be ignored.
> INFO :
> Query Hive on Spark job[0] stages:
> INFO : 0
> INFO : 1
> INFO : 2
> INFO :
> Status: Running (Hive on Spark job[0])
> INFO : Job Progress Format
> CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
> INFO : 2015-12-29 09:33:25,815 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:28,829 Stage-0_0: 0/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:31,857 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:34,875 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:37,903 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:40,918 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:43,939 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:46,958 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:49,971 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:52,991 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:33:56,007 Stage-0_0: 0(+2)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
>
> INFO : 2015-12-29 09:50:03,578 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:06,590 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:09,602 Stage-0_0: 10(+1)/11 Stage-1_0: 0/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:10,606 Stage-0_0: 11/11 Finished Stage-1_0: 0(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:11,610 Stage-0_0: 11/11 Finished Stage-1_0: 6(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:12,618 Stage-0_0: 11/11 Finished Stage-1_0: 30(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:13,622 Stage-0_0: 11/11 Finished Stage-1_0: 59(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:14,626 Stage-0_0: 11/11 Finished Stage-1_0: 90(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:15,631 Stage-0_0: 11/11 Finished Stage-1_0: 124(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:16,654 Stage-0_0: 11/11 Finished Stage-1_0: 160(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:17,659 Stage-0_0: 11/11 Finished Stage-1_0: 193(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:18,663 Stage-0_0: 11/11 Finished Stage-1_0: 228(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:19,667 Stage-0_0: 11/11 Finished Stage-1_0: 262(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:20,672 Stage-0_0: 11/11 Finished Stage-1_0: 298(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:21,679 Stage-0_0: 11/11 Finished Stage-1_0: 338(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:22,687 Stage-0_0: 11/11 Finished Stage-1_0: 376(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:23,691 Stage-0_0: 11/11 Finished Stage-1_0: 417(+3)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:24,696 Stage-0_0: 11/11 Finished Stage-1_0: 460(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:25,699 Stage-0_0: 11/11 Finished Stage-1_0: 502(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:26,707 Stage-0_0: 11/11 Finished Stage-1_0: 542(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:27,712 Stage-0_0: 11/11 Finished Stage-1_0: 584(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:28,719 Stage-0_0: 11/11 Finished Stage-1_0: 624(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:29,730 Stage-0_0: 11/11 Finished Stage-1_0: 667(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:30,736 Stage-0_0: 11/11 Finished Stage-1_0: 709(+3)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:31,740 Stage-0_0: 11/11 Finished Stage-1_0: 754(+3)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:32,743 Stage-0_0: 11/11 Finished Stage-1_0: 797(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:33,747 Stage-0_0: 11/11 Finished Stage-1_0: 844(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:34,754 Stage-0_0: 11/11 Finished Stage-1_0: 888(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:35,759 Stage-0_0: 11/11 Finished Stage-1_0: 934(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:36,764 Stage-0_0: 11/11 Finished Stage-1_0: 981(+2)/1009 Stage-2_0: 0/1
> INFO : 2015-12-29 09:50:37,768 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 0(+1)/1
> INFO : 2015-12-29 09:50:38,771 Stage-0_0: 11/11 Finished Stage-1_0: 1009/1009 Finished Stage-2_0: 1/1 Finished
> INFO : Status: Finished successfully in 1036.00 seconds
> 5,948 rows selected (1074.817 seconds)
>
> So it returns 5948 rows in 17 minutes. In contrast IQ returns 5947 rows in 23 seconds
>
> Sybase IQ is a columnar database so each column is created as a fast projection index by default. In addition I have created LF (bitmap) indexes on dimension columns (PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID). Now the query only touches CUST_ID.
>
> My suspicion is that it is the Standard Deviation function stddev_samp() that could be the bottleneck?
>
> Thanks
>
> Mich Talebzadeh
RE: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Hi Jorn,
Thanks for your reply. My Hive version is 1.2.1 on Spark 1.3.1. I have not tried it on Tez. I tried the query on the MR engine and it did not fare any better. I also ran it without the STDDEV function and found that the function did not slow it down.
I tried a simple query as follows, built on the sales FACT table (1e9 rows) and the dimension table times (1826 rows):
--
-- Get the total amount sold for each calendar month
--
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s, times t WHERE s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
Now Sybase IQ comes back in around 30 seconds.
Started query at Dec 30 2015 08:14:33:399AM
(48 rows affected)
Finished query at Dec 30 2015 08:15:04:640AM
Whereas Hive, running the same query with the following settings:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
Comes back in
48 rows selected (1514.687 seconds)
I don’t know what else can be done. Obviously this is all schema on read so I am not sure I can change bucketing on FACT table based on one query alone!
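One option that could be tested here, given that the times dimension holds only 1826 rows (totalSize 11155 bytes in its table properties), is a plain map join, which would avoid shuffling the fact table on time_id. The sketch below is an illustration under assumptions and was not tried in this thread: the size threshold is an arbitrary example value, and whether Hive 1.2.1 on Spark 1.3.1 actually chooses a map join has to be confirmed from the EXPLAIN output.

```sql
-- Assumption: these session settings are honoured by the engine in use.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
-- Illustrative threshold in bytes; not a tuned value.
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Same query as above, written with an explicit JOIN ... ON.
-- EXPLAIN shows the plan without running the job; look for a map join operator.
EXPLAIN
SELECT t.calendar_month_desc, SUM(s.amount_sold)
FROM sales s
JOIN times t ON s.time_id = t.time_id
GROUP BY t.calendar_month_desc;
```

If the plan still shows a shuffle join, the settings above had no effect for this engine and can be dropped.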
+--------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------+--+
| CREATE TABLE `times`( |
| `time_id` timestamp, |
| `day_name` varchar(9), |
| `day_number_in_week` int, |
| `day_number_in_month` int, |
| `calendar_week_number` int, |
| `fiscal_week_number` int, |
| `week_ending_day` timestamp, |
| `week_ending_day_id` bigint, |
| `calendar_month_number` int, |
| `fiscal_month_number` int, |
| `calendar_month_desc` varchar(8), |
----------
| `days_in_fis_year` bigint, |
| `end_of_cal_year` timestamp, |
| `end_of_fis_year` timestamp) |
| CLUSTERED BY ( |
| time_id) |
| INTO 256 BUCKETS |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/times' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'numFiles'='1', |
| 'numRows'='1826', |
| 'orc.bloom.filter.columns'='TIME_ID', |
| 'orc.bloom.filter.fpp'='0.05', |
| 'orc.compress'='SNAPPY', |
| 'orc.create.index'='true', |
| 'orc.row.index.stride'='10000', |
| 'orc.stripe.size'='268435456', |
| 'rawDataSize'='0', |
| 'totalSize'='11155', |
| 'transient_lastDdlTime'='1451429900') |
;
http://talebzadehmich.wordpress.com
From: Jörn Franke [mailto:jornfranke@gmail.com]
Sent: 30 December 2015 08:28
To: user@hive.apache.org
Subject: Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Have you tried it with Hive on Tez? It contains (currently) more optimizations than Hive on Spark.
I assume you use the latest Hive version.
Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
I am not sure if bloom filters on the columns you mention make sense. You may also want to increase stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You also may want to sort the data by customer Id in the table.
You also seem to have a lot of reducers, which you may want to decrease.
Have you tried without "having stddev_samp" ? Is the query exactly the same as in Sybase?
Re: Running the same query on 1 billion rows fact table in Hive on Spark compared to Sybase IQ columnar database
Posted by Jörn Franke <jo...@gmail.com>.
Have you tried it with Hive on Tez? It contains (currently) more optimizations than Hive on Spark.
I assume you use the latest Hive version.
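Switching engines for a comparison run is a per-session setting. A minimal sketch, assuming Tez is installed and configured on the cluster (the thread does not say whether it is):

```sql
-- Valid values are mr, tez and spark; tez only works if Tez is set up.
set hive.execution.engine=tez;

-- Print the value now in effect, then rerun the query under test.
set hive.execution.engine;
```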
Additionally you may want to think about calculating statistics (depending on your configuration you need to trigger it) - I am not sure if Spark can use them.
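As a rough sketch of what triggering statistics could look like for the tables in this thread (the column list is an assumption that covers only the columns the first query touches, and whether Hive on Spark benefits from them is, as noted, uncertain):

```sql
-- Table-level statistics (row count, size).
ANALYZE TABLE sales COMPUTE STATISTICS;

-- Column-level statistics for the columns used by the aggregate query.
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS cust_id, amount_sold;

-- Let the optimizer consult the collected statistics.
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
set hive.compute.query.using.stats=true;
```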
I am not sure if bloom filters on the columns you mention make sense. You may also want to increase stride size (depending on your data).
Currently you bucket by a lot of fields, which may not make sense. You also may want to sort the data by customer Id in the table.
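A possible shape for such a table, as a sketch only: the table name sales_by_cust and the stride value are hypothetical, the bloom filter properties are left out because their value is in doubt, and the two enforce settings are the Hive 1.x way to make the INSERT respect the bucketing and sort order.

```sql
-- Needed in Hive 1.x so the insert honours CLUSTERED BY / SORTED BY.
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

-- Hypothetical copy of sales, bucketed and sorted on cust_id only.
CREATE TABLE sales_by_cust (
  prod_id       bigint,
  cust_id       bigint,
  time_id       timestamp,
  channel_id    bigint,
  promo_id      bigint,
  quantity_sold decimal(10,0),
  amount_sold   decimal(10,0))
CLUSTERED BY (cust_id) SORTED BY (cust_id) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'orc.row.index.stride'='50000');  -- illustrative larger stride

INSERT OVERWRITE TABLE sales_by_cust
SELECT prod_id, cust_id, time_id, channel_id, promo_id,
       quantity_sold, amount_sold
FROM sales;
```

This layout favours queries that group or filter on cust_id; it is one choice among several and would not help a query that joins on time_id.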
You also seem to have a lot of reducers, which you may want to decrease.
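The 1009 tasks reported for Stage-1 in the posted log appear to match the default cap of hive.exec.reducers.max in Hive 0.14 and later, which would mean the reducer estimate hit the ceiling. A hedged sketch of how the count could be lowered; the numbers are illustrative assumptions, not tuned values:

```sql
-- Cap the number of reducers Hive may request.
set hive.exec.reducers.max=64;

-- Size-based estimate: one reducer per this many input bytes.
set hive.exec.reducers.bytes.per.reducer=268435456;

-- Alternative: pin the count outright instead of estimating it.
-- set mapreduce.job.reduces=32;
```

Note that in the posted log Stage-1 completes between 09:50:10 and 09:50:37, while Stage-0 (11 tasks, at most 2 running at a time) accounts for nearly all of the elapsed time, so fewer reducers may change the total runtime only slightly.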
Have you tried without "having stddev_samp" ? Is the query exactly the same as in Sybase?