Posted to user@hive.apache.org by 김동원 <ea...@gmail.com> on 2017/01/30 10:57:54 UTC

Experimental results using TPC-DS (versus Spark and Presto)

Hi,

Recently I ran some experiments with Hive, Spark, and Presto on the TPC-DS benchmark,
and I'd like to share the results with the community: http://www.slideshare.net/ssuser6bb12d/hive-presto-and-spark-on-tpcds-benchmark
I rely entirely on the benchmark kit from Hortonworks: https://github.com/hortonworks/hive-testbench

Here I have a question about query 72.
Hive LLAP shows better performance than Presto and Spark for most queries, but it performs very poorly on query 72.
While Presto also struggles with query 72, Spark finishes it much faster than Hive (pages 9 and 10).
I've observed a weird pattern in the CPU utilization of Presto and Hive while they execute query 72 (page 11).
When I turn off Spark's WholeStageCodeGen, Spark also takes a very long time to finish query 72 (page 12).
Did I miss some feature of Hive that would improve the performance of this kind of query?
I use the following settings for the Hive experiments: https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/testbench.settings
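
(For context, the runs boil down to invocations roughly like the following; the query file name is illustrative, and spark.sql.codegen.wholeStage is the standard Spark switch behind the WholeStageCodeGen experiment mentioned above:)

# Hive on LLAP, with the testbench settings file linked above
$ hive -i testbench.settings --database tpcds_bin_partitioned_orc_1024 -f query72.sql
# Spark, with whole-stage code generation disabled
$ spark-sql --conf spark.sql.codegen.wholeStage=false \
            --database tpcds_bin_partitioned_orc_1024 -f query72.sql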

Except for query 72, Hive with LLAP shows very good performance for both the small and the large workload.

- Dongwon Kim

Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by 김동원 <ea...@gmail.com>.
As the data size can cause confusion, as you pointed out, let me explain it briefly for others.

The benchmark size comes from the scale factor of dsdgen (the TPC-DS data generator).
If you take a look at http://eastcirclek.blogspot.kr/2016/12/loading-tpc-ds-data-into-mysql.html,
you can see that a user specifies the scale factor when executing dsdgen (with the -scale parameter in Step 3).
Using N as the scale factor results in 24 CSV files (one per TPC-DS table) collectively amounting to N gigabytes.
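
(For illustration, generating the raw data looks roughly like this; the output directory is just an example, and a scale factor of 1024 matches the database name that appears below. tpcds-setup.sh is the hive-testbench wrapper, if I recall its usage correctly.)

# generate ~1TB of raw CSV data (the scale factor is in gigabytes)
$ ./dsdgen -scale 1024 -dir /tmp/tpcds_raw_1024
# or let the hive-testbench wrapper generate, upload, and build the ORC tables
$ ./tpcds-setup.sh 1024
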
When I convert the 1TB dataset into ORC and Parquet, the data sizes are:
- ORC : 313,787,478,951 bytes
- Parquet : 544,802,530,684 bytes
(The higher compression ratio is one reason why I use ORC over Parquet.)
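
(Conceptually, the conversion is just a CREATE TABLE ... AS SELECT over the flat text tables; a minimal sketch with illustrative table names, while the actual kit also adds partitioning:)

-- convert a text-format table to ORC
CREATE TABLE store_sales_orc STORED AS ORC
AS SELECT * FROM store_sales_text;

-- and to Parquet, for the size comparison above
CREATE TABLE store_sales_parquet STORED AS PARQUET
AS SELECT * FROM store_sales_text;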

Individual table sizes are smaller than the total size.
The size of each table for the 1TB workload stored as ORC is as follows:
$ hdfs dfs -du /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db | sort -nr
138604293501  /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/store_sales
95512050042   /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/catalog_sales
50883597954   /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/web_sales
14898403615   /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/store_returns
9381171748    /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/catalog_returns
4467912025    /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/web_returns
23303614      /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/inventory
11922214      /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/customer
2010633       /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/customer_address
1183265       /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/item
698942        /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/customer_demographics
403459        /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/catalog_page
361232        /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/date_dim
132187        /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/time_dim
10943         /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/promotion
6066          /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/web_site
4911          /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/store
4686          /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/call_center
2562          /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/web_page
1836          /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/warehouse
1320          /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/ship_mode
895           /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/household_demographics
888           /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/reason
413           /user/hive/warehouse/tpcds_bin_partitioned_orc_1024.db/income_band

Considering that no single query needs all of the tables, Presto can process many queries within its memory budget.
A problem arises when the intermediate data generated by (outer-)joining two large tables causes a query to use more memory than query.max-memory or query.max-memory-per-node in the middle of execution, in which case Presto stops executing the query.
The 1TB workload causes more queries to exceed the memory limit than the smaller workload does.
Due to its in-memory nature, Presto doesn't spill during query execution, which results in no disk writes, as shown on page 11 of my slides.
I found that Presto is working on spilling data during joins, but the issue is still open: https://github.com/prestodb/presto/issues/5897.
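
(For completeness, the two limits mentioned above are set in Presto's etc/config.properties on the coordinator; the values below are only illustrative, not the ones I used:)

# etc/config.properties (illustrative values)
query.max-memory=200GB
query.max-memory-per-node=40GB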

- Dongwon

> On Jan 31, 2017, at 9:55 AM, Goden Yao <go...@gmail.com> wrote:
> 
> ORC works well with Presto too, at least.
> Can you explain a little how you ran the 1TB benchmark on 5*80 = 400GB of total memory in the Presto cluster?
> Did you use compression to fit it all in memory, or partitioned data, etc.?
> 
> 
> On Mon, Jan 30, 2017 at 3:50 PM Dongwon Kim <eastcirclek@gmail.com> wrote:
> Goun : Just to make all the engines use the same data, and I usually
> store data in ORC. I know that it can bias the results in favor of
> Hive. I did Spark experiments with Parquet, and Spark works better
> with Parquet, as is commonly believed (not included in the results though).
> 
> Goden : Oops, 128GB of main memory for the master and all the slaves
> for sure, because I'm using 80GB on each node.
> 
> Gopal : (yarn logs -applicationId $APPID) doesn't contain a line
> containing HISTORY, so it doesn't produce an svg file. Should I turn
> on some option to get the HISTORY lines into the YARN application
> log?
> 
> 2017-01-31 4:47 GMT+09:00 Goden Yao <godenyao@apache.org>:
> > was the master 128MB or 128GB memory?
> >
> >
> > On Mon, Jan 30, 2017 at 3:24 AM Gopal Vijayaraghavan <gopalv@apache.org>
> > wrote:
> >>
> >>
> >> > Hive LLAP shows better performance than Presto and Spark for most
> >> > queries, but it shows very poor performance on the execution of query 72.
> >>
> >> My suspicion would be the inventory x catalog_sales x warehouse join -
> >> assuming the column statistics are present and valid.
> >>
> >> If you could send the explain formatted plans and swimlanes for LLAP, I
> >> can probably debug this better.
> >>
> >>
> >> https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
> >>
> >> Use the "submitted to <appid>" in this to get the diagram.
> >>
> >> Cheers,
> >> Gopal
> >>
> >>
> > --
> > Goden
> -- 
> Goden


Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by Goden Yao <go...@gmail.com>.
ORC works well with Presto too, at least.
Can you explain a little how you ran the 1TB benchmark on 5*80 = 400GB of
total memory in the Presto cluster?
Did you use compression to fit it all in memory, or partitioned data,
etc.?


On Mon, Jan 30, 2017 at 3:50 PM Dongwon Kim <ea...@gmail.com> wrote:

> Goun : Just to make all the engines use the same data, and I usually
> store data in ORC. I know that it can bias the results in favor of
> Hive. I did Spark experiments with Parquet, and Spark works better
> with Parquet, as is commonly believed (not included in the results though).
>
> Goden : Oops, 128GB of main memory for the master and all the slaves
> for sure, because I'm using 80GB on each node.
>
> Gopal : (yarn logs -applicationId $APPID) doesn't contain a line
> containing HISTORY, so it doesn't produce an svg file. Should I turn
> on some option to get the HISTORY lines into the YARN application
> log?
>
> 2017-01-31 4:47 GMT+09:00 Goden Yao <go...@apache.org>:
> > was the master 128MB or 128GB memory?
> >
> >
> > On Mon, Jan 30, 2017 at 3:24 AM Gopal Vijayaraghavan <go...@apache.org>
> > wrote:
> >>
> >>
> >> > Hive LLAP shows better performance than Presto and Spark for most
> >> > queries, but it shows very poor performance on the execution of query
> 72.
> >>
> >> My suspicion would be the inventory x catalog_sales x warehouse join
> -
> >> assuming the column statistics are present and valid.
> >>
> >> If you could send the explain formatted plans and swimlanes for LLAP, I
> >> can probably debug this better.
> >>
> >>
> >>
> https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
> >>
> >> Use the "submitted to <appid>" in this to get the diagram.
> >>
> >> Cheers,
> >> Gopal
> >>
> >>
> > --
> > Goden
>
-- 
Goden

Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi Dongwon

Thanks for the presentation! Very insightful.
I just filed a bug for query 72. Hive’s CBO seems to be selecting the wrong join order. https://issues.apache.org/jira/browse/HIVE-15771
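
(As a side note, the join order the CBO picks is visible directly in the explain output; for example, on the three tables Gopal suspected earlier in the thread, rather than the full query 72, something like this shows the chosen operator order:)

-- check which join order the optimizer chooses for the suspect tables
set hive.cbo.enable=true;
EXPLAIN
SELECT count(*)
FROM inventory inv
JOIN catalog_sales cs ON cs.cs_item_sk = inv.inv_item_sk
JOIN warehouse w ON w.w_warehouse_sk = inv.inv_warehouse_sk;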

At the following link you can find a rewrite of the query which gives a much better runtime (in my testing I was able to run it in 130s at 1TB scale on a 6-node LLAP cluster).
I also disabled the date filter that turns into a NULL > NULL + 5 expression in the queries.
Ideally, we want the CBO to pick the join order of the rewritten query on its own (which should be fixed by HIVE-15771).

https://transfer.sh/E0XfV/q72.zip

The above link contains:
- the original query
- the modified query
- the explain output (txt and svg) for both

Thanks for reporting the issue and I hope this helps.

Thanks
Prasanth

On Jan 30, 2017, at 9:39 PM, 김동원 <ea...@gmail.com> wrote:

Gopal:
In the attached gopal.tar.gz, I put two svg images and two text files after rerunning query 72 with and without the following inequality:
and d3.d_date > d1.d_date + 5

FYI, I had already run the Hive experiments with and without the inequality because Presto doesn't accept it at query submission time,
but Hive's running times are not that different.
<to_gopal.tar.gz>
- Dongwon

On Jan 31, 2017, at 12:48 PM, Gopal Vijayaraghavan <go...@apache.org> wrote:


Gopal : (yarn logs -applicationId $APPID) doesn't contain a line
containing HISTORY, so it doesn't produce an svg file. Should I turn
on some option to get the HISTORY lines into the YARN application
log?

There's a config option, tez.am.log.level=INFO, which controls how much data is written to the log there.

I think there's an interval-type clause in query 72, which might be a problem.

and d3.d_date > d1.d_date + 5

That might be doing UDFToDouble(d_date) > UDFToDouble(d_date) + 5, which will evaluate to

NULL > NULL + 5

Because UDFToDouble("1997-01-01") is NULL.

So, seeing your explain would go a long way in finding out what's going on.

The swimlane raw data is also somewhat interesting to me, because I also draw a different set of graphs from the same HISTORY data, such as

http://people.apache.org/~gopalv/q21_suppliers_who_kept_orders_waiting.svg

to locate bottlenecks in the system.

Cheers,
Gopal






Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by 김동원 <ea...@gmail.com>.
Gopal:
In the attached gopal.tar.gz, I put two svg images and two text files after rerunning query 72 with and without the following inequality:
>> and d3.d_date > d1.d_date + 5

FYI, I had already run the Hive experiments with and without the inequality because Presto doesn't accept it at query submission time,
but Hive's running times are not that different.

Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Gopal : (yarn logs -applicationId $APPID) doesn't contain a line
>  containing HISTORY, so it doesn't produce an svg file. Should I turn
>  on some option to get the HISTORY lines into the YARN application
>  log?

There's a config option, tez.am.log.level=INFO, which controls how much data is written to the log there.
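
(Concretely, something along these lines; the output file name is illustrative:)

# in hive, before the query launches a Tez session, raise the AM log level:
#   set tez.am.log.level=INFO;
# then pull the application log and keep only the HISTORY records:
$ yarn logs -applicationId $APPID | grep HISTORY > query72_history.log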

I think there's an interval-type clause in query 72, which might be a problem.

> and d3.d_date > d1.d_date + 5

That might be doing UDFToDouble(d_date) > UDFToDouble(d_date) + 5, which will evaluate to

NULL > NULL + 5

Because UDFToDouble("1997-01-01") is NULL.
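
(Just a sketch of an alternative: the same filter can be written with explicit date arithmetic so that nothing gets cast to double, assuming d_date is a DATE or a 'yyyy-MM-dd' string:)

-- instead of: and d3.d_date > d1.d_date + 5
and d3.d_date > date_add(d1.d_date, 5)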

So, seeing your explain would go a long way in finding out what's going on.

The swimlane raw data is also somewhat interesting to me, because I also draw a different set of graphs from the same HISTORY data, such as

http://people.apache.org/~gopalv/q21_suppliers_who_kept_orders_waiting.svg

to locate bottlenecks in the system.

Cheers,
Gopal    




Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by Dongwon Kim <ea...@gmail.com>.
Goun : Just to make all the engines use the same data, and I usually
store data in ORC. I know that it can bias the results in favor of
Hive. I did Spark experiments with Parquet, and Spark works better
with Parquet, as is commonly believed (not included in the results though).

Goden : Oops, 128GB of main memory for the master and all the slaves
for sure, because I'm using 80GB on each node.

Gopal : (yarn logs -applicationId $APPID) doesn't contain a line
containing HISTORY, so it doesn't produce an svg file. Should I turn
on some option to get the HISTORY lines into the YARN application
log?

2017-01-31 4:47 GMT+09:00 Goden Yao <go...@apache.org>:
> was the master 128MB or 128GB memory?
>
>
> On Mon, Jan 30, 2017 at 3:24 AM Gopal Vijayaraghavan <go...@apache.org>
> wrote:
>>
>>
>> > Hive LLAP shows better performance than Presto and Spark for most
>> > queries, but it shows very poor performance on the execution of query 72.
>>
>> My suspicion would be the inventory x catalog_sales x warehouse join -
>> assuming the column statistics are present and valid.
>>
>> If you could send the explain formatted plans and swimlanes for LLAP, I
>> can probably debug this better.
>>
>>
>> https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
>>
>> Use the "submitted to <appid>" in this to get the diagram.
>>
>> Cheers,
>> Gopal
>>
>>
> --
> Goden

Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by Goden Yao <go...@apache.org>.
was the master 128MB or 128GB memory?

On Mon, Jan 30, 2017 at 3:24 AM Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
> > Hive LLAP shows better performance than Presto and Spark for most
> queries, but it shows very poor performance on the execution of query 72.
>
> My suspicion would be the inventory x catalog_sales x warehouse join -
> assuming the column statistics are present and valid.
>
> If you could send the explain formatted plans and swimlanes for LLAP, I
> can probably debug this better.
>
>
> https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
>
> Use the "submitted to <appid>" in this to get the diagram.
>
> Cheers,
> Gopal
>
>
> --
Goden

Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Hive LLAP shows better performance than Presto and Spark for most queries, but it shows very poor performance on the execution of query 72.

My suspicion would be the inventory x catalog_sales x warehouse join - assuming the column statistics are present and valid.

If you could send the explain formatted plans and swimlanes for LLAP, I can probably debug this better.

https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh

Use the "submitted to <appid>" in this to get the diagram.

Cheers,
Gopal



Re: Experimental results using TPC-DS (versus Spark and Presto)

Posted by goun na <go...@gmail.com>.
Thanks for sharing the benchmark results. May I ask why you chose ORC?

2017-01-30 19:57 GMT+09:00 김동원 <ea...@gmail.com>:

> Hi,
>
> Recently I did some experiments using Hive, Spark, and Presto using TPC-DS
> benchmark
> and I'd like to share the result with the community:
> http://www.slideshare.net/ssuser6bb12d/hive-presto-and-spark-on-tpcds-benchmark
> I entirely depend on the benchmark kit from Hortonworks:
> https://github.com/hortonworks/hive-testbench
>
> Here I have a question about query 72.
> Hive LLAP shows better performance than Presto and Spark for most queries,
> but it shows very poor performance on the execution of query 72.
> While Presto also struggles with query 72, Spark finishes the execution of
> query 72 a lot faster than Hive (page 9 and 10).
> I've observed a weird pattern in CPU utilization from Presto and Hive
> executing query 72 (page 11).
> When I turned off Spark's WholeStageCodeGen, Spark also takes a very long
> time to finish the execution of query 72 (page 12).
> Did I miss some feature of Hive to improve the performance of that kind of
> query?
> I use the following setting for Hive experiments:
> https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/testbench.settings
>
> Except query 72, Hive with LLAP shows very good performance for both small
> and large workload anyway.
>
> - Dongwon Kim
>