Posted to user@hive.apache.org by Zheng Shao <zs...@facebook.com> on 2009/06/19 06:29:35 UTC

A simple performance benchmark for Hadoop, Hive and Pig

Hi all,

Yuntao Jia, our intern this summer, did a simple performance benchmark for Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper: A Comparison of Approaches to Large-Scale Data Analysis

The report and the performance test kit are both attached here:
http://issues.apache.org/jira/browse/HIVE-396


We tried our best to get good performance out of Hive and Pig, and we kept the Hadoop program as close as possible to the one in the SIGMOD paper.  We welcome all suggestions on how to improve performance further, whether by changing the configuration or by improving the code.


While we tried our best to be fair, system settings and environments affect the results a lot.  So we encourage everybody to try the performance test kit on their own cluster, and we would appreciate it if everybody shared their results.


Here is the summary.  The details are in the report hive_benchmark_2009-06-18.pdf from the link above.

Query: GREP SELECT
Hadoop: 136.1s
Hive:   125.4s
Pig:    247.8s

Query: RANKINGS SELECT
Hadoop: 26.1s
Hive:   31.0s
Pig:    38.4s

Query: USERVISITS AGGREGATION
Hadoop: 533.8s
Hive:   768.8s
Pig:    855.4s

Query: RANKINGS USERVISITS JOIN
Hadoop: 470.0s
Hive:   471.3s
Pig:    763.9s
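
As a rough aid for reading the numbers above, here is a minimal sketch that expresses each runtime relative to the Hadoop baseline. It is not part of the performance test kit attached to HIVE-396; the dictionary layout and the function name relative_to_hadoop are illustrative assumptions, and the timings are simply copied from the summary.

```python
# Runtimes in seconds, copied from the summary above.
# The structure and names here are illustrative, not from the test kit.
RESULTS = {
    "GREP SELECT": {"Hadoop": 136.1, "Hive": 125.4, "Pig": 247.8},
    "RANKINGS SELECT": {"Hadoop": 26.1, "Hive": 31.0, "Pig": 38.4},
    "USERVISITS AGGREGATION": {"Hadoop": 533.8, "Hive": 768.8, "Pig": 855.4},
    "RANKINGS USERVISITS JOIN": {"Hadoop": 470.0, "Hive": 471.3, "Pig": 763.9},
}


def relative_to_hadoop(results):
    """Return {query: {system: runtime / Hadoop runtime}} for every query."""
    return {
        query: {system: secs / times["Hadoop"] for system, secs in times.items()}
        for query, times in results.items()
    }


if __name__ == "__main__":
    # Print one line per query, e.g. "Hive 0.92x" means 92% of Hadoop's time.
    for query, ratios in relative_to_hadoop(RESULTS).items():
        line = ", ".join(f"{system} {ratio:.2f}x" for system, ratio in ratios.items())
        print(f"{query}: {line}")
```

Read this way, Hive is within about 1% of Hadoop on the join and roughly 8% faster on GREP SELECT, while Pig takes roughly 1.5x to 1.8x the Hadoop time across the four queries. These ratios come from a single reported run each, so they say nothing about run-to-run variance.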

Please take a look at hive_benchmark_2009-06-18.pdf from the link above for details. Let's keep discussions on http://issues.apache.org/jira/browse/HIVE-396 so it's easier to keep track.


Zheng


RE: A simple performance benchmark for Hadoop, Hive and Pig

Posted by Zheng Shao <zs...@facebook.com>.
I completely agree with Owen on this point.  Let's move all discussions to dev lists and jira:  http://issues.apache.org/jira/browse/HIVE-396

I was confused by seeing so many automatic emails in the dev mailing list.

Zheng
-----Original Message-----
From: Owen O'Malley [mailto:owen.omalley@gmail.com] 
Sent: Friday, June 19, 2009 10:03 AM
To: core-user@hadoop.apache.org; pig-user@hadoop.apache.org; hive-user@hadoop.apache.org
Subject: Re: A simple performance benchmark for Hadoop, Hive and Pig

On Thu, Jun 18, 2009 at 9:29 PM, Zheng Shao <zs...@facebook.com> wrote:


> Yuntao Jia, our intern this summer, did a simple performance benchmark for
> Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper: A
> Comparison of Approaches to Large-Scale Data Analysis


It should be noted that no one on the Pig team was involved in setting up
the benchmarks and the queries don't follow the Pig cookbook suggestions for
writing efficient queries, so these results should be considered *extremely*
preliminary. Furthermore, I can't see any way that Hive should be able to
beat raw map/reduce, since Hive uses map/reduce to run the job.

In the future, it would be better to involve the respective communities
(mapreduce-dev and pig-dev) far before pushing benchmark results out to the
user lists. The Hadoop project, which includes all three subprojects, needs
to be a cooperative community that is trying to build the best software we
can. Getting benchmark numbers is good, but it is better done in a
collaborative manner.

-- Owen

Fwd: A simple performance benchmark for Hadoop, Hive and Pig

Posted by Owen O'Malley <ow...@gmail.com>.
This bounced since I wasn't subscribed. It should have been moderated through...

-- Owen


---------- Forwarded message ----------
From: Owen O'Malley <ow...@gmail.com>
Date: Fri, Jun 19, 2009 at 10:03 AM
Subject: Re: A simple performance benchmark for Hadoop, Hive and Pig
To: core-user@hadoop.apache.org, pig-user@hadoop.apache.org,
hive-user@hadoop.apache.org


On Thu, Jun 18, 2009 at 9:29 PM, Zheng Shao <zs...@facebook.com> wrote:

>
> Yuntao Jia, our intern this summer, did a simple performance benchmark for Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper: A Comparison of Approaches to Large-Scale Data Analysis

It should be noted that no one on the Pig team was involved in setting
up the benchmarks and the queries don't follow the Pig cookbook
suggestions for writing efficient queries, so these results should be
considered *extremely* preliminary. Furthermore, I can't see any way
that Hive should be able to beat raw map/reduce, since Hive uses
map/reduce to run the job.

In the future, it would be better to involve the respective
communities (mapreduce-dev and pig-dev) far before pushing benchmark
results out to the user lists. The Hadoop project, which includes all
three subprojects, needs to be a cooperative community that is trying
to build the best software we can. Getting benchmark numbers is good,
but it is better done in a collaborative manner.

-- Owen

Re: A simple performance benchmark for Hadoop, Hive and Pig

Posted by Owen O'Malley <ow...@gmail.com>.
On Thu, Jun 18, 2009 at 9:29 PM, Zheng Shao <zs...@facebook.com> wrote:


> Yuntao Jia, our intern this summer, did a simple performance benchmark for
> Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper: A
> Comparison of Approaches to Large-Scale Data Analysis


It should be noted that no one on the Pig team was involved in setting up
the benchmarks and the queries don't follow the Pig cookbook suggestions for
writing efficient queries, so these results should be considered *extremely*
preliminary. Furthermore, I can't see any way that Hive should be able to
beat raw map/reduce, since Hive uses map/reduce to run the job.

In the future, it would be better to involve the respective communities
(mapreduce-dev and pig-dev) far before pushing benchmark results out to the
user lists. The Hadoop project, which includes all three subprojects, needs
to be a cooperative community that is trying to build the best software we
can. Getting benchmark numbers is good, but it is better done in a
collaborative manner.

-- Owen
