Posted to dev@hive.apache.org by Rob Stewart <ro...@googlemail.com> on 2009/11/07 19:03:20 UTC

Hive Performance

Hi there. I'm in the process of writing a paper, and as part of it I aim to
write (yet another) comparative study on various interfaces with Hadoop.

This will almost certainly include Pig and Hive, probably MapReduce, and
maybe JAQL.

I have read the papers published on the Hive JIRA (pig vs hive vs MapReduce
for 2 queries, an aggregation, and a join). I am, however, wanting to know a
bit from the Hive community.

1. Do you guys (the Hive developers) have a standardized benchmarking tool
to use prior to each Hive release? I am thinking of something similar to
PigMix, used by the Pig developers. In case you don't know, PigMix is a set
of 12 designed queries, implemented in Pig and Java Hadoop, and comparisons
are made on execution time. Does the Hive community have something similar?

2. The Pig wiki points out some unique features of Pig that allow optimal
execution performance. For instance, they have methods to optimize queries
on skewed data (by taking samples of the data for reduce key allocations). Is
there something about the implementation of Hive that gives it some
functionality not found in other interfaces? And better still, would there
be some Hive implementation that could work as a proof of concept to show any
optimized features of Hive?

3. One section suggested for investigation within the Pig development team
is to create a SQL-like language that could be compiled down through Pig to
MR jobs. If such a project were to achieve parity with Hive's SQL-like
interface, where would the distinction be between Pig and Hive?
Certainly, from a user's perspective, there would be very little difference.
If the only difference turns out to be the execution performance achieved by
one interface over another, where would this put the inferior interface (be
that either Pig or Hive) in terms of its relevance in the Hadoop software
stack?


Many thanks,


Rob Stewart

RE: Hive Performance

Posted by Ashish Thusoo <at...@facebook.com>.

There are a bunch of optimizations that deal with skewed data in Hive as well. The optimizer is rule-based and the user has to hint the query, similar to what is done in an RDBMS. We have mostly done our performance work on the benchmark published in the SIGMOD paper.
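
As an illustration of what a skew optimization for an aggregation can look like, here is a minimal Python sketch of one common technique: a two-stage count in which rows are first spread across reducers at random and partially aggregated, and the partial results are then merged by key. This is a generic sketch and not Hive's actual implementation; the function name, the reducer count, and the sample data are all assumptions made for the example.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    """Count keys in two stages so that one hot key does not overload one reducer.

    Hypothetical illustration only: returns the final counts and the number
    of rows each simulated stage-1 reducer had to process.
    """
    rng = random.Random(seed)

    # Stage 1: assign every row to a random reducer (ignoring its key) and
    # let each reducer build partial counts for the rows it received.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1

    # Stage 2: merge the partial counts by key. The input here is at most
    # (number of reducers x number of distinct keys) records, so it is small.
    totals = Counter()
    for partial in partials:
        totals.update(partial)

    loads = [sum(partial.values()) for partial in partials]
    return dict(totals), loads
```

With a plain hash partition on the key, every row of the hot key would reach the same reducer; in the sketch the first stage divides those rows roughly evenly, at the cost of a second pass.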

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com] 
Sent: Saturday, November 07, 2009 11:19 AM
To: hive-dev@hadoop.apache.org
Subject: Re: Hive Performance

A friend and I were discussing Pig vs Hive in general yesterday. On the surface Hive is an SQL-like language; Pig is its own language, 'Pig Latin'. However, in the end I think they both end up doing column projections, joins, etc. In the end it is a similar operation happening on the same cluster. So performance-wise I expect the performance will eventually be similar. Now, Pig offering more SQL support is a large undertaking.

While Pig looks very versatile, I recently emulated the example on Cloudera's blog for GeoIP-locating traffic in Pig. I did this in Hive with an external Perl script using map/transform. (It did not take a page-long Pig program.) I also think the Hive UDF framework can be used in place of many piggybank functions. Also, unless I am missing something, a UDF is native Java. Seems like piggybank functions are going to be piping/streaming output; I can't see that performing better.

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds something like T-SQL, will we need Pig?


Re: Hive Performance

Posted by Edward Capriolo <ed...@gmail.com>.

A friend and I were discussing Pig vs Hive in general yesterday. On
the surface Hive is an SQL-like language; Pig is its own language, 'Pig
Latin'. However, in the end I think they both end up doing column
projections, joins, etc. In the end it is a similar operation happening
on the same cluster. So performance-wise I expect the performance will
eventually be similar. Now, Pig offering more SQL support is a large
undertaking.

While Pig looks very versatile, I recently emulated the example on
Cloudera's blog for GeoIP-locating traffic in Pig. I did this in Hive
with an external Perl script using map/transform. (It did not take a
page-long Pig program.) I also think the Hive UDF framework can be used
in place of many piggybank functions. Also, unless I am missing
something, a UDF is native Java. Seems like piggybank functions are
going to be piping/streaming output; I can't see that performing
better.
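
For readers unfamiliar with the map/transform approach mentioned above: Hive hands each input row to the external script as a tab-separated line on stdin and reads tab-separated lines back from stdout. The sketch below shows the shape of such a script. It is written in Python rather than Perl, and the prefix table, the function names, and the assumption that the IP address is the first column are all hypothetical; a real script would query an actual GeoIP database.

```python
import sys

# Hypothetical stand-in for a GeoIP database: address prefix -> label.
PREFIX_TO_COUNTRY = {
    "10.": "INTERNAL",
    "192.168.": "INTERNAL",
}


def lookup_country(ip):
    """Return the label for the first matching prefix, or UNKNOWN."""
    for prefix, country in PREFIX_TO_COUNTRY.items():
        if ip.startswith(prefix):
            return country
    return "UNKNOWN"


def transform_rows(lines):
    """Turn tab-separated input rows into 'ip<TAB>country' output rows.

    Assumes the IP address is the first column of every input row;
    any remaining columns are ignored.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        ip = fields[0]
        yield "\t".join([ip, lookup_country(ip)])


def main(inp=sys.stdin, out=sys.stdout):
    """Stream rows from inp to out, one output line per input line."""
    for row in transform_rows(inp):
        out.write(row + "\n")
```

To use it as a streaming script, the file would end with a call to main() and be named in the USING clause of a TRANSFORM query.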

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds
something like T-SQL, will we need Pig?
