Posted to dev@pig.apache.org by Jie Li <ji...@cs.duke.edu> on 2011/11/29 20:38:08 UTC

Running TPC-H on Pig

Hello everyone,

Since people are usually most concerned about performance, we need more
benchmarks to identify the bottlenecks in Pig's performance. For a class
project we developed a full set of Pig scripts for TPC-H. Though Pig was not
designed for this RDBMS benchmark, it does support most of the relational
operators, such as join and aggregation, which can be optimized based on
this benchmark. Besides that, we can also demonstrate how to write efficient
Pig scripts by making full use of Pig Latin's features.

Here is what we did:
1) wrote correct Pig scripts for TPC-H, verifying them on 1GB of data.
2) demonstrated the flexibility of Pig Latin by using the COGROUP operator
to implement joins.
3) showed how to optimize joins by slightly reordering them or by using
replicated join. We think Pig should be able to apply more heuristic
optimizations to joins, such as evaluating the smaller join first, using
replicated join for small tables, and putting the larger table on the right
side of the hash join.
4) identified the poor performance of aggregation. Pig doesn't yet support
hash-based aggregation, so aggregation is extremely slow. Good to know
that Pig is just about to support it :)
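
To make items 2-4 concrete, here is a minimal Python sketch of the ideas
involved. It is not taken from the project's Pig scripts (which are not
included in this thread), and it runs in a single process rather than on
Hadoop. All function names and sample rows are made up for illustration.
It shows a join built from a COGROUP-style grouping, a replicated-style
join that hashes the small input and streams the large one, and a
one-pass hash-based aggregation.

```python
from collections import defaultdict


def cogroup(left, right, key_left, key_right):
    """Group both inputs by key: key -> (left rows, right rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key_left]][0].append(row)
    for row in right:
        groups[row[key_right]][1].append(row)
    return groups


def join_via_cogroup(left, right, key_left, key_right):
    """Inner join: cross the two bags of every group.

    Groups where either bag is empty contribute no output rows.
    """
    joined = []
    for left_rows, right_rows in cogroup(left, right, key_left, key_right).values():
        for l_row in left_rows:
            for r_row in right_rows:
                joined.append(l_row + r_row)
    return joined


def replicated_join(big, small, key_big, key_small):
    """Inner join that holds only the small input in memory."""
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)
    # Stream the big input and probe the in-memory table.
    return [b_row + s_row for b_row in big for s_row in table.get(b_row[key_big], [])]


def hash_aggregate(rows, key_index, value_index):
    """Sum values per key in a single pass, without sorting the input."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


# Hypothetical sample data, not TPC-H rows.
orders = [(1, "O-1"), (2, "O-2"), (1, "O-3")]   # (customer key, order id)
customers = [(1, "Alice"), (3, "Carol")]         # (customer key, name)
line_items = [("A", 10), ("B", 5), ("A", 7)]     # (flag, quantity)

if __name__ == "__main__":
    print(sorted(join_via_cogroup(orders, customers, 0, 0)))
    print(sorted(replicated_join(orders, customers, 0, 0)))
    print(hash_aggregate(line_items, 0, 1))
```

Both join functions return the same rows; the difference is only in what
has to be held in memory and whether the inputs need to be grouped first.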

As TPC-H is well known, a good benchmark result can help change people's
impression that Pig is slow. In fact, we compared Pig and Hive and found
that Pig is not necessarily slower than Hive. I wonder if we can create a
JIRA for this project.

Thanks,
Jie Li
PhD Candidate of Computer Science
Duke University

Re: Running TPC-H on Pig

Posted by Jie Li <ji...@cs.duke.edu>.

Yeah, we already have some results, but they are not so good, so we are
currently rewriting some of the scripts, especially the joins. Once we get
a good result we will publish it.

Jie

On Tue, Nov 29, 2011 at 2:41 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> I'm a little confused. Do you already have the benchmarks? I'd love to see
> them if you do. Do you want to make a JIRA in order to put this info on the
> site? I'm a little confused, but I agree that statistics can help focus
> effort and could also be a good tool for evangelism (especially if Pig is
> in fact as fast as Hive in some cases).

Re: Running TPC-H on Pig

Posted by Jonathan Coveney <jc...@gmail.com>.

I'm a little confused. Do you already have the benchmarks? I'd love to see
them if you do. Do you want to make a JIRA in order to put this info on the
site? I'm a little confused, but I agree that statistics can help focus
effort and could also be a good tool for evangelism (especially if Pig is
in fact as fast as Hive in some cases).

Re: Running TPC-H on Pig

Posted by Jie Li <ji...@cs.duke.edu>.

Yeah sure. We are just about to post them.

Jie

On Tue, Nov 29, 2011 at 8:18 PM, Jonathan Coveney <jc...@gmail.com> wrote:

> If you want some feedback on how to make the scripts faster, feel free
> to post them.

Re: Running TPC-H on Pig

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

My bad, I was talking about TPC-DS (:
I used TPC-DS to test Pig joins, but I didn't actually think of
comparing it with Hive, because Hive already has ongoing projects for
its cost-based optimizer, and I thought it wouldn't be a fair
comparison. But I guess your work is related to the Starfish system,
right?
Anyway, I hope to see your benchmark.

Renato M.


2011/12/2 Jie Li <ji...@cs.duke.edu>:
> TPC-E is for transaction processing, so why would it be better for
> evaluating Hadoop-related systems?
>
> We are benchmarking whole queries. We have found that some simple
> heuristics work very well so far. No doubt statistics would help make an
> even better query plan.
>
> Jie

Re: Running TPC-H on Pig

Posted by Jie Li <ji...@cs.duke.edu>.

TPC-E is for transaction processing, so why would it be better for
evaluating Hadoop-related systems?

We are benchmarking whole queries. We have found that some simple
heuristics work very well so far. No doubt statistics would help make an
even better query plan.

Jie

On Wed, Nov 30, 2011 at 12:18 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hey,
>
> Why didn't you use TPC-E? And what exactly are you guys benchmarking,
> i.e. specific components of both systems or whole queries? Because Hive
> is already able to use some basic statistics but Pig isn't, and at least
> until HCatalog is ready it won't be able to take full advantage of them.
>
> Renato M.

Re: Running TPC-H on Pig

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.

Hey,

Why didn't you use TPC-E? And what exactly are you guys benchmarking,
i.e. specific components of both systems or whole queries? Because Hive
is already able to use some basic statistics but Pig isn't, and at least
until HCatalog is ready it won't be able to take full advantage of them.

Renato M.
On Nov 29, 2011 8:18 PM, "Jonathan Coveney" <jc...@gmail.com> wrote:

> If you want some feedback on how to make the scripts faster, feel free
> to post them.

Re: Running TPC-H on Pig

Posted by Jonathan Coveney <jc...@gmail.com>.

If you want some feedback on how to make the scripts faster, feel free
to post them.

2011/11/29 Jie Li <ji...@cs.duke.edu>

> Did you mean the two update functions of TPC-H? I think we can leave them
> out, as Hive did, since Hadoop is usually not used for updates.
>
> Jie

Re: Running TPC-H on Pig

Posted by Jie Li <ji...@cs.duke.edu>.

Did you mean the two update functions of TPC-H? I think we can leave them
out, as Hive did, since Hadoop is usually not used for updates.

Jie

On Tue, Nov 29, 2011 at 2:42 PM, Santhosh Srinivasan <sm...@yahoo-inc.com> wrote:

> Please do. The association with TPC-H might be tricky, as it mandates
> concurrent data modification. Nevertheless, the benchmark will be very
> useful, as you point out.

RE: Running TPC-H on Pig

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.

Please do. The association with TPC-H might be tricky, as it mandates concurrent data modification. Nevertheless, the benchmark will be very useful, as you point out.
