Posted to user@pig.apache.org by Michał Anglart <an...@gmail.com> on 2011/01/04 22:21:52 UTC

Master thesis about Hive/Pig/MapReduce

Hi Everybody,

I'm a soon-to-graduate student of computer science at the University
of Wrocław in Poland. I'm about to start writing my master's thesis
and I'm looking for inspiration and ideas.

First of all, I want to write about MapReduce. As far as I know, nobody
at my faculty has taken it as a thesis topic, but it is interesting, so
someone should start. Lately I thought I could compare plain Java
MapReduce with Hive and Pig in terms of their performance, the
optimizations they use internally, etc. I found it a nice idea, as it
would allow me to learn both frameworks and look at the way they work.
Unfortunately, I found out that Robert Stewart from Heriot-Watt
University wrote his thesis on "Performance & Programming Comparison of
JAQL, Hive, Pig and Java", which can be found via Google. I looked
through this paper and it looks quite similar to what I wanted to do.

After this discovery I thought that a slightly different approach to
performance comparison could make a successful topic for my master's
thesis: specifically, I'm thinking about comparing the frameworks on
a real-life problem. In his paper Robert ran experiments on a few
quite simple problems such as word count, a simple join of two sets, or
log processing. I'm thinking about, first, comparing them on a
real-life problem and, second, looking for optimizations that can be
made in Pig or Hive (e.g. choosing a join strategy) and how they
affect the performance of the frameworks.

OK, after this long introduction I want to ask you: do you think this
is an interesting approach, and does it make sense? Is it worth trying?
If so, maybe you can suggest framework features that I should look at
more closely, and real-life problems that could be used in the
experiments?

I look forward to any comments. Thanks in advance.

P.S. I've posted this message on both frameworks' mailing lists, Hive and Pig.


Thanks!
Michal

Re: Master thesis about Hive/Pig/MapReduce

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Jumping in a bit late on this..

I am not sure how valuable such benchmarks are from an academic standpoint.
You wouldn't really be testing the performance of the algorithms, but of the
implementations, and with a lot of unknowns in the middle -- Pig and Hive
use different serialization formats, different code for reading and passing
around data, construct their pipelines differently, treat garbage collection
differently, and so on. I wouldn't be surprised if measuring theoretically
identical join algorithms (say, the regular hash join) gave you
different results. Moreover, all of these things are highly dependent on
tuning and memory settings, and are a moving target as both projects keep
improving their codebases.

Two ideas:

1) If you are specifically interested in benchmarks, an interesting
benchmarking problem might be doing something like adjusting the various JVM
parameters to identify what effect they have on execution of Pig and Hive
jobs, and whether the same parameters are found ideal for both. That way you
are isolating your test to a single variable (since you are only comparing
Pig to Pig, and Hive to Hive).  It would be really cool if you came up with
something that cleverly searched the total space of the JVM parameters and
identified likely best configurations, without doing an exhaustive search of
the space of course.

There are pointers at some JVM resources here:
http://www.quora.com/What-are-some-useful-tips-for-tuning-programs-running-on-the-JVM?q=jvm+tun
You might even try measuring what effects using different garbage collectors
has.

Try to do your experiments on a real cluster; I suspect results from AWS
will be unreliable, since their virtualization tech will get in your way.

2) I would love to see someone do proper cost-based optimization for either
Hive or Pig. I know several people have tried in the past but nothing that
really worked came of it... I'd be happy to help brainstorm approaches.
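
To make idea 1 a bit more concrete, here is a minimal Python sketch of a
non-exhaustive search: it samples random configurations and keeps the
fastest one. Everything in it is an assumption made for illustration. The
search space and its values are made up, and simulated_runtime is a
stand-in cost model; in a real experiment it would be replaced by code
that launches a Pig or Hive job with the given options and times it.

```python
import random

# Hypothetical search space; the values are illustrative, not recommendations.
SPACE = {
    "heap_mb": [512, 1024, 2048, 4096],
    "new_ratio": [1, 2, 3, 4, 8],
    "collector": [
        "-XX:+UseSerialGC",
        "-XX:+UseParallelGC",
        "-XX:+UseConcMarkSweepGC",
    ],
}

# Made-up penalties used only by the stand-in cost model below.
GC_PENALTY = {
    "-XX:+UseSerialGC": 30.0,
    "-XX:+UseParallelGC": 0.0,
    "-XX:+UseConcMarkSweepGC": 10.0,
}


def to_jvm_opts(cfg):
    """Render one configuration as a JVM option string."""
    return "-Xmx{heap_mb}m -XX:NewRatio={new_ratio} {collector}".format(**cfg)


def simulated_runtime(cfg):
    """Stand-in for timing a real job; the formula is invented."""
    return (
        100.0
        + 50000.0 / cfg["heap_mb"]
        + 5.0 * abs(cfg["new_ratio"] - 3)
        + GC_PENALTY[cfg["collector"]]
    )


def random_search(measure, space, budget, seed=0):
    """Try `budget` random configurations and return the fastest one."""
    rng = random.Random(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(budget):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        elapsed = measure(cfg)
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time


if __name__ == "__main__":
    cfg, seconds = random_search(simulated_runtime, SPACE, budget=20)
    print(to_jvm_opts(cfg), seconds)
```

Random sampling is just the simplest baseline here; a smarter search would
use the measurements seen so far to decide what to try next.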

-D


Re: Master thesis about Hive/Pig/MapReduce

Posted by Alan Gates <ga...@yahoo-inc.com>.

Hi Michal,

A couple of areas where you could study performance without
duplicating Robert Stewart's work come to mind.  One is in the area of
how data skew affects performance.  This is a very real-world concern,
since in my experience almost all input data is power-law
distributed.  Consider for example if you want to join a highly skewed
table against an evenly distributed table.  Using the default join
algorithm, some small subset of your reducers will get the vast
majority of the data.  Pig has a join implementation called skew join
that can handle this and evenly distribute the data.  I believe Hive
has a similar join implementation (I know at least that they planned
to, I'm not sure if it's done yet or not).  So you could test
performance of skewed joins between the two as well as skewed versus
non-skewed implementations of join.
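
The reducer imbalance described above can be illustrated with a small
Python simulation. This is only a sketch under simple assumptions, not how
Pig or Hive implement anything: the key distribution (key k has
top_count // k rows), the key % num_reducers partitioner, and the
"hot key" threshold are all made up for the example.

```python
def zipf_counts(num_keys, top_count):
    """Rows per key under a simple power law: key k has top_count // k rows."""
    return {k: top_count // k for k in range(1, num_keys + 1)}


def default_partition(counts, num_reducers):
    """Send every row of a key to one reducer, chosen by key % num_reducers."""
    loads = [0] * num_reducers
    for key, rows in counts.items():
        loads[key % num_reducers] += rows
    return loads


def skew_aware_partition(counts, num_reducers):
    """Spread the rows of hot keys over all reducers; route the rest as usual.

    A key is treated as hot when it alone exceeds an even share of the data.
    In a real join, the matching rows from the other table would then have
    to be available on every reducer that received part of the hot key.
    """
    threshold = sum(counts.values()) // num_reducers
    loads = [0] * num_reducers
    for key, rows in counts.items():
        if rows > threshold:
            base, extra = divmod(rows, num_reducers)
            for r in range(num_reducers):
                loads[r] += base + (1 if r < extra else 0)
        else:
            loads[key % num_reducers] += rows
    return loads


if __name__ == "__main__":
    counts = zipf_counts(num_keys=1000, top_count=100000)
    print("default   :", default_partition(counts, 10))
    print("skew-aware:", skew_aware_partition(counts, 10))
```

Comparing the largest entry of each list shows how much work the slowest
reducer has to do in each case, which is what drives the job's total time.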

Another area that comes to mind is combining multiple grouping
operations into one MapReduce job.  This is something we see used
extensively at Yahoo, as users often want to read data once and group
it by different sets of keys.  Both Pig and Hive have support for
this.  In Pig we call it multi-query.  I think Hive calls it multiple
insert or something like that.  This is another area where you could
test performance both between Pig and Hive and between using the
multi-query algorithm and scanning the data multiple times.
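
The difference between reading the data once and scanning it once per
grouping can be sketched in plain Python. This is an illustration of the
idea only, not of how Pig or Hive do it; CountingSource and both functions
are hypothetical helpers, and counting rows per group stands in for
whatever aggregation a real script would run.

```python
from collections import Counter


class CountingSource:
    """Wraps a list of rows and counts how many full scans were requested."""

    def __init__(self, rows):
        self.rows = rows
        self.scans = 0

    def __call__(self):
        self.scans += 1
        return iter(self.rows)


def group_counts_multi_scan(read_rows, key_fields_list):
    """Scan the whole input once for each set of grouping keys."""
    results = []
    for fields in key_fields_list:
        counts = Counter()
        for row in read_rows():
            counts[tuple(row[f] for f in fields)] += 1
        results.append(counts)
    return results


def group_counts_single_scan(read_rows, key_fields_list):
    """Scan the input once and update every grouping for each row."""
    results = [Counter() for _ in key_fields_list]
    for row in read_rows():
        for counts, fields in zip(results, key_fields_list):
            counts[tuple(row[f] for f in fields)] += 1
    return results


if __name__ == "__main__":
    rows = [
        {"user": "a", "country": "PL", "page": "home"},
        {"user": "b", "country": "PL", "page": "docs"},
        {"user": "a", "country": "US", "page": "home"},
    ]
    groupings = [("user",), ("country", "page")]
    source = CountingSource(rows)
    print(group_counts_single_scan(source, groupings), source.scans)
```

Both functions return the same groups; the only difference is how many
times the input is read, which is the cost such a benchmark would measure.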

I hope those are helpful.  Whatever you choose, good luck with your  
thesis.

Alan.
