Posted to user@hive.apache.org by Michał Anglart <an...@gmail.com> on 2011/01/04 22:21:07 UTC

Master thesis about Hive/Pig/MapReduce

Hi Everybody,

I'm a soon-to-graduate student of computer science at the University
of Wrocław in Poland. I'm currently starting to write my master's
thesis and I'm looking for some inspiration and ideas.

First of all, I want to write about MapReduce - as far as I know nobody
at my faculty has taken such a topic for their thesis, but it is
interesting, so someone should start. Lately I thought that I could
compare plain Java MapReduce with Hive and Pig in terms of their
performance, the optimizations they use internally, etc.
Personally I found it a nice idea, as it would allow me to learn
both frameworks and take a look at the way they work. Unfortunately, I
found out that Robert Stewart from Heriot-Watt University wrote his
thesis on "Performance & Programming Comparison of JAQL, Hive, Pig and
Java", which can be found via Google. I looked through this paper and
it looks quite similar to what I wanted to do.

After this discovery I thought that a slightly different approach to
performance comparison could prove to be a successful topic for my
master's thesis: specifically, I'm thinking about comparing the
frameworks on a real-life problem. In his paper Robert ran the
experiments on a few quite simple problems such as word count, a
simple join of two sets, or log processing. I'm thinking about, first,
comparing them on a real-life problem and, second, looking for
optimizations that can be made in Pig or Hive (e.g. choosing a join
strategy) and how they affect the performance of the frameworks.

OK, after this long introduction I want to ask you: do you think this
is an interesting approach, and does it make any sense? Is it worth
trying? If so - maybe you can suggest features of the frameworks that I
should look at more closely, and perhaps real-life problems that could
be used in the experiments?

I look forward to any comments - thanks in advance.

P.S. I've posted this message on both frameworks' mailing lists - Hive and Pig.


Thanks!
Michal