Posted to user@pig.apache.org by Rob Stewart <ro...@googlemail.com> on 2009/10/07 15:18:12 UTC

Using Pig for a comparative Study

Hello Pig user group !

OK, here are three things about me:
1. I'm new to Pig and Hadoop.
2. I'm studying for a Masters in Software Engineering in the UK.
3. I'm looking to do a comparative study of probably two distributed systems
running over a cluster network. I have investigated Hadoop, and have deployed
Hadoop across various virtual Linux systems on this PC I'm using (which was
fun!), and my university has given me permission to use the cluster at
university to deploy Hadoop, which I'm excited about. (They may even use it
for future research, or better still, production processing!).

Anyway... I have had a look at Pig, and have worked through the various
tutorials, which are very well written, and have these tutorials working on
my virtual Hadoop cluster here on this PC, and I assume the same would be
the case on the university cluster.

I need another system, as similar as possible to Pig in function and use.
My supervisor has pointed me in the direction of CouchDB (written in
Erlang) as another tool which could potentially be used for comparison in
my studies. Reading a little about it, however, there seems to be no formal
process for distributing a CouchDB job across a cluster of nodes for
parallel processing. I have contacted the CouchDB mailing list for
clarification about this.

So, I write to you guys for four reasons:
1. To touch base, and say - "hey, I'm hoping to use Pig for a comparative
study for my Masters dissertation - Thanks !!"
2. To ask if there is any other solution out there that can be closely
compared to Pig in functionality and use.
3. To ask, if CouchDB has been benchmarked against Pig before now, where I
can find the results, or who can help me with this.
4. Am I off the mark with these questions? If so, please speak now!


thanks,

Rob Stewart

Re: Using Pig for a comparative Study

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

Hey Rob,
There's been some fairly extensive benchmarking of Pig and Hive over at
https://issues.apache.org/jira/browse/HIVE-396 and
https://issues.apache.org/jira/browse/HIVE-600 that may help you get
started.

Regards,
Jeff

Re: Using Pig for a comparative Study

Posted by Rob Stewart <ro...@googlemail.com>.

@ Santhosh - I will indeed keep this mailing list abreast of my study -
you'll probably figure out what I'm up to from the questions that'll appear
on the mailing list :-)

@ Dmitriy - What you've pointed me towards is already a massive help. I
appreciate the time you've taken to respond to my plea for help! :-) And
I've had a look around the community project documentation; your name
appears quite a lot!

OK, I'm cramming the references you've pointed out into my BibTeX database,
so they don't get lost.

I will, over the next few days, have a good look at PigMix and the Pig vs
Hive benchmark.

So for now, it's a lot of playing about with Hive and Pig, and I will have a
look at how HadoopDB functions. The creators of that project explain how
they use elements of Hive, but I need to clarify the differences between
the two before deciding which to use for my study. A HadoopDB vs Hive vs Pig
vs JAQL evaluation is not out of the question at this moment in time.

Thanks Dmitriy, I will most likely touch base down the line.


Rob





Re: Using Pig for a comparative Study

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Oh, check out HadoopDB also.


Re: Using Pig for a comparative Study

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Rob,
It's highly unlikely that Cascading would be "merged" into Hadoop due to
license issues (it's GPL, while Hadoop is Apache). But it's open source, and
the author seems to be pretty available on the mailing lists; I am not sure
how much the specifics of which source code repository the code comes from
matter for your purposes (as long as you aren't distributing the software).


There is a Hive vs Pig benchmark on the Hive jira, which reproduces queries
from the Hadoop vs RDBMS paper by Pavlo et al. The queries are a bit biased
towards the kind of stuff RDBMSes are good at, but it's a good place to
start.  Pig also has its own benchmark, called PigMix, which you can
translate into Hive / JAQL / Cascading queries. Note that the Pig version in
the trunk just got a whole lot faster, so it may be worth rerunning both of
those benchmarks.
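
A rough sketch of how the rerun timings could be collected, in Python. This is not part of PigMix or of the benchmark on the Hive JIRA; `time_command` and `summarize` are made-up helper names, and the command in the demo is only a placeholder to be swapped for the real Pig, Hive, JAQL, or Cascading invocation. It assumes wall-clock time of the whole job is the quantity of interest.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run cmd (a list of arguments) `repeats` times; return wall-clock seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed run is never recorded as a fast one.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (median, fastest, slowest) for a list of timings in seconds."""
    return statistics.median(timings), min(timings), max(timings)


if __name__ == "__main__":
    # Placeholder command (hypothetical): substitute the real benchmark job here.
    demo = [sys.executable, "-c", "pass"]
    median, fastest, slowest = summarize(time_command(demo))
    print(f"median={median:.3f}s min={fastest:.3f}s max={slowest:.3f}s")
```

Repeating each run and reporting the median rather than a single measurement helps a little with noise from other jobs on a shared cluster.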

For inducing failures, you can kill data nodes / task trackers, or you can
induce various loads on individual machines -- the CMU group that did
performance monitoring had a few common things they would do, like a "disk
hog", a "cpu hog", a "network hog" to simulate various problems that might
arise.  I suspect that since the underlying fault-tolerance model is
Hadoop's for all the systems, you will wind up with the same results for
Pig, Hive, JAQL, and Cascading.
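
To give an idea of what a "cpu hog" might look like, here is a minimal Python sketch. It is not the CMU group's tool, only an assumption about the general idea: a process that spins for a fixed time so that tasks on the same machine are starved of CPU. The name `burn_cpu` is made up for illustration.

```python
import time


def burn_cpu(seconds):
    """Spin in a tight loop for roughly `seconds` seconds; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # busy work: keeps one core fully occupied
    return iterations


if __name__ == "__main__":
    # Occupies a single core for about two seconds. Start one copy per core
    # on the chosen worker machine to load the whole node.
    print(burn_cpu(2.0))
```

Starting it on one worker node partway through a job, and comparing the runtime against an undisturbed run, gives a repeatable slow-node scenario without killing any daemons.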

It might be interesting to look at how many map-reduce steps are generated
by the different frameworks to achieve the same task (keeping in mind that
not all steps are created equal -- for example Pig often generates indexing
MR jobs that are very fast, and whose "cost" is much lower than an MR job
that requires processing all the input data).

Take a look at the "Distributed Aggregation for Data-Parallel Computing"
paper from MSR's Yu et al (SIGMOD 2009 I think? Might have the conference
wrong).  It's got an interesting analysis of different models for computing
distributed aggregations, and some criticisms of how Pig, specifically, does
it. Maybe there's some follow-up work in that?

You may also want to experiment with how the various systems deal with odd
distributions and skewed data, especially skewed data that models the real
world -- graphs of social connections or web links (with in- and out-degrees
of nodes following a power law), etc.
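
A small Python sketch of one way to produce skewed test input, assuming key frequencies should follow a Zipf-like power law (the weight of the key at rank r is 1 / r^exponent). The function name `zipf_keys` and the key naming are made up for illustration, and this models skew in key frequency only, not a full graph with in- and out-degrees.

```python
import random
from collections import Counter


def zipf_keys(n, num_keys, exponent=1.0, seed=0):
    """Draw n keys from num_keys candidates with power-law skewed frequencies."""
    rng = random.Random(seed)  # fixed seed so every system sees the same input
    keys = [f"key{rank}" for rank in range(1, num_keys + 1)]
    # Rank 1 is the most frequent key; higher ranks fall off as 1 / rank**exponent.
    weights = [1.0 / rank ** exponent for rank in range(1, num_keys + 1)]
    return rng.choices(keys, weights=weights, k=n)


if __name__ == "__main__":
    counts = Counter(zipf_keys(10000, 100))
    for key, count in counts.most_common(5):
        print(key, count)
```

Writing these keys out one per line (with whatever payload columns are needed) gives a delimited file that can be grouped or joined on the key in each system, so the few "hot" keys land on a handful of reducers.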

I think CouchDB is a red herring as far as comparing things to Pig is
concerned.  But if you want to use Erlang to write a Pig clone, no one would
stop you :-).

-D

On Wed, Oct 7, 2009 at 12:46 PM, Rob Stewart <ro...@googlemail.com> wrote:

> Hi Dmitriy, excellent response, thanks.
>
> I was predominantly looking at CouchDB simply for the fact that it's
> written in Erlang, which is a scalable, distributable language. I do
> realise that if I were to compare CouchDB with Pig/Hadoop, it would be
> difficult to argue that I was indeed comparing like for like.
>
> RE: "in what way do you intend to compare the systems". That is *the*
> question.
>
> Speed - Yes, it would be nice to be able to implement the same execution
> procedure in two different systems/languages, run it on the same cluster
> (at different times!) and compare the time it takes to execute. The
> variable here would be the size of the data to process.
>
> Architecture - A good one to discuss. Is the required infrastructure
> identical on both systems? (I know, for instance, that Dryad and Hadoop
> both have the "one master and many slaves" architecture, albeit for
> different roles.)
>
> Parallelization policy - Indeed, at what point does the execution switch
> from sequential to parallel, which nodes execute in parallel, etc.
>
> Fault tolerance - This is one I'd be keen to explore. The obvious
> advantage in using Pig for my research is that I get fault tolerance for
> free from Hadoop. Great! But I want to be able to control failures to
> analyse the performance of recovery. I would need to investigate exactly
> how to create a fault, other than killing the DataNode service using the
> Linux kill command. Answers on the back of a postcard, thanks.
>
> I've just had a quick look at JAQL. Wow, good suggestion; the core of the
> language offers: filter, transform, group, join, sort and expand. A few of
> these are matched in Pig, and JAQL can also read from delimited files,
> like Pig does. I will certainly spend time looking into this, and see if I
> can create an input file and process it using both JAQL and Pig without
> any alterations to the input, whilst generating an identical output file.
> If so, I'm in business... This would also eliminate the distributed nature
> of the systems as a variable (they both use Hadoop).
>
> I had been pointed in the direction of Dryad, and whilst I am, at this
> stage, open to suggestions for my study, I do have a few concerns about
> using DryadLINQ. Firstly, they require Windows Server 2008 SP1 servers for
> the master and the slaves, and they need to have the Dryad software
> installed. I'm not so sure how accessible that is to me. Also, I wonder
> where my support would dry up once I've started (and I wouldn't have a
> community to rely on like this one!).
>
> RE: Avram - "that closely relates to Pig". I *think* I meant both in terms
> of an underlying architecture (in Pig's case, Hadoop), syntax that is above
> the level of data allocation to DataNodes, and also the sort of
> functionality Pig provides (basic data processing/manipulation using
> filter/join, although I realise that you can write user defined functions
> to fill the gap). I will indeed have a look at Hive. It will be
> interesting to see the differences between Hive and Pig, bearing in mind
> they have both been merged into the Apache Hadoop software stack, to see
> how much crossover exists between the two. Finally, Cascading looks
> interesting also; I shall try and get an example working, and take it
> from there. Is it anticipated that Cascading will get merged into the
> Hadoop software stack?
>
>
> thanks guys, no doubt I will have a ton of problems/questions that need
> solving when I've tried these out.
>
>
> Rob Stewart
>
>
> 2009/10/7 Dmitriy Ryaboy <dv...@gmail.com>
>
> > Hi Rob,
> >
> > CouchDB is a totally different project with very different goals.
> > Preemptively -- so are Cassandra, Project Voldemort, Tokyo Tyrant, and
> > HBase. They are also different from each other, but that's a long
> > conversation..
> >
> > In what way do you intend to compare the systems -- speed, architecture,
> > parallelization policy?
> >
> > In the Hadoop world, Hive is a system with similar goals to Pig, although
> > it
> > has a somewhat different philosophy.
> > You may also want to check out JAQL.
> >
> > Microsoft has been letting academics get access to its Dryad  system, so
> > you
> > may want to look at their DryadLINQ and SCOPE stuff. I am not sure of the
> > extent MS actually lets you play with their stack, but they seem to be
> > getting more student-researcher-friendly in recent years.
> >
> > -D
> >
> > On Wed, Oct 7, 2009 at 9:18 AM, Rob Stewart <robstewart57@googlemail.com
> > >wrote:
> >
> > > Hello Pig user group !
> > >
> > > OK, here's two things about me:
> > > 1. I'm new to Pig and Hadoop
> > > 2. I'm studying for a Masters in Software Engineering in the UK.
> > > 3. I'm looking to do a comparitive study on probably two distributed
> > > systems
> > > over a cluster network. I have investigated Hadoop, and have deployed
> > > Hadoop
> > > across various virtual Linux systems on this PC I'm using (which was
> > fun!),
> > > and my university has given me permission to use the cluster at
> > university
> > > to deploy Hadoop, which I'm excited about. (They may even use it for
> > future
> > > research, or better still, production processing!).
> > >
> > > Anyway... I have had a look at Pig, and have worked through the various
> > > tutorials, which are very well written, and have these tutorials
> working
> > on
> > > my virtual Hadoop cluster here on this PC, and I assume the same would
> be
> > > the case on the university cluster.
> > >
> > > I am needing another system, as similar as possible to the function and
> > use
> > > of Pig. My supervisor has pointed me in the direction of CouchDB
> (written
> > > in
> > > Erlang) as another tool which potentially could be used for comparison
> > for
> > > my studies. Reading a little about it, there seems no formal process
> for
> > > distributing a CouchDB job however, across a cluster of nodes for
> > parallel
> > > processing. I have contacted the CouchDB mailing list for clarification
> > > about this however.
> > >
> > > So, I write to you guys for four reasons:
> > > 1. To touch base, and say - "hey, I'm hoping to use Pig for a
> comparitive
> > > study for my Masters dissertation - Thanks !!"
> > > 2. To ask, if there is any other solution out there that can be closely
> > > compared to the functionality and use of Pig.
> > > 3. If CouchDB has been benchmarked against Pig before now, where I can
> > find
> > > it, or who can help me with this.
> > > 4. Am I off the mark with these questions? If so, please speak now!
> > >
> > >
> > > thanks,
> > >
> > > Rob Stewart
> > >
> >
>

Re: Using Pig for a comparative Study

Posted by Rob Stewart <ro...@googlemail.com>.

Hi Dmitriy, excellent response, thanks.

I was predominately looking at CouchDB simply for the fact that it's written
in Erlang, which is a scalable, distributable language.  I do realise that
if I were to compare CouchDB with Pig/Hadoop, it would be difficult to argue
that I was indeed comparing like for like.

RE: "in what way do you intend to compare the systems". That is *the*
question.
Speed - Yes, it would be nice to be able to implement the same execution
procedure in two different systems/languages, run it on the same cluster (at
different times!) and compare the time it takes to execute. The variable
here would be the size of data to process.
Architecture - A good one to discuss. Is the required infrastructure
identical on both systems? (I know, for instance, that Dryad and Hadoop have
the "one master and many slaves" architecture, albeit for different roles.)
Parallelization policy - Indeed, at what point does the execution switch from
sequential to parallel, which nodes execute in parallel etc...
Fault Tolerance - This is one I'd be keen to explore. The obvious advantage
in using Pig for my research is that I get fault tolerance for free from
Hadoop. Great! But I want to be able to control failures to analyse the
performance of recovery. I would need to investigate exactly how to create a
fault, other than killing the DataNode service using the Linux kill command.
Answers on the back of a postcard, thanks.
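
To make the speed comparison above a little more concrete, here is a rough
sketch of a timing harness. It is only an illustration: the names
time_command and benchmark are made up for this example, and the placeholder
workload is a Python one-liner standing in for whatever command would really
submit a job over an input of a given size.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def benchmark(make_argv, sizes, repeats=3):
    """Time the command built by make_argv(size) for each input size.

    Returns {size: [elapsed_seconds, ...]} so the spread across repeats is kept.
    """
    return {size: [time_command(make_argv(size)) for _ in range(repeats)]
            for size in sizes}


if __name__ == "__main__":
    # Placeholder workload (an assumption for illustration only): a Python
    # one-liner standing in for a real job submission over an input of that size.
    def placeholder(size):
        return [sys.executable, "-c", f"sum(range({size}))"]

    for size, runs in benchmark(placeholder, [10_000, 100_000]).items():
        print(size, [round(r, 3) for r in runs])
```

Keeping every repeat rather than a single average makes it easier to see how
noisy the cluster is before reading anything into a difference between two
systems.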

I've just had a quick look at JAQL. Wow, good suggestion, the core of the
language offers: filter, transform, group, join, sort and expand. A few of
these are matched in Pig, and JAQL can also read from delimited files, like Pig
does. I will certainly spend time looking into this, and see if I can create
an input file and process it using both JAQL and Pig without any alterations
to the input, whilst generating an identical output file. If so, I'm in
business... This would eliminate the distributed nature of the systems as a
variable (they both use Hadoop) also.
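
One way the "identical output" check could be sketched, assuming both jobs'
output directories have been copied to the local filesystem and contain the
usual Hadoop part-* files. Note this is a looser test than byte-for-byte
identity: it treats two outputs as equal when they hold the same rows,
regardless of row order or how the rows are split across part files. The
function names are made up for this example.

```python
from collections import Counter
from pathlib import Path


def load_rows(output_dir):
    """Collect every line from the part-* files in a job's output directory."""
    rows = Counter()
    for part in sorted(Path(output_dir).glob("part-*")):
        with part.open() as handle:
            rows.update(line.rstrip("\n") for line in handle)
    return rows


def same_output(dir_a, dir_b):
    """True if both directories hold the same rows, ignoring order and file split."""
    return load_rows(dir_a) == load_rows(dir_b)
```

Counting rows as a multiset means duplicate rows still have to match in
number, so a job that silently drops or repeats a record would be caught.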

I had been pointed in the direction of Dryad, and whilst I am, at this
stage, open to suggestions for my study, I do have a few concerns about
using DryadLINQ. Firstly, they require Windows Server 2008 SP1 servers for
the master and the slaves, and they need to have the Dryad software
installed. I'm not so sure on how accessible that is to me. Also, I wonder
where my support would dry up once I've started (and I wouldn't have a
community to rely on like this one!).

RE: Avram - "that closely relates to Pig". I *think* I meant both in terms
of an underlying architecture (in Pig's case, Hadoop), syntax that is above
the level of data allocation to DataNodes, and also the sort of
functionality Pig provides (basic data processing/manipulation using
filter/join although I realise that you can write user defined functions to
fill the gap). I will indeed have a look at Hive. It will be interesting to
see the differences between Hive and Pig, bearing in mind they have both
been merged into the Apache Hadoop software stack, to see how much crossover
exists between the two. Finally, Cascading looks interesting also, I shall
try and get an example working, and take it from there. Is it anticipated
that Cascading will get merged into the Hadoop software stack?

thanks guys, no doubt I will have a ton of problems/questions that need
solving when I've tried these out.

Rob Stewart


Re: Using Pig for a comparative Study

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Hi Rob,

CouchDB is a totally different project with very different goals.
Preemptively -- so are Cassandra, Project Voldemort, Tokyo Tyrant, and
HBase. They are also different from each other, but that's a long
conversation..

In what way do you intend to compare the systems -- speed, architecture,
parallelization policy?

In the Hadoop world, Hive is a system with similar goals to Pig, although it
has a somewhat different philosophy.
You may also want to check out JAQL.

Microsoft has been letting academics get access to its Dryad system, so you
may want to look at their DryadLINQ and SCOPE stuff. I am not sure of the
extent MS actually lets you play with their stack, but they seem to be
getting more student-researcher-friendly in recent years.

-D


RE: Using Pig for a comparative Study

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.

Rob,

>> 2. To ask, if there is any other solution out there that can be
>> closely compared to the functionality and use of Pig.

Hive (http://hadoop.apache.org/hive/), which provides a SQL interface on top
of Hadoop, and JAQL (http://www.jaql.org/), another query language which
also works on Hadoop, are two good candidates.

>> 4. Am I off the mark with these questions? If so, please speak now!

Not at all. It will be great if you could share the parameters that form
the basis for the comparison.

Thanks,
Santhosh
