Posted to user@pig.apache.org by Rob Stewart <ro...@googlemail.com> on 2009/10/30 13:05:24 UTC

Follow Up Questions: PigMix, DataGenerator etc...

Hi there.

As some of you may have read on this mailing list previously, I'm studying
various interfaces with Hadoop, one of those being Pig.

I have five further questions. I am now beginning to think about the design
of my testing (query design, indicators of performance etc...).

1:
I have had a good look at the PigMix benchmark Wiki page, and find it
interesting that a few Pig queries now execute more quickly than the
corresponding Java MapReduce implementation (
http://wiki.apache.org/pig/PigMix ). The following data processing functions
in Pig outperform the Java equivalent:
distinct aggregation
anti join
group application
order by 1 field
order by multiple fields
distinct + join
multi-store


A few questions: Am I able to obtain the actual queries used for the PigMix
benchmarking? And how about obtaining their Java Hadoop equivalent? And,
how, technically, is this high performance achieved? I have read the paper
"A benchmark for Hive, Pig and Hadoop" (Yuntao Jia, Zheng Shao), and they
were using a snapshot of Pig trunk from June 2009, showing Pig executing an
aggregation query and a join query more slowly than Java Hadoop or Hive, but
the aspect that interests me is the number of Map/Reduce tasks created by
Pig and Hive. In what way does this number affect execution
time performance? I have a feeling that Pig produces more Map/Reduce tasks
than other interfaces, which may be beneficial where there is extremely
skewed data. Am I wrong in thinking this, or is there another benefit to
more Map/Reduce tasks? And how does Pig go about splitting a job into this
number of tasks?

2.
This sort of leads onto my next question. I want to, in some way, be able to
test the performance of Pig against others when dealing with a dataset of
extremely skewed data. I have a vague understanding of what this may mean,
but I do need clarity on the definition of skewed data, and the effect this
has on the performance of the Map/Reduce model.

3.
Another question relating to the number of Map/Reduce tasks generated by
Pig: I am interested to see if, despite the fact that Pig, Hive, JAQL
etc... all use the Hadoop fault tolerance technologies, the number
of Map/Reduce tasks has an effect on Hadoop's ability to recover from, say,
a DataNode that fails. I.e. if there are 100 Map jobs, spread across 10
DataNodes, and one DataNode fails, then approximately 10 Map jobs will be
redistributed over the remaining 9 DataNodes. If, however, there were 500
Map jobs over the 10 DataNodes, and one of them fails, then 50 Map jobs will be
reallocated to the remaining 9 DataNodes. Am I to expect a difference in
overall performance between these two scenarios?

4.
Which leads onto... the Pig DataGenerator. Is this what I'm looking for to
generate my data? I am probably looking to generate data that would
typically take 30 minutes to process, and I have a cluster of 10 nodes
available to me. I would happily use some existing tool to create my test
data for analysis, and I just need to know whether the DataGenerator is the
tool for me? ( http://wiki.apache.org/pig/DataGeneratorHadoop )

5.
Finally... What is the best way to analyse the performance of each Pig job
sent to the Hadoop cluster? Obviously, I can simply use the Unix time
command, but are there any real-time performance analysis tools built into
either Pig or Hadoop to monitor performance? This would typically include
the number of Map tasks and the number of reduce tasks over time, and the
behaviour of the Hadoop cluster in the event of a DataNode failure, i.e.
following a failure, the amount of network bandwidth used, the CPU/Memory of
the NameNode during a failure of a DataNode etc....


I know it's a lot, but I really appreciate any assistance, and fully intend
to return my completed paper to this Pig and Hadoop community on its
completion!


Many thanks,


Rob Stewart

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Oct 31, 2009, at 11:22 AM, Rob Stewart wrote:

<snip>
>> Map and reduce parallelism are controlled differently in Hadoop.  Map
>> parallelism is controlled by the InputSplit.  IS determines how  
>> many maps to
>> start and which file blocks to assign to which maps.  In the case  
>> of PigMix,
>> both the MR Java code and the Pig code use some subclass of  
>> FileInputFormat,
>> so the map parallelism is the same in both tests.  I do not know  
>> for sure,
>> but I believe Hive also uses FileInputFormat.
>>
>> Reduce parallelism is set explicitly as part of the job  
>> configuration.  In
>> MapReduce this is done through the Java API.  In Pig it is done  
>> through
>> the PARALLEL command.  In PigMix, we set parallelism for
>> both the
>> same (40 I believe for this data size).
>>
>
> I have a query about this procedure. It will warrant a simple answer I
> assume, but I just need clarity on this. I am wondering how, for  
> example,
> both the MR applications and the Pig programs will react if there  
> are no
> specifications for the number of Map or Reduce jobs. If, let's say,  
> I were a
> programmer writing some Pig scripts where I do not know the skew of  
> the
> data, my first execution of the Pig script would be done without any
> specification of #Mappers or #Reducers. Is it not a more natural  
> examination
> of Pig vs MR apps where both Pig and the MR app have to decide these  
> details
> for themselves? So my question is: Why is it a fundamental  
> requirement that
> the Pig script and the associated MR app be given figures for initial
> Map/Reduce tasks?

You as a Pig Latin script writer never control parallelism of the  
map.  That is controlled by Hadoop's InputFormat class.  The vast  
majority of MR programs written in Java use FileInputFormat, so most  
Java MR programmers don't directly control it either.  FileInputFormat  
by default assigns one HDFS block to one map.

For reducers, if the script does not specify the level of parallelism  
for a given operation then Pig tells Hadoop to use the cluster  
default.  Out of the box the cluster default is 1.
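
To make that concrete, here is a minimal sketch of both sides (assuming
the old-style Hadoop 0.20 JobConf API; MyJob is just a stand-in job
class):

    import org.apache.hadoop.mapred.JobConf;

    // Java MR side: reduce parallelism is set explicitly on the job.
    JobConf conf = new JobConf(MyJob.class);
    conf.setNumReduceTasks(40);  // the figure PigMix uses for its data size

    // Pig Latin side: a PARALLEL clause on a reduce-side operator plays
    // the same role, e.g.
    //   B = GROUP A BY user PARALLEL 40;
    // With no PARALLEL clause, Pig hands Hadoop the cluster default.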

>
>
>>
>> In general, the places where Pig beats MR are due to better
>> algorithms.  The
>> MR code was written assuming a basic level of MR coding and database
>> knowledge.  So, for example, for the order by queries, the MR code
>> achieves a
>> total order by having a single reducer at the end.  Pig has a much  
>> more
>> sophisticated system where it samples the data, determines a  
>> distribution,
>> and then uses multiple reducers while maintaining a total order.   
>> So for
>> large data sets Pig will beat MR for these particular tests.
>>
>
> Sounds very elegant, a really neat solution to skewed data. Is there  
> some
> documentation of this process, as I'd like to include that  
> methodology in my
> report. And then display data results like: "skewed data / execution
> time", where trend lines for Pig, Hive and MR apps are shown. It would
> be nice to show that, as the skew of the data increases, Pig overtakes
> the corresponding MR app in execution performance.

Functional specs for Pig are linked off of http://wiki.apache.org/pig/
There is a spec there for how we handle skew in joins.  I don't
see one on handling skew in order by.

Alan.

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Rob, check out the test cases for how to use Pig embedded in Java;
here's the relevant API:
http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/PigServer.html

Essentially -- you can initialize a new PigServer, register a few
queries, and store results or open an iterator on a relation.
Naturally, you could do this in a Java loop.
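
A rough sketch of that loop for the getAllChildren() case you describe
below (illustrative only, not tested: I'm assuming a tab-delimited
parent/child file 'family.txt', eliding the step that writes the current
frontier back out to 'parents.txt' between passes, and omitting
exception handling):

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    Set<String> found = new HashSet<String>();
    Set<String> frontier = new HashSet<String>();
    frontier.add("John");
    while (!frontier.isEmpty()) {
        // write 'frontier' to HDFS as parents.txt here (elided), then:
        pig.registerQuery("edges = LOAD 'family.txt' AS (parent:chararray, child:chararray);");
        pig.registerQuery("parents = LOAD 'parents.txt' AS (name:chararray);");
        pig.registerQuery("kids = JOIN edges BY parent, parents BY name;");
        Set<String> next = new HashSet<String>();
        Iterator<Tuple> it = pig.openIterator("kids");  // runs the MR job
        while (it.hasNext()) {
            String child = (String) it.next().get(1);   // the child field
            if (found.add(child)) next.add(child);      // only new names recurse
        }
        frontier = next;  // empty once no new children turn up
    }

Each pass is one MR job; the termination test lives in the Java driver
rather than in Pig Latin.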

-D

On Sun, Nov 8, 2009 at 10:08 AM, Rob Stewart
<ro...@googlemail.com> wrote:
> Hi, thanks for the definition of Turing completeness in Pig and Hive. I
> understand that SQL is not Turing complete, and so, by definition, neither
> is Hive. And you're right, I don't see any looping functionality within Pig
> "out of the box".
>
> Can I give you the simplest of examples? See this sample of data:
> Parent       Child
> --------       --------
> John          Harry
> Steven      Paul
> John          Jamie
> John          Rob
> James       Grant
> Rob           Gordon
> Rob           Tom
>
> Imagine that this dataset contains many millions of rows, and the above is
> mixed randomly within them. I'd like to design, say, a program that, given
> the name of a person, returns every person beneath them in the family tree.
> See http://www.linuxsoftwareblog.com/Hadoop/family.jpeg
>
> For Java Hadoop, I could create a program that iterated over a method, say
> getAllChildren(). For all results, call this function again, stopping
> on each branch when no children are found. Each time the method is
> called, I would save the children in an array, and return this array when
> the recursion is exhausted. e.g.
>> hadoop jar GetChildren.jar john
>
> returns: [Harry, Jamie, Rob, Gordon, Tom]
>
> So, Alan, you're correct: MapReduce on its own does not provide me with
> loops; I have to wrap a loop around this MapReduce method "getAllChildren()"
> to get all children of John. When you say that I would have to wrap Java
> around Pig to simulate Turing completeness, what exactly do you mean? Are
> there Pig Java classes that I can make use of to implement a Pig version of
> "getAllChildren()"? Or do you mean to create a UDF?
>
> Is there any comment to be made on the similarity between SQL and MapReduce,
> in that they share a common limitation: neither can recurse down the above
> family tree in one pass to give me all results (where the depth of the
> tree is not known)?
>
> Rob Stewart
>
>
>
> 2009/11/2 Alan Gates <ga...@yahoo-inc.com>
>
>>
>> On Oct 31, 2009, at 1:04 PM, Rob Stewart wrote:
>>
>>  2009/10/31 Santhosh Srinivasan <sm...@yahoo-inc.com>
>>>
>>>  Misc question: Do you anticipate that Pig will be compatible with
>>>>>
>>>> Hadoop 0.20 ?
>>>>
>>>> The Hadoop 0.20 compatible version, Pig 0.5.0,  will be released
>>>> shortly. The release got the required votes.
>>>>
>>>>
>>> thanks, I will watch out for that, and anticipate using 0.5 for my study.
>>>
>>>
>>>>  Finally, I am correct to assume that Pig is not Turing Complete? I am
>>>>>
>>>> not clear on this. SQL is not Turing Complete, whereas Java is. So does
>>>> that make, Hive or Pig, for example Turing complete, or not?
>>>>
>>>> Short answer: Hive and Pig are not Turing complete. Turing completeness
>>>> is a property of a particular language, not of the language implementing
>>>> the language in question. Since Hive is SQL(-like), it's not Turing
>>>> complete. Until Pig supports loops and conditional statements, Pig will
>>>> not be Turing complete.
>>>>
>>>>
>>> OK, as I thought. Thanks. I assume therefore that, as Java is Turing
>>> complete, I would be able to illustrate this difference with a certain
>>> query design that requires Turing completeness to execute?
>>>
>>
>> The common case where we see users wanting Turing Completeness in Pig is
>> for iterative algorithms that need their answer to converge.  You can't do
>> this in a single pass of MR either.  You can write Java code around either
>> Pig or MR to iterate until your data reaches convergence.
>>
>>
>>>
>>>
>>>>
>> Alan.
>>
>

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Nov 8, 2009, at 7:08 AM, Rob Stewart wrote:

> <snip>
>
> So, Alan, you're correct: MapReduce on its own does not provide me
> with loops; I have to wrap a loop around this MapReduce method
> "getAllChildren()" to get all children of John. When you say that I
> would have to wrap Java around Pig to simulate Turing completeness,
> what exactly do you mean? Are there Pig Java classes that I can make
> use of to implement a Pig version of "getAllChildren()"? Or do you
> mean to create a UDF?

As Dmitriy said, I wasn't thinking of a UDF as much as writing Java
code that called PigServer.registerQuery and openIterator multiple  
times until you have found no new children.

>
> Is there any comment to be made on the similarity between SQL and
> MapReduce, in that they share a common limitation: neither can recurse
> down the above family tree in one pass to give me all results (where
> the depth of the tree is not known)?

Just that none of these three approaches (MapReduce, Pig Latin, and  
SQL) provide the necessary primitives to determine convergence.  In  
all three cases you are forced to write the test and loop  
functionality outside of the main data processing.  MR will never  
provide the primitives, because it is by definition a predefined  
operation controlled from the outside.  SQL can do it in constructs  
like Oracle's PL/SQL.  In a similar way Pig Latin could be extended to  
add loops and branches, but it is unclear at this point if that is  
what it should do.  Adding these constructs to Pig Latin would take it  
from a data flow language to a data processing language.  At least in  
the short term it is much simpler to depend on outside languages that  
already provide this functionality.

Alan.

>
> Rob Stewart
>
>
>


Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Rob Stewart <ro...@googlemail.com>.
Hi, thanks for the definition of Turing completeness in Pig and Hive. I
understand that SQL is not Turing complete, and so, by definition, neither
is Hive. And you're right, I don't see any looping functionality within Pig
"out of the box".

Can I give you the simplest of examples? See this sample of data:
Parent       Child
--------       --------
John          Harry
Steven      Paul
John          Jamie
John          Rob
James       Grant
Rob           Gordon
Rob           Tom

Imagine that this dataset contains many millions of rows, and the above is
mixed randomly within them. I'd like to design, say, a program that, given
the name of a person, returns every person beneath them in the family tree.
See http://www.linuxsoftwareblog.com/Hadoop/family.jpeg

For Java Hadoop, I could create a program that iterated over a method, say
getAllChildren(). For all results, call this function again, stopping
on each branch when no children are found. Each time the method is
called, I would save the children in an array, and return this array when
the recursion is exhausted. e.g.
> hadoop jar GetChildren.jar john

returns: [Harry, Jamie, Rob, Gordon, Tom]

So, Alan, you're correct: MapReduce on its own does not provide me with
loops; I have to wrap a loop around this MapReduce method "getAllChildren()"
to get all children of John. When you say that I would have to wrap Java
around Pig to simulate Turing completeness, what exactly do you mean? Are
there Pig Java classes that I can make use of to implement a Pig version of
"getAllChildren()"? Or do you mean to create a UDF?

Is there any comment to be made on the similarity between SQL and MapReduce,
in that they share a common limitation: neither can recurse down the above
family tree in one pass to give me all results (where the depth of the
tree is not known)?

Rob Stewart



2009/11/2 Alan Gates <ga...@yahoo-inc.com>

>
> On Oct 31, 2009, at 1:04 PM, Rob Stewart wrote:
>
>  2009/10/31 Santhosh Srinivasan <sm...@yahoo-inc.com>
>>
>>  Misc question: Do you anticipate that Pig will be compatible with
>>>>
>>> Hadoop 0.20 ?
>>>
>>> The Hadoop 0.20 compatible version, Pig 0.5.0,  will be released
>>> shortly. The release got the required votes.
>>>
>>>
>> thanks, I will watch out for that, and anticipate using 0.5 for my study.
>>
>>
>>>  Finally, I am correct to assume that Pig is not Turing Complete? I am
>>>>
>>> not clear on this. SQL is not Turing Complete, whereas Java is. So does
>>> that make, Hive or Pig, for example Turing complete, or not?
>>>
>>> Short answer: Hive and Pig are not Turing complete. Turing completeness
>>> is a property of a particular language, not of the language implementing
>>> the language in question. Since Hive is SQL(-like), it's not Turing
>>> complete. Until Pig supports loops and conditional statements, Pig will
>>> not be Turing complete.
>>>
>>>
>> OK, as I thought. Thanks. I assume therefore that, as Java is Turing
>> complete, I would be able to illustrate this difference with a certain
>> query design that requires Turing completeness to execute?
>>
>
> The common case where we see users wanting Turing Completeness in Pig is
> for iterative algorithms that need their answer to converge.  You can't do
> this in a single pass of MR either.  You can write Java code around either
> Pig or MR to iterate until your data reaches convergence.
>
>
>>
>>
>>>
> Alan.
>

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Oct 31, 2009, at 1:04 PM, Rob Stewart wrote:

> 2009/10/31 Santhosh Srinivasan <sm...@yahoo-inc.com>
>
>>> Misc question: Do you anticipate that Pig will be compatible with
>> Hadoop 0.20 ?
>>
>> The Hadoop 0.20 compatible version, Pig 0.5.0,  will be released
>> shortly. The release got the required votes.
>>
>
> thanks, I will watch out for that, and anticipate using 0.5 for my  
> study.
>
>>
>>> Finally, I am correct to assume that Pig is not Turing Complete? I  
>>> am
>> not clear on this. SQL is not Turing Complete, whereas Java is. So  
>> does
>> that make, Hive or Pig, for example Turing complete, or not?
>>
>> Short answer: Hive and Pig are not Turing complete. Turing completeness
>> is a property of a particular language, not of the language implementing
>> the language in question. Since Hive is SQL(-like), it's not Turing
>> complete. Until Pig supports loops and conditional statements, Pig
>> will not be Turing complete.
>>
>
> OK, as I thought. Thanks. I assume therefore that, as Java is Turing
> complete, I would be able to illustrate this difference with a
> certain query design that requires Turing completeness to execute?

The common case where we see users wanting Turing Completeness in Pig  
is for iterative algorithms that need their answer to converge.  You  
can't do this in a single pass of MR either.  You can write Java code  
around either Pig or MR to iterate until your data reaches convergence.
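
For instance, a sketch of such a driver loop (illustrative only; the
data layout, the outlier-trimming pass, and the countRecords() helper
are all made up for the example):

    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    // Repeatedly trim values above twice the mean until the row count
    // stops changing.  Pig runs each pass as MR jobs; the convergence
    // test itself lives out here in Java.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    String input = "data/pass0";           // starting data: (id, val)
    long previous = Long.MAX_VALUE;
    for (int pass = 1; ; pass++) {
        pig.registerQuery("A = LOAD '" + input + "' AS (id:chararray, val:double);");
        pig.registerQuery("G = GROUP A ALL;");
        pig.registerQuery("M = FOREACH G GENERATE AVG(A.val);");
        Iterator<Tuple> it = pig.openIterator("M");
        double mean = (Double) it.next().get(0);
        pig.registerQuery("B = FILTER A BY val <= " + (2.0 * mean) + ";");
        String output = "data/pass" + pass;
        pig.store("B", output);
        long count = countRecords(pig, output);  // LOAD + GROUP ALL + COUNT
        if (count == previous) break;            // nothing dropped: converged
        previous = count;
        input = output;
    }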

>
>
>>

Alan.

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Rob Stewart <ro...@googlemail.com>.
2009/10/31 Santhosh Srinivasan <sm...@yahoo-inc.com>

> > Misc question: Do you anticipate that Pig will be compatible with
> Hadoop 0.20 ?
>
> The Hadoop 0.20 compatible version, Pig 0.5.0,  will be released
> shortly. The release got the required votes.
>

thanks, I will watch out for that, and anticipate using 0.5 for my study.

>
> > Finally, I am correct to assume that Pig is not Turing Complete? I am
> not clear on this. SQL is not Turing Complete, whereas Java is. So does
> that make, Hive or Pig, for example Turing complete, or not?
>
> Short answer: Hive and Pig are not Turing complete. Turing completeness
> is a property of a particular language, not of the language implementing
> the language in question. Since Hive is SQL(-like), it's not Turing
> complete. Until Pig supports loops and conditional statements, Pig will
> not be Turing complete.
>

OK, as I thought. Thanks. I assume therefore that, as Java is Turing
complete, I would be able to illustrate this difference with a certain query
design that requires Turing completeness to execute?


>
> Santhosh
>
> -----Original Message-----
> From: Rob Stewart [mailto:robstewart57@googlemail.com]
> Sent: Saturday, October 31, 2009 11:22 AM
> To: pig-user@hadoop.apache.org
> Subject: Re: Follow Up Questions: PigMix, DataGenerator etc...
>
> Alan,
>
> Thanks for getting in touch. I appreciate your time, given that it's
> clear you're busy popping up in Pig discussion videos on Vimeo and
> YouTube just now. See my responses below.
>
> I intend to get a good feel for the data generation, and to see, first of
> all, how easily the various interfaces (Pig, JAQL, etc.) can
> plug into the same file structures, and secondly, how easily and fairly
> I would be able to port my queries from one interface to the next.
>
> Misc question: Do you anticipate that Pig will be compatible with Hadoop
> 0.20 ?
>
> Finally, I am correct to assume that Pig is not Turing Complete? I am
> not clear on this. SQL is not Turing Complete, whereas Java is. So does
> that make, Hive or Pig, for example Turing complete, or not?
>
> Again, see my responses below, and thanks again.
>
>
> Rob Stewart
>
>
>
> 2009/10/30 Alan Gates <ga...@yahoo-inc.com>
>
> >
> > On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:
> >
> >  Hi there.
> >>
> >> As some of you may have read on this mailing list previously, I'm
> >> studying various interfaces with Hadoop, one of those being Pig.
> >>
> >> I have five further questions. I am now beginning to think about the
>
> >> design of my testing (query design, indicators of performance
> >> etc...).
> >>
> >> 1:
> >> I have had a good look at the PigMix benchmark Wiki page, and find it
>
> >> interesting that a few Pig queries now execute more quickly than the
> >> corresponding Java MapReduce implementation (
> >> http://wiki.apache.org/pig/PigMix ). The following data processing
> >> functions in Pig outperform the Java equivalent:
> >> distinct aggregation
> >> anti join
> >> group application
> >> order by 1 field
> >> order by multiple fields
> >> distinct + join
> >> multi-store
> >>
> >>
> >> A few questions: Am I able to obtain the actual queries used for the
> >> PigMix benchmarking? And how about obtaining their Java Hadoop
> >> equivalent?
> >>
> >
> > https://issues.apache.org/jira/browse/PIG-200 has the code.  The
> > original perf.patch has both the MR Java code and the Pig Latin
> > scripts.  The data generator is also in this patch.
> >
> >
> I will check that code out. thanks.
>
>
> >
> >  And,
> >> how, technically, is this high performance achieved? I have read the
> >> paper "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia, Zheng
> >> Shao), and they were using a snapshot of Pig trunk from June 2009,
> >> showing Pig executing an aggregation query and a join query more
> >> slowly than Java Hadoop or Hive, but the aspect that interests me is
> >> the number of Map/Reduce tasks created by Pig and Hive. In what way
> >> does this number affect execution time performance? I
> >> have a feeling that Pig produces more Map/Reduce tasks than other
> >> interfaces, which may be beneficial where there is extremely skewed
> >> data. Am I wrong in thinking this, or is there another benefit to
> >> more Map/Reduce tasks? And how does Pig go about splitting a job into
> >> this number of tasks?
> >>
> >
> > Map and reduce parallelism are controlled differently in Hadoop.  Map
> > parallelism is controlled by the InputSplit.  IS determines how many
> > maps to start and which file blocks to assign to which maps.  In the
> > case of PigMix, both the MR Java code and the Pig code use some
> > subclass of FileInputFormat, so the map parallelism is the same in
> > both tests.  I do not know for sure, but I believe Hive also uses
> FileInputFormat.
> >
> > Reduce parallelism is set explicitly as part of the job configuration.
>
> > In MapReduce this is done through the Java API.  In Pig it is done
> > through the PARALLEL command.  In PigMix, we set parallelism
> > for both the same (40 I believe for this data size).
> >
>
> I have a query about this procedure. It will warrant a simple answer I
> assume, but I just need clarity on this. I am wondering how, for
> example, both the MR applications and the Pig programs will react if
> there are no specifications for the number of Map or Reduce jobs. If,
> let's say, I were a programmer writing some Pig scripts where I do not
> know the skew of the data, my first execution of the Pig script would be
> done without any specification of #Mappers or #Reducers. Is it not a
> more natural examination of Pig vs MR apps where both Pig and the MR app
> have to decide these details for themselves? So my question is: Why is
> it a fundamental requirement that the Pig script and the associated MR
> app be given figures for initial Map/Reduce tasks?
>
>
> >
> > In general, the places where Pig beats MR are due to better algorithms.
>
> > The MR code was written assuming a basic level of MR coding and
> > database knowledge.  So, for example, for the order by queries, the MR code
>
> > achieves a total order by having a single reducer at the end.  Pig has
>
> > a much more sophisticated system where it samples the data, determines
>
> > a distribution, and then uses multiple reducers while maintaining a
> > total order.  So for large data sets Pig will beat MR for these
> particular tests.
> >
>
> Sounds very elegant, a really neat solution to skewed data. Is there
> some documentation of this process, as I'd like to include that
> methodology in my report. And then display data results like: "skewed
> data / execution time", where trend lines for Pig, Hive and MR apps are
> shown. It would be nice to show that, as the skew of the data increases, Pig
> overtakes the corresponding MR app in execution performance.
>
>
> >
> > It is, by definition, always possible to write MR Java code as fast as
>
> > Pig code, since Pig is implemented over MR.  But that isn't what
> > PigMix is testing.  PigMix aims to test how fast code is for what we
> > guess is a typical programmer choosing between MR and Pig.  If you
> > want instead to test the best possible MR against the best possible
> > Pig Latin, the MR code in these tests should be rewritten.
> >
> >
> >
> >> 2.
> >> This sort of leads onto my next question. I want to, in some way, be
> >> able to test the performance of Pig against others when dealing with
> >> a dataset of extremely skewed data. I have a vague understanding of
> >> what this may mean, but I do need clarity on the definition of skewed
>
> >> data, and the effect this has on the performance on the Map/Reduce
> >> model.
> >>
> >
> > In practice we see that almost all data we deal with at Yahoo is power
>
> > law distributed.  Hence we built most of the PigMix data that way as
> > well.  We used zipf as a definition of skewed, as it turned out to be
> > a reasonably close match to our data.
> >
> > The interesting situation, for us anyway, is when the data skews
> > enough that it is no longer possible to process a single key in memory
>
> > on a single node.  Once you have to spill parts of the data to disk,
> > performance suffers badly.  For sufficiently large data sets (10G or
> > more) zipf distribution meets this criterion.
> >
> > Pig has quite a bit of processing dedicated to dealing with skew
> > issues gracefully, since that is one of the weak points of Map Reduce.
>
> > As mentioned above, order by samples the data and distributes skewed
> > keys across multiple reducers (since for a total ordering there is no
> > need to collect all keys onto one reducer).  Pig has a skewed join
> > that also splits skewed keys across multiple reducers.  For group by
> > (where it truly can't split keys) Pig uses the combiner and the
> > upcoming accumulator interface (see
> https://issues.apache.org/jira/browse/PIG-979 ).
> >
> >
> >
> >> 3.
> >> Another question relating to the number of Map/Reduce tasks
> >> generated by Pig: I am interested to see if, despite the fact that
> >> Pig, Hive, JAQL etc... all use the Hadoop fault tolerance
> >> technologies, whether the number of Map/Reduce tasks has an effect on
>
> >> Hadoop's ability to recover, from say, a DataNode that fails. I.e. If
>
> >> there are 100 Map jobs, spread across 10 DataNodes, and one DataNode
> >> fails, then approximately 10 Map jobs will be redistributed over the
> >> remaining 9 DataNodes. If, however, there were 500 Map jobs over the
> >> 10 DataNodes, one of them fails, then 50 Map jobs will be reallocated
>
> >> to the remaining 9 DataNodes. Am I to expect a difference in overall
> >> performance between these two scenarios?
> >>
> >
> > In Pig's case (and I believe in Hive's and JAQL's) this is all handled
>
> > by Map Reduce.  So I would direct questions on this to
> > mapreduce-user@hadoop.apache.org
> >
> >
> >
> >> 4.
> >> Which leads onto... the Pig DataGenerator. Is this what I'm looking
> >> for to generate my data? I am probably looking to generate data that
> >> would typically take 30 minutes to process, and I have a cluster of
> >> 10 nodes available to me. I would happily use some existing tool to
> >> create my test data for analysis, and I just need to know whether the
>
> >> DataGenerator is the tool for me? (
> >> http://wiki.apache.org/pig/DataGeneratorHadoop )
> >>
> >
> > The tool we used to generate data for the original PigMix is attached
> > to that patch referenced above.  The downfall with the original tool
> > is that it takes about 30 hours to generate the amount of data needed
> > for PigMix (which runs for about 5 minutes on an older 10 node
> > cluster, so you'd need significantly more) because it's single
> > threaded.  The DataGenerator is an attempt to convert that original
> > tool to work in MR so it can go much faster.  When I've played with
> > it, it does create skewed data, but it does not create the very long
> > tail of data that the original tool does.  If the very long tail is
> > not important to you then it may be a good choice.  It may also be
> > possible to use the tool to create the long tail and I just configured
> it incorrectly.
> >
> >
> >
> >> 5.
> >> Finally... What is the best way to analyse the performance of each
> >> Pig job sent to the Hadoop cluster? Obviously, I can simply use the
> >> Unix time command, but are there any real-time performance analysis
> >> tools built into either Pig or Hadoop to monitor performance? This
> >> would typically include the number of Map tasks and the number of
> >> reduce tasks over time, and the behaviour of the Hadoop cluster in
> the event of a DataNode failure, i.e.
> >> following a failure, the amount of network bandwidth used, the
> >> CPU/Memory of the NameNode during a failure of a DataNode etc....
> >>
> >
> > I generally use Hadoop's GUI to monitor and find historic information
> > about my jobs.  The same information can be gleaned from the Hadoop
> > logs.  You might take a look at the Chukwa project, which provides a
> > parser for Hadoop logs and presents it all in a graphical format.
> >
> >
> OK, so by this, do you mean that you use the web interface to view:
> MapReduce tracker, task trackers, and the HDFS name node? I've had a
> look at the Chukwa project, and I may be mistaken, but to me it looks
> like a bit of a beast to configure, and becomes more useful as you
> increase the number of nodes in the cluster. The cluster I have
> available to me is 10 nodes. I will have a good look at the Hadoop logs
> generated by each of the nodes to see if that would suffice.
>
>
> >
> >
> >>
> >> I know it's a lot, but I really appreciate any assistance, and fully
> >> intend to return my completed paper to this Pig and Hadoop
> >> community on its completion!
> >>
> >>
> >> Many thanks,
> >>
> >>
> >> Rob Stewart
> >>
> >
> > Alan.
> >
> >
>

RE: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
> Misc question: Do you anticipate that Pig will be compatible with
Hadoop 0.20 ?

The Hadoop 0.20 compatible version, Pig 0.5.0,  will be released
shortly. The release got the required votes.

> Finally, I am correct to assume that Pig is not Turing Complete? I am
not clear on this. SQL is not Turing Complete, whereas Java is. So does
that make, Hive or Pig, for example Turing complete, or not?

Short answer: Hive and Pig are not Turing complete. Turing completeness
is a property of a particular language, not of the language implementing
the language in question. Since Hive is SQL(-like), it's not Turing
complete. Until Pig supports loops and conditional statements, Pig will
not be Turing complete.

Santhosh

-----Original Message-----
From: Rob Stewart [mailto:robstewart57@googlemail.com] 
Sent: Saturday, October 31, 2009 11:22 AM
To: pig-user@hadoop.apache.org
Subject: Re: Follow Up Questions: PigMix, DataGenerator etc...

Alan,

Thanks for getting in touch. I appreciate your time, given that it's
clear you're busy popping up in Pig discussion videos on Vimeo and
YouTube just now. See my responses below.

I intend to get a good feel for the data generation, and to see, first of
all, how easily the various interfaces (Pig, JAQL, etc.) can
plug into the same file structures, and secondly, how easily and fairly
I would be able to port my queries from one interface to the next.

Misc question: Do you anticipate that Pig will be compatible with Hadoop
0.20 ?

Finally, I am correct to assume that Pig is not Turing Complete? I am
not clear on this. SQL is not Turing Complete, whereas Java is. So does
that make, Hive or Pig, for example Turing complete, or not?

Again, see my responses below, and thanks again.


Rob Stewart



2009/10/30 Alan Gates <ga...@yahoo-inc.com>

>
> On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:
>
>  Hi there.
>>
>> As some of you may have read on this mailing list previously, I'm 
>> studying various interfaces with Hadoop, one of those being Pig.
>>
>> I have five further questions. I am now beginning to think about the

>> design of my testing (query design, indicators of performance 
>> etc...).
>>
>> 1:
>> I have had a good look at the PigMix benchmark Wiki page, and find it

>> interesting that a few Pig queries now execute more quickly than the 
>> corresponding Java MapReduce implementation (
>> http://wiki.apache.org/pig/PigMix ). The following data processing 
>> functions in Pig outperform the Java equivalent:
>> distinct aggregation
>> anti join
>> group application
>> order by 1 field
>> order by multiple fields
>> distinct + join
>> multi-store
>>
>>
>> A few questions: Am I able to obtain the actual queries used for the 
>> PigMix benchmarking? And how about obtaining their Java Hadoop 
>> equivalent?
>>
>
> https://issues.apache.org/jira/browse/PIG-200 has the code.  The 
> original perf.patch has both the MR Java code and the Pig Latin 
> scripts.  The data generator is also in this patch.
>
>
I will check that code out. thanks.


>
>  And,
>> how, technically, is this high performance achieved? I have read the 
>> paper "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia, Zheng 
>> Shao), and they were using a snapshot of Pig trunk from June 2009, 
>> showing Pig executing an aggregation query and a join query more 
>> slowly than Java Hadoop or Hive, but the aspect that interests me is 
>> the number of Map/Reduce tasks created by Pig and Hive. In what way 
>> does this number affect execution time performance? I
>> have a feeling that Pig produces more Map/Reduce tasks than other
>> interfaces, which may be beneficial where there is extremely skewed
>> data. Am I wrong in thinking this, or is there another benefit to
>> more Map/Reduce tasks? And how does Pig go about splitting a job into
>> this number of tasks?
>>
>
> Map and reduce parallelism are controlled differently in Hadoop.  Map 
> parallelism is controlled by the InputSplit.  IS determines how many 
> maps to start and which file blocks to assign to which maps.  In the 
> case of PigMix, both the MR Java code and the Pig code use some 
> subclass of FileInputFormat, so the map parallelism is the same in 
> both tests.  I do not know for sure, but I believe Hive also uses
FileInputFormat.
>
> Reduce parallelism is set explicitly as part of the job configuration.

> In MapReduce this is done through the Java API.  In Pig it is done 
> through the PARALLEL command.  In PigMix, we set parallelism
> for both the same (40 I believe for this data size).
>

I have a query about this procedure. It will warrant a simple answer I
assume, but I just need clarity on this. I am wondering how, for
example, both the MR applications and the Pig programs will react if
there are no specifications for the number of Map or Reduce jobs. If,
let's say, I were a programmer writing some Pig scripts where I do not
know the skew of the data, my first execution of the Pig script would be
done without any specification of #Mappers or #Reducers. Is it not a
more natural examination of Pig vs MR apps where both Pig and the MR app
have to decide these details for themselves? So my question is: Why is
it a fundamental requirement that the Pig script and the associated MR
app be given figures for initial Map/Reduce tasks?


>
> In general, the places where Pig beats MR are due to better algorithms.

> The MR code was written assuming a basic level of MR coding and 
> database knowledge.  So, for example, for the order by queries, the MR code

> achieves a total order by having a single reducer at the end.  Pig has

> a much more sophisticated system where it samples the data, determines

> a distribution, and then uses multiple reducers while maintaining a 
> total order.  So for large data sets Pig will beat MR for these
particular tests.
>

Sounds very elegant, a really neat solution to skewed data. Is there
some documentation of this process, as I'd like to include that
methodology in my report. And then display data results like: "skewed
data / execution time", where trend lines for Pig, Hive and MR apps are
shown. It would be nice to show that, as the skew of the data increases, Pig
overtakes the corresponding MR app in execution performance.


>
> It is, by definition, always possible to write MR Java code as fast as

> Pig code, since Pig is implemented over MR.  But that isn't what 
> PigMix is testing.  PigMix aims to test how fast code is for what we
> guess is a typical programmer choosing between MR and Pig.  If you 
> want instead to test the best possible MR against the best possible 
> Pig Latin, the MR code in these tests should be rewritten.
>
>
>
>> 2.
>> This sort of leads onto my next question. I want to, in some way, be 
>> able to test the performance of Pig against others when dealing with 
>> a dataset of extremely skewed data. I have a vague understanding of 
>> what this may mean, but I do need clarity on the definition of skewed

>> data, and the effect this has on the performance of the Map/Reduce
>> model.
>>
>
> In practice we see that almost all data we deal with at Yahoo is power

> law distributed.  Hence we built most of the PigMix data that way as 
> well.  We used zipf as a definition of skewed, as it turned out to be 
> a reasonably close match to our data.
>
> The interesting situation, for us anyway, is when the data skews 
> enough that it is no longer possible to process a single key in memory

> on a single node.  Once you have to spill parts of the data to disk, 
> performance suffers badly.  For sufficiently large data sets (10G or 
> more) zipf distribution meets this criterion.
>
> Pig has quite a bit of processing dedicated to dealing with skew 
> issues gracefully, since that is one of the weak points of Map Reduce.

> As mentioned above, order by samples the data and distributes skewed 
> keys across multiple reducers (since for a total ordering there is no 
> need to collect all keys onto one reducer).  Pig has a skewed join 
> that also splits skewed keys across multiple reducers.  For group by 
> (where it truly can't split keys) Pig uses the combiner and the
> upcoming accumulator interface (see
https://issues.apache.org/jira/browse/PIG-979 ).
>
>
>
>> 3.
>> Another question relating to the number of Map/Reduce tasks
>> generated by Pig: I am interested to see if, despite the fact that
>> Pig, Hive, JAQL etc... all use the Hadoop fault tolerance 
>> technologies, whether the number of Map/Reduce tasks has an effect on

>> Hadoop's ability to recover, from say, a DataNode that fails. I.e. If

>> there are 100 Map jobs, spread across 10 DataNodes, and one DataNode 
>> fails, then approximately 10 Map jobs will be redistributed over the 
>> remaining 9 DataNodes. If, however, there were 500 Map jobs over the 
>> 10 DataNodes, one of them fails, then 50 Map jobs will be reallocated

>> to the remaining 9 DataNodes. Am I to expect a difference in overall
>> performance between these two scenarios?
>>
>
> In Pig's case (and I believe in Hive's and JAQL's) this is all handled

> by Map Reduce.  So I would direct questions on this to 
> mapreduce-user@hadoop.apache.org
>
>
>
>> 4.
>> Which leads onto... the Pig DataGenerator. Is this what I'm looking 
>> for to generate my data? I am probably looking to generate data that 
>> would typically take 30 minutes to process, and I have a cluster of
>> 10 nodes available to me. I would happily use some existing tool to 
>> create my test data for analysis, and I just need to know whether the

>> DataGenerator is the tool for me? ( 
>> http://wiki.apache.org/pig/DataGeneratorHadoop )
>>
>
> The tool we used to generate data for the original PigMix is attached 
> to that patch referenced above.  The downfall with the original tool 
> is that it takes about 30 hours to generate the amount of data needed 
> for PigMix (which runs for about 5 minutes on an older 10 node 
> cluster, so you'd need significantly more) because it's single 
> threaded.  The DataGenerator is an attempt to convert that original 
> tool to work in MR so it can go much faster.  When I've played with 
> it, it does create skewed data, but it does not create the very long 
> tail of data that the original tool does.  If the very long tail is 
> not important to you then it may be a good choice.  It may also be
> possible to use the tool to create the long tail and I just configured
it incorrectly.
>
>
>
>> 5.
>> Finally... What is the best way to analyse the performance of each 
>> Pig job sent to the Hadoop cluster? Obviously, I can simply use the 
>> Unix time command, but are there any real-time performance analysis 
>> tools built into either Pig or Hadoop to monitor performance? This
>> would typically include the number of Map tasks and the number of 
>> reduce tasks over time, and the behaviour of the Hadoop cluster in
the event of a DataNode failure, i.e.
>> following a failure, the amount of network bandwidth used, the 
>> CPU/Memory of the NameNode during a failure of a DataNode etc....
>>
>
> I generally use Hadoop's GUI to monitor and find historic information 
> about my jobs.  The same information can be gleaned from the Hadoop 
> logs.  You might take a look at the Chukwa project, which provides a 
> parser for Hadoop logs and presents it all in a graphical format.
>
>
OK, so by this, do you mean that you use the web interface to view:
MapReduce tracker, task trackers, and the HDFS name node? I've had a
look at the Chukwa project, and I may be mistaken, but to me it looks
like a bit of a beast to configure, and becomes more useful as you
increase the number of nodes in the cluster. The cluster I have
available to me is 10 nodes. I will have a good look at the Hadoop logs
generated by each of the nodes to see if that would suffice.


>
>
>>
>> I know it's a lot, but I really appreciate any assistance, and fully
>> intend to return my completed paper to this Pig and Hadoop
>> community on its completion!
>>
>>
>> Many thanks,
>>
>>
>> Rob Stewart
>>
>
> Alan.
>
>

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Rob Stewart <ro...@googlemail.com>.
Alan,

Thanks for getting in touch. I appreciate your time, given that it's clear
you're busy popping up in Pig discussion videos on Vimeo and YouTube just
now. See my responses below.

I intend to get a good feel for the data generation, and to see, first of
all, how easily the various interfaces (Pig, JAQL, etc.) can plug
into the same file structures, and secondly, how easily and fairly I would
be able to port my queries from one interface to the next.

Misc question: Do you anticipate that Pig will be compatible with Hadoop
0.20 ?

Finally, I am correct to assume that Pig is not Turing Complete? I am not
clear on this. SQL is not Turing Complete, whereas Java is. So does that
make, Hive or Pig, for example Turing complete, or not?

Again, see my responses below, and thanks again.


Rob Stewart



2009/10/30 Alan Gates <ga...@yahoo-inc.com>

>
> On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:
>
>  Hi there.
>>
>> As some of you may have read on this mailing list previously, I'm studying
>> various interfaces with Hadoop, one of those being Pig.
>>
>> I have five further questions. I am now beginning to think about the
>> design
>> of my testing (query design, indicators of performance etc...).
>>
>> 1:
>> I have had a good look at the PigMix benchmark Wiki page, and find it
>> interesting that a few Pig queries now execute more quickly than the
>> corresponding Java MapReduce implementation (
>> http://wiki.apache.org/pig/PigMix ). The following data processing
>> functions
>> in Pig outperform the Java equivalent:
>> distinct aggregation
>> anti join
>> group application
>> order by 1 field
>> order by multiple fields
>> distinct + join
>> multi-store
>>
>>
>> A few questions: Am I able to obtain the actual queries used for the
>> PigMix
>> benchmarking? And how about obtaining their Java Hadoop equivalent?
>>
>
> https://issues.apache.org/jira/browse/PIG-200 has the code.  The original
> perf.patch has both the MR Java code and the Pig Latin scripts.  The data
> generator is also in this patch.
>
>
I will check that code out. thanks.


>
>  And,
>> how, technically, is this high performance achieved? I have read the paper
>> "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia, Zheng Shao), and they
>> were using a snapshot of Pig trunk from June 2009, showing Pig executing
>> an
>> aggregation query and a join query more slowly than Java Hadoop or Hive,
>> but
>> the aspect that interests me is the number of Map/Reduce tasks created by
>> Pig and Hive. In what way does this number affect execution
>> time performance? I have a feeling that Pig produces more Map/Reduce tasks
>> than other interfaces, which may be beneficial where there is extremely
>> skewed data. Am I wrong in thinking this, or is there another benefit to
>> more Map/Reduce tasks? And how does Pig go about splitting a job into this
>> number of tasks?
>>
>
> Map and reduce parallelism are controlled differently in Hadoop.  Map
> parallelism is controlled by the InputSplit.  IS determines how many maps to
> start and which file blocks to assign to which maps.  In the case of PigMix,
> both the MR Java code and the Pig code use some subclass of FileInputFormat,
> so the map parallelism is the same in both tests.  I do not know for sure,
> but I believe Hive also uses FileInputFormat.
>
> Reduce parallelism is set explicitly as part of the job configuration.  In
> MapReduce this is done through the Java API.  In Pig it is done through
> the PARALLEL command.  In PigMix, we set parallelism for both the
> same (40 I believe for this data size).
>

I have a query about this procedure. It will warrant a simple answer I
assume, but I just need clarity on this. I am wondering how, for example,
both the MR applications and the Pig programs will react if there are no
specifications for the number of Map or Reduce jobs. If, let's say, I were a
programmer writing some Pig scripts where I do not know the skew of the
data, my first execution of the Pig script would be done without any
specification of #Mappers or #Reducers. Is it not a more natural examination
of Pig vs MR apps where both Pig and the MR app have to decide these details
for themselves? So my question is: Why is it a fundamental requirement that
the Pig script and the associated MR app be given figures for initial
Map/Reduce tasks?


>
> In general, the places where Pig beats MR are due to better algorithms.  The
> MR code was written assuming a basic level of MR coding and database
> knowledge.  So, for example, for the order by queries, the MR code achieves a
> total order by having a single reducer at the end.  Pig has a much more
> sophisticated system where it samples the data, determines a distribution,
> and then uses multiple reducers while maintaining a total order.  So for
> large data sets Pig will beat MR for these particular tests.
>

Sounds very elegant, a really neat solution to skewed data. Is there some
documentation of this process, as I'd like to include that methodology in my
report. And then display data results like: "skewed data / execution time",
where trend lines for Pig, Hive and MR apps are shown. It would be nice to
show that, as the skew of the data increases, Pig overtakes the corresponding
MR app in execution performance.


>
> It is, by definition, always possible to write MR Java code as fast as Pig
> code, since Pig is implemented over MR.  But that isn't what PigMix is
> testing.  PigMix aims to test how fast code is for what we guess is a
> typical programmer choosing between MR and Pig.  If you want instead to test
> the best possible MR against the best possible Pig Latin, the MR code in
> these tests should be rewritten.
>
>
>
>> 2.
>> This sort of leads onto my next question. I want to, in some way, be able
>> to
>> test the performance of Pig against others when dealing with a dataset of
>> extremely skewed data. I have a vague understanding of what this may mean,
>> but I do need clarity on the definition of skewed data, and the effect
>> this
>> has on the performance of the Map/Reduce model.
>>
>
> In practice we see that almost all data we deal with at Yahoo is power law
> distributed.  Hence we built most of the PigMix data that way as well.  We
> used zipf as a definition of skewed, as it turned out to be a reasonably
> close match to our data.
>
> The interesting situation, for us anyway, is when the data skews enough
> that it is no longer possible to process a single key in memory on a single
> node.  Once you have to spill parts of the data to disk, performance suffers
> badly.  For sufficiently large data sets (10G or more) zipf distribution
> meets this criteria.
>
> Pig has quite a bit of processing dedicated to dealing with skew issues
> gracefully, since that is one of the weak points of Map Reduce.  As
> mentioned above, order by samples the data and distributes skewed keys
> across multiple reducers (since for a total ordering there is no need to
> collect all keys onto one reducer).  Pig has a skewed join that also splits
> skewed keys across multiple reducers.  For group by (where it truly can't
> split keys) Pig uses the combiner and the upcoming accumulator interface
> (see https://issues.apache.org/jira/browse/PIG-979 ).
>
>
>
>> 3.
>> Another question relating to the number of Map/Reduce tasks generated
>> by
>> Pig: I am interested to see if, despite the fact that Pig, Hive, JAQL
>> etc... all use the Hadoop fault tolerance technologies, whether the number
>> of Map/Reduce tasks has an effect on Hadoop's ability to recover, from
>> say,
>> a DataNode that fails. I.e. If there are 100 Map jobs, spread across 10
>> DataNodes, and one DataNode fails, then approximately 10 Map jobs will be
>> redistributed over the remaining 9 DataNodes. If, however, there were 500
>> Map jobs over the 10 DataNodes, one of them fails, then 50 Map jobs will
>> be
>> reallocated to the remaining 9 DataNodes. Am I to expect a difference in
>> overall performance between these two scenarios?
>>
>
> In Pig's case (and I believe in Hive's and JAQL's) this is all handled by
> Map Reduce.  So I would direct questions on this to
> mapreduce-user@hadoop.apache.org
>
>
>
>> 4.
>> Which leads onto... the Pig DataGenerator. Is this what I'm looking for to
>> generate my data? I am probably looking to generate data that would
>> typically take 30 minutes to process, and I have a cluster of 10 nodes
>> available to me. I would happily use some existing tool to create my test
>> data for analysis, and I just need to know whether the DataGenerator is
>> the
>> tool for me? ( http://wiki.apache.org/pig/DataGeneratorHadoop )
>>
>
> The tool we used to generate data for the original PigMix is attached to
> that patch referenced above.  The downfall with the original tool is that it
> takes about 30 hours to generate the amount of data needed for PigMix (which
> runs for about 5 minutes on an older 10 node cluster, so you'd need
> significantly more) because it's single threaded.  The DataGenerator is an
> attempt to convert that original tool to work in MR so it can go much
> faster.  When I've played with it, it does create skewed data, but it does
> not create the very long tail of data that the original tool does.  If the
> very long tail is not important to you then it may be a good choice.  It may
> also be possible to use the tool to create the long tail and I just
> configured it incorrectly.
>
>
>
>> 5.
>> Finally... What is the best way to analyse the performance of each Pig job
>> sent to the Hadoop cluster? Obviously, I can simply use the Unix time
>> command, but are there any real-time performance analysis tools built into
>> either Pig or Hadoop to monitor performance? This would typically include
>> the number of Map tasks and the number of reduce tasks over time, and the
>> behaviour of the Hadoop cluster in the event of a DataNode failure, i.e.
>> following a failure, the amount of network bandwidth used, the CPU/Memory
>> of
>> the NameNode during a failure of a DataNode etc....
>>
>
> I generally use Hadoop's GUI to monitor and find historic information about
> my jobs.  The same information can be gleaned from the Hadoop logs.  You
> might take a look at the Chukwa project, which provides a parser for Hadoop
> logs and presents it all in a graphical format.
>
>
OK, so by this, do you mean that you use the web interface to view:
MapReduce tracker, task trackers, and the HDFS name node? I've had a look at
the Chukwa project, and I may be mistaken, but to me it looks like a bit of
a beast to configure, and becomes more useful as you increase the number of
nodes in the cluster. The cluster I have available to me is 10 nodes. I will
have a good look at the Hadoop logs generated by each of the nodes to see if
that would suffice.


>
>
>>
>> I know it's a lot, but I really appreciate any assistance, and fully
>> intend
>> to return my completed paper to this Pig and Hadoop community on its
>> completion!
>>
>>
>> Many thanks,
>>
>>
>> Rob Stewart
>>
>
> Alan.
>
>

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:

> Hi there.
>
> As some of you may have read on this mailing list previously, I'm  
> studying
> various interfaces with Hadoop, one of those being Pig.
>
> I have five further questions. I am now beginning to think about
> the design
> of my testing (query design, indicators of performance etc...).
>
> 1:
> I have had a good look at the PigMix benchmark Wiki page, and find it
> interesting that a few Pig queries now execute more quickly than the
> corresponding Java MapReduce implementation (
> http://wiki.apache.org/pig/PigMix ). The following data processing  
> functions
> in Pig outperform the Java equivalent:
> distinct aggregation
> anti join
> group application
> order by 1 field
> order by multiple fields
> distinct + join
> multi-store
>
>
> A few questions: Am I able to obtain the actual queries used for the  
> PigMix
> benchmarking? And how about obtaining their Java Hadoop equivalent?

https://issues.apache.org/jira/browse/PIG-200 has the code.  The  
original perf.patch has both the MR Java code and the Pig Latin  
scripts.  The data generator is also in this patch.

> And,
> how, technically, is this high performance achieved? I have read the  
> paper
> "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia, Zheng Shao), and  
> they
> were using a snapshot of Pig trunk from June 2009, showing Pig  
> executing an
> aggregation query and a join query more slowly than Java Hadoop or  
> Hive, but
> the aspect that interests me is the number of Map/Reduce tasks  
> created by
> Pig and Hive. In what way does this number affect execution time
> performance? I have a feeling that Pig produces more Map/Reduce tasks
> than other interfaces, which may be beneficial where there is extremely
> skewed data. Am I wrong in thinking this, or is there another benefit
> to more Map/Reduce tasks? And how does Pig go about splitting a job
> into this number of tasks?

Map and reduce parallelism are controlled differently in Hadoop.  Map  
parallelism is controlled by the InputSplit.  IS determines how many  
maps to start and which file blocks to assign to which maps.  In the  
case of PigMix, both the MR Java code and the Pig code use some  
subclass of FileInputFormat, so the map parallelism is the same in  
both tests.  I do not know for sure, but I believe Hive also uses  
FileInputFormat.

Reduce parallelism is set explicitly as part of the job  
configuration.  In MapReduce this is done through the Java API.  In  
Pig it is done through the PARALLEL command.  In PigMix, we set
parallelism for both the same (40 I believe for this data size).

In general, the places where Pig beats MR are due to better algorithms.
The MR code was written assuming a basic level of MR coding and  
database knowledge.  So, for example, for the order by queries, the MR code
achieves a total order by having a single reducer at the end.  Pig has  
a much more sophisticated system where it samples the data, determines  
a distribution, and then uses multiple reducers while maintaining a  
total order.  So for large data sets Pig will beat MR for these  
particular tests.
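
If you want to reproduce that trick by hand on the MR side, Hadoop
ships the building blocks.  A rough sketch (untested; MySortJob is a
stand-in, and this assumes the old 0.20 mapred API with Text sort
keys):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.InputSampler;
    import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

    JobConf conf = new JobConf(MySortJob.class);
    conf.setNumReduceTasks(40);
    // Sample the input to pick 39 split points, one fewer than reducers.
    InputSampler.RandomSampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.01, 1000);
    Path partitionFile = new Path("/tmp/partitions.lst");
    TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
    InputSampler.writePartitionFile(conf, sampler);
    // Each key range then goes to its own reducer, so concatenating the
    // reducer outputs in order yields a total order.
    conf.setPartitionerClass(TotalOrderPartitioner.class);

Pig's order by does the equivalent sampling pass for you automatically.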

It is, by definition, always possible to write MR Java code as fast as  
Pig code, since Pig is implemented over MR.  But that isn't what  
PigMix is testing.  PigMix aims to test how fast code is for what we
guess is a typical programmer choosing between MR and Pig.  If you  
want instead to test the best possible MR against the best possible  
Pig Latin, the MR code in these tests should be rewritten.

>
> 2.
> This sort of leads onto my next question. I want to, in some way, be  
> able to
> test the performance of Pig against others when dealing with a  
> dataset of
> extremely skewed data. I have a vague understanding of what this may  
> mean,
> but I do need clarity on the definition of skewed data, and the  
> effect this
> has on the performance of the Map/Reduce model.

In practice we see that almost all data we deal with at Yahoo is power  
law distributed.  Hence we built most of the PigMix data that way as  
well.  We used zipf as a definition of skewed, as it turned out to be  
a reasonably close match to our data.

The interesting situation, for us anyway, is when the data skews  
enough that it is no longer possible to process a single key in memory  
on a single node.  Once you have to spill parts of the data to disk,  
performance suffers badly.  For sufficiently large data sets (10G or  
more) zipf distribution meets this criterion.
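
If you want to roll your own skewed input along those lines, here is
the shape of a simple zipf generator (a toy sketch: exponent 1, single
threaded, keys written to stdout; the real PigMix generator is in the
PIG-200 patch mentioned above):

    import java.util.Arrays;
    import java.util.Random;

    // Draw keys 1..n with P(k) proportional to 1/k, by inverting the
    // precomputed cumulative distribution.
    int n = 1000000;                            // distinct keys
    double[] cdf = new double[n];
    double sum = 0.0;
    for (int k = 1; k <= n; k++) {
        sum += 1.0 / k;
        cdf[k - 1] = sum;
    }
    for (int k = 0; k < n; k++) cdf[k] /= sum;  // normalize to [0,1]

    Random rand = new Random();
    for (long i = 0; i < 100000000L; i++) {
        int key = Arrays.binarySearch(cdf, rand.nextDouble());
        if (key < 0) key = -key - 1;            // first entry >= the draw
        System.out.println("key" + (key + 1));  // key1 dominates heavily
    }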

Pig has quite a bit of processing dedicated to dealing with skew  
issues gracefully, since that is one of the weak points of Map  
Reduce.  As mentioned above, order by samples the data and distributes  
skewed keys across multiple reducers (since for a total ordering there  
is no need to collect all keys onto one reducer).  Pig has a skewed  
join that also splits skewed keys across multiple reducers.  For group  
by (where it truly can't split keys) Pig uses the combiner and the
upcoming accumulator interface (see
https://issues.apache.org/jira/browse/PIG-979 ).
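
On the script side most of this is transparent; skewed join is the one
you ask for explicitly.  A tiny sketch via PigServer (field names made
up; check the exact USING "skewed" syntax against your release):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("A = LOAD 'pages' AS (user, url);");
    pig.registerQuery("B = LOAD 'users' AS (user, age);");
    // ask Pig to spread heavily repeated user keys across reducers
    pig.registerQuery(
        "J = JOIN A BY user, B BY user USING \"skewed\" PARALLEL 40;");
    pig.store("J", "joined");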

>
> 3.
> Another question relating to the number of Map/Reduce tasks
> generated by
> Pig: I am interested to see if, despite the fact that Pig, Hive, JAQL
> etc... all use the Hadoop fault tolerance technologies, whether the  
> number
> of Map/Reduce tasks has an effect on Hadoop's ability to recover,  
> from say,
> a DataNode that fails. I.e. If there are 100 Map jobs, spread across  
> 10
> DataNodes, and one DataNode fails, then approximately 10 Map jobs  
> will be
> redistributed over the remaining 9 DataNodes. If, however, there  
> were 500
> Map jobs over the 10 DataNodes, one of them fails, then 50 Map jobs  
> will be
> reallocated to the remaining 9 DataNodes. Am I to expect a  
> difference in
> overall performance between these two scenarios?

In Pig's case (and I believe in Hive's and JAQL's) this is all handled  
by Map Reduce.  So I would direct questions on this to mapreduce-user@hadoop.apache.org

>
> 4.
> Which leads onto... the Pig DataGenerator. Is this what I'm looking  
> for to
> generate my data? I am probably looking to generate data that would
> typically take 30 minutes to process, and I have a cluster of 10 nodes
> available to me. I would happily use some existing tool to create my  
> test
> data for analysis, and I just need to know whether the DataGenerator  
> is the
> tool for me? ( http://wiki.apache.org/pig/DataGeneratorHadoop )

The tool we used to generate data for the original PigMix is attached  
to that patch referenced above.  The downfall with the original tool  
is that it takes about 30 hours to generate the amount of data needed  
for PigMix (which runs for about 5 minutes on an older 10 node  
cluster, so you'd need significantly more) because it's single  
threaded.  The DataGenerator is an attempt to convert that original  
tool to work in MR so it can go much faster.  When I've played with  
it, it does create skewed data, but it does not create the very long  
tail of data that the original tool does.  If the very long tail is  
not important to you then it may be a good choice.  It may also be
possible to use the tool to create the long tail and I just configured  
it incorrectly.

>
> 5.
> Finally... What is the best way to analyse the performance of each  
> Pig job
> sent to the Hadoop cluster? Obviously, I can simply use the Unix time
> command, but are there any real-time performance analysis tools  
> built into
> either Pig or Hadoop to monitor performance? This would typically
> include
> the number of Map tasks and the number of reduce tasks over time,  
> and the
> behaviour of the Hadoop cluster in the event of a DataNode failure,  
> i.e.
> following a failure, the amount of network bandwidth used, the CPU/ 
> Memory of
> the NameNode during a failure of a DataNode etc....

I generally use Hadoop's GUI to monitor and find historic information  
about my jobs.  The same information can be gleaned from the Hadoop  
logs.  You might take a look at the Chukwa project, which provides a  
parser for Hadoop logs and presents it all in a graphical format.

>
>
> I know it's a lot, but I really appreciate any assistance, and fully
> intend
> to return my completed paper to this Pig and Hadoop community on
> its
> completion!
>
>
> Many thanks,
>
>
> Rob Stewart

Alan.