Posted to user@pig.apache.org by Travis Brady <tr...@mochimedia.com> on 2008/03/26 22:03:04 UTC

Pig performance

I really like writing Pig code, but I'm seeing pretty terrible performance using Pig for a simple data rollup: it takes about 90 minutes to complete.  The equivalent, expressed with shell scripts and Haskell and executed with Hadoop streaming, runs in roughly 5 minutes.
My dataset is stored on HDFS as a handful of tab-delimited text files.  In
total there are 19 million rows of data.

This is running on a 3-node cluster where each machine has 8GB of ram.  I
have all three machines configured per the instructions on the Hadoop wiki
on setting up Hadoop on Ubuntu.

Here is the pig code:
<code>
Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');

HourGroups = GROUP Raw by $0;

RollUp = FOREACH HourGroups {
    GENERATE FLATTEN(group), COUNT(Raw);
}

DUMP RollUp;
</code>

Do I need to add the PARALLEL keyword in there somewhere?  Change something
in hadoop-site.xml?

The Hadoop streaming version uses "cut -c 1-13" as the mapper and a bit of
Haskell compiled with GHC as the reducer.
I can send the Haskell code along if it would help, but for now I assume I
must be doing something wrong for it to perform so poorly.
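For comparison, here is a minimal Python sketch of what that streaming pipeline computes (toy in-memory data stands in for the HDFS files; the real mapper is `cut -c 1-13` and the real reducer is the Haskell counter):

```python
# Toy illustration of the streaming rollup: key = first 13 chars (the hour
# prefix of each row), value = row count per key.
from collections import Counter

rows = [
    "2007-07-01 00\tsome\tfields",
    "2007-07-01 00\tother\tfields",
    "2007-07-01 01\tmore\tfields",
]

# Mapper step: emit the first 13 characters of each row, like `cut -c 1-13`.
keys = (row[:13] for row in rows)

# Reducer step: count rows per key -- the same rollup the Pig script expresses.
counts = Counter(keys)
print(counts)  # 2 rows for hour "2007-07-01 00", 1 for "2007-07-01 01"
```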

thank you

-- 
Travis Brady
www.mochiads.com

Re: Pig performance

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
On Mar 26, 2008, at 7:03 PM, pi song wrote:

> Olga,
>
> Is there any task profiling facility in Hadoop that we can use?
>

With hadoop-0.17 there is a feature wherein the user can ask for a small subset of map and reduce tasks to be profiled by the built-in Java profiler. That's hadoop-0.17 though...

Other than that, I've attached profilers manually to tasks and looked at them; tedious, but doable.
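For reference, the 0.17-era configuration for that feature looks roughly like the fragment below; the property names are from memory and should be checked against the 0.17 docs before relying on them:

```
<!-- hadoop-site.xml: ask Hadoop to profile a few tasks (names from memory) -->
<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
<property>
  <name>mapred.task.profile.reduces</name>
  <value>0-2</value>
</property>
```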

Arun



Re: Pig performance

Posted by pi song <pi...@gmail.com>.
Olga,

Is there any task profiling facility in Hadoop that we can use?

Pi


Re: Pig performance

Posted by Travis Brady <tr...@mochimedia.com>.
Hi Pi,

I'm thinking this could be related to my hadoop-site.xml settings,
specifically mapred.child.java.opts, which I had set to 200m (the
default). After reading this thread:
http://www.mail-archive.com/pig-user@incubator.apache.org/msg00038.html I
bumped mine up to 3060m and I'm not getting the memory errors anymore.
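The corresponding hadoop-site.xml entry looks like this (3060m is just the value that happened to work here; tune it to your machines):

```
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3060m</value>
</property>
```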

What about mapred.map.tasks and mapred.reduce.tasks?  Any thoughts on what
those should be set to?

I svn up'd Pig a few hours ago, so my code should be fully up to date.

I'm glad the memory error is gone, but is there anything else I can do to
improve performance?

thanks,
Travis





-- 
Travis Brady
www.mochiads.com

Re: Pig performance

Posted by pi song <pi...@gmail.com>.
This is obviously a memory management problem that I'm investigating.
Travis, when did you download Pig? Since the PIG-18 commit, I've found this
problem occurs less often. Can you get the latest version and try again?

Also, a few days ago we identified an issue in the memory manager,
though Ben hasn't come up with a patch yet. If the above still doesn't work,
could you please try this out?

1.  Look at SpillableMemoryManager.java
2.  Change the code that looks like this:

    if (o1Size == o2Size) {
        return 0;
    }
    if (o1Size < o2Size) {
        return -1;
    }
    return 1;

to:

    if (o1Size == o2Size) {
        return 0;
    }
    if (o1Size < o2Size) {
        return 1;
    }
    return -1;

(Basically just swap 1 with -1)
Hope that will help
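As a quick illustration of what the swap does (a Python stand-in, not the actual Pig source): the comparator now treats larger sizes as "smaller", so sorting puts the biggest spill candidates first and each spill frees the most memory.

```python
# Demonstrates the effect of swapping 1 and -1 in the size comparator:
# larger objects sort to the front of the spill list.
from functools import cmp_to_key

def fixed_cmp(o1_size, o2_size):
    # After the suggested change: equal sizes tie, larger sizes come first.
    if o1_size == o2_size:
        return 0
    if o1_size < o2_size:
        return 1
    return -1

sizes = [10, 1000, 100]
print(sorted(sizes, key=cmp_to_key(fixed_cmp)))  # [1000, 100, 10]
```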

Pi



Re: Pig performance

Posted by Travis Brady <tr...@mochimedia.com>.
Hi Olga,

Thanks for your help and thank you for open sourcing Pig.
I converted the FOREACH statement, but it yields a very lengthy error, the
relevant portion of which I've pasted below.

The good news is that the modified FOREACH was going to invoke the combiner.

2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000000
2008-03-26 16:16:00,322 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapreduceExec.MapReduceLauncher - Error message from task (map) tip_200803261041_0008_m_000002
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
    at java.io.DataOutputStream.write(DataOutputStream.java:71)
    at org.apache.pig.data.Tuple.encodeInt(Tuple.java:362)
    at org.apache.pig.data.DataAtom.write(DataAtom.java:137)
    at org.apache.pig.data.Tuple.write(Tuple.java:301)
    at org.apache.pig.data.IndexedTuple.write(IndexedTuple.java:52)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:392)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce$MapDataOutputCollector.add(PigMapReduce.java:304)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.add(GenerateSpec.java:230)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.impl.eval.collector.DataCollector.addToSuccessor(DataCollector.java:93)
    at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:35)
    at org.apache.pig.impl.eval.GenerateSpec$CrossProductItem.exec(GenerateSpec.java:261)
    at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
    at org.apache.pig.impl.eval.collector.UnflattenCollector.add(UnflattenCollector.java:52)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:113)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2084)


Travis





-- 
Travis Brady
www.mochiads.com

RE: Pig performance

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Hi Travis,

There are a couple of things you can do to improve performance of your
script.

(1) At this point we have pretty basic logic for when a combiner is
invoked. The way your query is written now, it would not be; however,
if you modify your FOREACH statement as follows, it will be:

RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);

You can see whether the combiner is invoked by running:

EXPLAIN RollUp;

(2) You do need to use the PARALLEL keyword on the GROUP operator to make
sure it runs in parallel.
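Putting both suggestions together, the script would look something like the sketch below (the value 6 for PARALLEL is purely illustrative; pick a number that suits your cluster):

```
Raw = LOAD 'stats_dump_200707' USING PigStorage('\t');
HourGroups = GROUP Raw BY $0 PARALLEL 6;
RollUp = FOREACH HourGroups GENERATE FLATTEN(group), COUNT(Raw);
DUMP RollUp;
```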

Finally, we are working on some performance improvements as part of
pipeline redesign. You can track the progress at
https://issues.apache.org/jira/browse/pig-157.

Olga
