Posted to user@pig.apache.org by Martin Z <ma...@hotmail.com> on 2011/01/26 16:44:42 UTC

Tips for debugging pig

Hi all,

I'm running a Pig script in local mode, and it finishes successfully. When I run the same script on the same dataset in distributed mode, it hangs at 90% and the Hadoop processes on the node machines take almost all of the memory. It always hangs at the reduce task of the last job.

The conf/mapred-site.xml is:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1000m</value>
  </property>
  <property>
    <name>mapred.child.ulimit</name>
    <value>4000000</value>
    <final>true</final>
  </property>

Do you know how I can debug the processes to find out where the problem is?

Thanks!

RE: Tips for debugging pig

Posted by jacob <ja...@gmail.com>.
Martin,

It's a bit of a black art at the moment. Every Pig script is broken down
into one or more map-reduce jobs based on the types of operations you've
got in there. JOIN, GROUP, COGROUP, and ORDER will require a reduce
(except for special circumstances that you must specify ahead of time).
If your script only has one operation that kicks off a reduce then it
has to be that operation that's the problem.
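
The "special circumstances" are map-side variants you declare up front. For instance (a sketch; relation and file names are made up), a plain JOIN forces a reduce, while a fragment-replicate join runs map-side only:

  big   = LOAD 'big_data'   AS (id:int, val:chararray);
  small = LOAD 'small_data' AS (id:int, name:chararray);
  -- ordinary JOIN: requires a reduce phase
  j1 = JOIN big BY id, small BY id;
  -- fragment-replicate join, declared ahead of time: map-side only,
  -- no reduce; the last relation ('small') must fit in memory
  j2 = JOIN big BY id, small BY id USING 'replicated';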

The phase you're looking at (~66%) is where the partition and sort
have finished and the actual reduce portion of your task is running.
Keep in mind that, at this phase, output data is finally being written
to HDFS. Do you have write permissions where you're trying to write?
What do the namenode logs say? The datanode logs?
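
To tell which operation in the script a stuck job corresponds to, Pig's EXPLAIN prints the plans before anything runs (a sketch; the aliases are made up):

  data    = LOAD 'input' AS (k:int, v:chararray);
  grouped = GROUP data BY k;
  counts  = FOREACH grouped GENERATE group, COUNT(data);
  -- prints the logical, physical, and map-reduce plans, so you can
  -- match script aliases to the jobs you see in the Hadoop UI
  EXPLAIN counts;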


--jacob
 
On Wed, 2011-01-26 at 13:32 -0300, Martin Z wrote:
> Hi Jacob,
> 
> Thanks for your response. These are the last lines of the task log, before killing the process. The machine is 192.168.1.18.
> 
> 2011-01-26 11:14:32,485 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.33333334% reduce > copy (12 of 12 at 0.00 MB/s)
> 2011-01-26 11:14:32,623 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.33333334% reduce > copy (12 of 12 at 0.00 MB/s)
> 2011-01-26 11:14:32,670 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.33333334% reduce > copy (12 of 12 at 0.00 MB/s)
> 2011-01-26 11:14:32,814 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000006_0 0.6666667% reduce > reduce
> 2011-01-26 11:14:32,971 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000003_0 0.6666667% reduce > reduce
> 2011-01-26 11:14:33,639 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000000_0 0.6666667% reduce > reduce
> 2011-01-26 11:14:33,966 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 17393 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000001_0 given 17393/17389
> 2011-01-26 11:14:33,966 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 17393, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000001_0, duration: 2439516
> 2011-01-26 11:14:34,065 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 8246 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000004_0 given 8246/8242
> 2011-01-26 11:14:34,066 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 8246, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000004_0, duration: 1218536
> 2011-01-26 11:14:34,084 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 11992 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000006_0 given 11992/11988
> 2011-01-26 11:14:34,085 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 11992, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000006_0, duration: 1474773
> 2011-01-26 11:14:34,092 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 7195 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000008_0 given 7195/7191
> 2011-01-26 11:14:34,092 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 7195, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000008_0, duration: 1679538
> 2011-01-26 11:14:34,113 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 13086 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000001_0 given 13086/13082
> 2011-01-26 11:14:34,113 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 13086, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000001_0, duration: 1596605
> 2011-01-26 11:14:34,288 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 15422 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000004_0 given 15422/15418
> 2011-01-26 11:14:34,288 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 15422, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000004_0, duration: 1191562
> 2011-01-26 11:14:34,310 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 8648 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000006_0 given 8648/8644
> 2011-01-26 11:14:34,311 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 8648, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000006_0, duration: 1486513
> 2011-01-26 11:14:34,351 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 8181 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000008_0 given 8181/8177
> 2011-01-26 11:14:34,352 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 8181, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000008_0, duration: 1530920
> 2011-01-26 11:14:35,617 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.6666667% reduce > reduce
> 2011-01-26 11:14:36,083 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000006_0 0.6666667% reduce > reduce
> 2011-01-26 11:14:38,625 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.6666667% reduce > reduce
> 
> 
> It then hangs at 66%.
> 
> About the script, is there a way of telling in which part of it the process is hanging?
> 
> 
> Thanks!
> 
> > Subject: Re: Tips for debugging pig
> > From: jacob.a.perkins@gmail.com
> > To: user@pig.apache.org
> > Date: Wed, 26 Jan 2011 10:08:08 -0600
> > 
> > Martin,
> > 	When you look at the task logs for the particular reducer that's stuck,
> > what do you see? What kind of operations do you have going on in the
> > script, possibly a GROUP ALL?
> > 
> > --jacob
> > 
> > On Wed, 2011-01-26 at 12:44 -0300, Martin Z wrote:
> > > Hi all,
> > > 
> > > I'm running a Pig script in local mode, and it finishes successfully. When I run the same script on the same dataset in distributed mode, it hangs at 90% and the Hadoop processes on the node machines take almost all of the memory. It always hangs at the reduce task of the last job.
> > > 
> > > The conf/mapred-site.xml is:
> > > 
> > >   <property>
> > >     <name>mapred.child.java.opts</name>
> > >     <value>-Xmx1000m</value>
> > >   </property>
> > >   <property>
> > >     <name>mapred.child.ulimit</name>
> > >     <value>4000000</value>
> > >     <final>true</final>
> > >   </property>
> > > 
> > > Do you know how I can debug the processes to find out where the problem is?
> > > 
> > > Thanks!
> > 
> > 



RE: Tips for debugging pig

Posted by Martin Z <ma...@hotmail.com>.

Hi Jacob,

Thanks for your response. These are the last lines of the task log, before killing the process. The machine is 192.168.1.18.

2011-01-26 11:14:32,485 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.33333334% reduce > copy (12 of 12 at 0.00 MB/s)
2011-01-26 11:14:32,623 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.33333334% reduce > copy (12 of 12 at 0.00 MB/s)
2011-01-26 11:14:32,670 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.33333334% reduce > copy (12 of 12 at 0.00 MB/s)
2011-01-26 11:14:32,814 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000006_0 0.6666667% reduce > reduce
2011-01-26 11:14:32,971 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000003_0 0.6666667% reduce > reduce
2011-01-26 11:14:33,639 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000000_0 0.6666667% reduce > reduce
2011-01-26 11:14:33,966 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 17393 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000001_0 given 17393/17389
2011-01-26 11:14:33,966 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 17393, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000001_0, duration: 2439516
2011-01-26 11:14:34,065 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 8246 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000004_0 given 8246/8242
2011-01-26 11:14:34,066 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 8246, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000004_0, duration: 1218536
2011-01-26 11:14:34,084 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 11992 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000006_0 given 11992/11988
2011-01-26 11:14:34,085 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 11992, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000006_0, duration: 1474773
2011-01-26 11:14:34,092 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 7195 bytes for reduce: 10 from map: attempt_201101261105_0002_m_000008_0 given 7195/7191
2011-01-26 11:14:34,092 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.19:56250, bytes: 7195, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000008_0, duration: 1679538
2011-01-26 11:14:34,113 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 13086 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000001_0 given 13086/13082
2011-01-26 11:14:34,113 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 13086, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000001_0, duration: 1596605
2011-01-26 11:14:34,288 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 15422 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000004_0 given 15422/15418
2011-01-26 11:14:34,288 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 15422, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000004_0, duration: 1191562
2011-01-26 11:14:34,310 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 8648 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000006_0 given 8648/8644
2011-01-26 11:14:34,311 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 8648, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000006_0, duration: 1486513
2011-01-26 11:14:34,351 INFO org.apache.hadoop.mapred.TaskTracker: Sent out 8181 bytes for reduce: 11 from map: attempt_201101261105_0002_m_000008_0 given 8181/8177
2011-01-26 11:14:34,352 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 192.168.1.18:50060, dest: 192.168.1.14:60067, bytes: 8181, op: MAPRED_SHUFFLE, cliID: attempt_201101261105_0002_m_000008_0, duration: 1530920
2011-01-26 11:14:35,617 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.6666667% reduce > reduce
2011-01-26 11:14:36,083 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000006_0 0.6666667% reduce > reduce
2011-01-26 11:14:38,625 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201101261105_0002_r_000009_0 0.6666667% reduce > reduce


It then hangs at 66%.

About the script, is there a way of telling in which part of it the process is hanging?


Thanks!

> Subject: Re: Tips for debugging pig
> From: jacob.a.perkins@gmail.com
> To: user@pig.apache.org
> Date: Wed, 26 Jan 2011 10:08:08 -0600
> 
> Martin,
> 	When you look at the task logs for the particular reducer that's stuck,
> what do you see? What kind of operations do you have going on in the
> script, possibly a GROUP ALL?
> 
> --jacob
> 
> On Wed, 2011-01-26 at 12:44 -0300, Martin Z wrote:
> > Hi all,
> > 
> > I'm running a Pig script in local mode, and it finishes successfully. When I run the same script on the same dataset in distributed mode, it hangs at 90% and the Hadoop processes on the node machines take almost all of the memory. It always hangs at the reduce task of the last job.
> > 
> > The conf/mapred-site.xml is:
> > 
> >   <property>
> >     <name>mapred.child.java.opts</name>
> >     <value>-Xmx1000m</value>
> >   </property>
> >   <property>
> >     <name>mapred.child.ulimit</name>
> >     <value>4000000</value>
> >     <final>true</final>
> >   </property>
> > 
> > Do you know how I can debug the processes to find out where the problem is?
> > 
> > Thanks!
> >
> 
> 

Re: Tips for debugging pig

Posted by jacob <ja...@gmail.com>.
Martin,
	When you look at the task logs for the particular reducer that's stuck,
what do you see? What kind of operations do you have going on in the
script, possibly a GROUP ALL?
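
A GROUP ALL in particular funnels every record to a single reducer, which is a classic cause of one stuck, memory-hungry reduce task (a sketch; aliases are made up):

  data    = LOAD 'input' AS (f1:int);
  -- everything collapses into a single group, i.e. one reducer
  grouped = GROUP data ALL;
  cnt     = FOREACH grouped GENERATE COUNT(data);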

--jacob

On Wed, 2011-01-26 at 12:44 -0300, Martin Z wrote:
> Hi all,
> 
> I'm running a Pig script in local mode, and it finishes successfully. When I run the same script on the same dataset in distributed mode, it hangs at 90% and the Hadoop processes on the node machines take almost all of the memory. It always hangs at the reduce task of the last job.
> 
> The conf/mapred-site.xml is:
> 
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx1000m</value>
>   </property>
>   <property>
>     <name>mapred.child.ulimit</name>
>     <value>4000000</value>
>     <final>true</final>
>   </property>
> 
> Do you know how I can debug the processes to find out where the problem is?
> 
> Thanks!
>