Posted to user@pig.apache.org by Mua Ban <ba...@gmail.com> on 2013/04/12 17:18:21 UTC

join operation fails on big data set

Hi,

I am very new to Pig/Hadoop; I started writing my first Pig script a couple
of days ago and ran into this problem.

My cluster has 9 nodes. I have to join two data sets, one big and one small,
each collected over 4 weeks. I first took subsets of each covering the first
week of data; call them B1 and S1, for the big and small data sets of the
first week. The full 4-week data sets are B4 and S4.

I ran my script on my cluster to join B1 and S1 and everything was fine; I
got my joined data. However, when I ran the script to join B4 and S4, it
failed. B4 is 39GB, S4 is 300MB. B4 is skewed: some ids appear much more
frequently than others. I tried both 'using skewed' and 'using replicated'
modes for the join (by appending them to the end of the join clause below),
and both fail.

Here is my script, and I think it is very simple:

big = load 'bigdir/' using PigStorage(',') as (id:chararray, data:chararray);
small = load 'smalldir/' using PigStorage(',') as (t1:double, t2:double, data:chararray, id:chararray);
J = JOIN big by id LEFT OUTER, small by id;
store J into 'outputdir' using PigStorage(',');
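
For completeness, the two failing variants looked like this (one at a time,
appended to the join clause above):

J = JOIN big by id LEFT OUTER, small by id USING 'skewed';      -- samples the data and spreads hot keys over several reducers
J = JOIN big by id LEFT OUTER, small by id USING 'replicated';  -- loads 'small' into memory on every map task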

On the web UI of the job tracker, I see that the job has 40 reducers (I guess
this is normal: the total data is about 40GB, and by default Pig allocates
one reducer per GB). If I add 'parallel 80' to the join above, I see 80
reducers, but the join still fails.

I checked the file mapred-default.xml and found this:
<name>mapred.reduce.tasks</name>
  <value>1</value>

If I set PARALLEL on the join, it should override this, right?
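
Concretely, this is what I set (my understanding is that PARALLEL is
per-operator, and that 'set default_parallel' would be the script-wide
equivalent):

J = JOIN big by id LEFT OUTER, small by id PARALLEL 80;

-- or, once near the top of the script:
set default_parallel 80;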


On the tracker GUI, I see that across different runs the number of completed
reducers varies from 4 to 10 (out of 40 total reducers). The tracker GUI
shows the reason for the failed reducers: "Task
attempt_201304081613_0046_r_000006_0 failed to report status for 600
seconds. Killing!"
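
I assume the 600 seconds corresponds to mapred.task.timeout (600000 ms by
default), so presumably I could raise it from the script like this, although
I suspect that would only hide whatever is making the reducers stall:

set mapred.task.timeout 1800000;  -- 30 minutes, in milliseconds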

Could you please help?
Thank you very much,
-Mua

--------------------------------------------------------------------------------------------------------------
Here is the error report from the console screen where I ran this script:

job_201304081613_0032   616     0       230     12      32      0       0       0       big     MAP_ONLY
job_201304081613_0033   705     1       21      6       6       234     234     234             SAMPLER

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201304081613_0034   small   SKEWED_JOIN     Message: Job failed!
Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1.
LastFailedTask: task_201304081613_0034_r_000012

Input(s):
Successfully read 364285458 records (39528533645 bytes) from:
"hdfs://d0521b01:24990/user/abc/big/"
Failed to read data from "hdfs://d0521b01:24990/user/abc/small/"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201304081613_0032   ->      job_201304081613_0033,
job_201304081613_0033   ->      job_201304081613_0034,
job_201304081613_0034   ->      null,
null


2013-04-10 20:11:23,815 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning REDUCER_COUNT_LOW 1 time(s).
2013-04-10 20:11:23,815 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Some jobs have failed! Stop running all dependent jobs
2013-04-10 20:11:23,815 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Encountered IOException. java.io.IOException: Error Recovery for block blk_312487981794332936_26563 failed  because recovery from primary datanode 10.6.25.31:54563 failed 6 times.  Pipeline was 10.6.25.31:54563. Aborting...
Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
2013-04-10 20:11:23,818 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2244: Job failed, hadoop does not return any error message
Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log

Re: join operation fails on big data set

Posted by Johnny Zhang <xi...@cloudera.com>.
Hi, Mua:
Your log has
2013-04-12 14:00:00,777 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-199210310173610155_28360 bad datanode[0] 10.6.25.33:49197
2013-04-12 14:00:00,866 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-199210310173610155_28360 in pipeline 10.6.25.33:49197, 10.6.25.141:39369, 10.6.25.31:54563: bad datanode 10.6.25.33:49197

Can you check your datanode 10.6.25.33:49197? You can either log in to that
node and check whether the datanode daemon is running, or open your namenode
URL at <namenode_IP>:50070/dfshealth.jsp or
<namenode_IP>:50070/dfsclusterhealth.jsp (both show how many datanodes are
live and how many are dead).
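
If you prefer the command line, I believe this also works from any box with
the Hadoop client configured (it prints live/dead datanode counts and per-DN
status):

hadoop dfsadmin -report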


You can bump your log level by opening /etc/pig/conf/pig.properties and
setting debug=DEBUG (sorry for the confusion; I didn't mean the log4j debug
level here). With that change you will see lines like
'2013-04-15 10:46:04,069 [main] DEBUG xxxxxxxxx' in your console output.

Johnny


On Sun, Apr 14, 2013 at 6:13 AM, Mua Ban <ba...@gmail.com> wrote:

> Hi Johnny,
>
> Thank you very much for your email. I am very new here. Please tell me
> where to check the health of the data node (which log file should I look
> at?), and how to set the logging level of log4j to DEBUG.
>
> Thanks,
> -Mua
>
>
> On Fri, Apr 12, 2013 at 5:01 PM, Johnny Zhang <xi...@cloudera.com> wrote:
>
> > Seems like an HDFS issue; as you said, certain blocks cannot be retrieved
> > from certain DNs. Can you check the health of all DNs? And probably also
> > bump the log4j level to DEBUG.
> >
> > Johnny
> >
> >
> > On Fri, Apr 12, 2013 at 12:06 PM, Mua Ban <ba...@gmail.com> wrote:
> >
> > > Thank you very much Cheolsoo,
> > >
> > > I am running the script once more right now, and at the moment I see 7
> > > failed reducers on the job tracker GUI. I browsed these failed reducers
> > > and found the task logs. Of these 7 failed reducers, some have the type 1
> > > task log and the rest have the type 2 task log, as shown below.
> > >
> > > They seem related to a connection issue among nodes in the cluster. Do
> > > you know of any parameters I should configure to figure out the actual
> > > problem?
> > >
> > > Thank you,
> > > -Mua
> > >
> > > ---------------------------------------
> > > Type 1 task log
> > >
> > > 2013-04-12 13:42:24,960 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 5 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:42:25,259 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:42:25,271 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 610 segments...
> > > 2013-04-12 13:42:25,273 INFO org.apache.hadoop.mapred.Merger: Merging 610 sorted segments
> > > 2013-04-12 13:42:25,275 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 610 segments left of total size: 96922927 bytes
> > > 2013-04-12 13:42:27,348 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Merge of the 610 files in-memory complete. Local file is /hdfs/sp/filesystem/mapred/local/taskTracker/vul/jobcache/job_201304081613_0049/attempt_201304081613_0049_r_000009_0/output/map_6.out of size 96921713
> > > 2013-04-12 13:42:27,349 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Thread waiting: Thread for merging on-disk files
> > > 2013-04-12 13:42:30,263 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:42:35,267 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:42:38,145 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000584_0'
> > > 2013-04-12 13:42:44,150 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000557_0'
> > > 2013-04-12 13:42:55,283 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:43:05,164 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000604_0'
> > > 2013-04-12 13:43:06,036 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:43:11,169 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000597_1'
> > > 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0 is already in progress
> > > 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:44:21,042 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0 is already in progress
> > > 2013-04-12 13:44:21,043 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:44:29,222 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000576_0'
> > > 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 4 map output(s) where 0 is already in progress
> > > 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:46:01,334 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:46:06,341 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Need another 2 map output(s) where 0 is already in progress
> > > 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:46:41,301 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000616_1'
> > > 2013-04-12 13:46:41,351 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-12 13:46:42,301 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 1 files left.
> > > 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 11 files left.
> > > 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Merging 11 sorted segments
> > > 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 11 segments left of total size: 3152550 bytes
> > > 2013-04-12 13:46:42,393 INFO org.apache.hadoop.mapred.ReduceTask: Merged 11 segments, 3152550 bytes to disk to satisfy reduce memory limit
> > > 2013-04-12 13:46:42,394 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2 files, 100074247 bytes from disk
> > > 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
> > > 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
> > > 2013-04-12 13:46:42,398 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 100074239 bytes
> > > 2013-04-12 13:57:45,872 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_-199210310173610155_28360java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.6.25.33:47987 remote=/10.6.25.33:49197]
> > > at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> > > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> > > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> > > at java.io.DataInputStream.readFully(DataInputStream.java:189)
> > > at java.io.DataInputStream.readLong(DataInputStream.java:410)
> > > at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
> > > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)
> > >
> > > 2013-04-12 14:00:00,777 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-199210310173610155_28360 bad datanode[0] 10.6.25.33:49197
> > > 2013-04-12 14:00:00,866 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-199210310173610155_28360 in pipeline 10.6.25.33:49197, 10.6.25.141:39369, 10.6.25.31:54563: bad datanode 10.6.25.33:49197
> > > 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.6.25.33:49197 java.io.IOException: Bad connect ack with firstBadLink as 10.6.25.32:53741
> > > 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-8936348770928346278_28365
> > > 2013-04-12 14:04:55,907 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.6.25.32:53741
> > > 2013-04-12 14:06:07,789 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_8322203915584568195_28367java.io.IOException: Bad response 1 for block blk_8322203915584568195_28367 from datanode 10.6.25.31:54563
> > > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)
> > >
> > > 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8322203915584568195_28367 bad datanode[2] 10.6.25.31:54563
> > > 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8322203915584568195_28367 in pipeline 10.6.25.33:49197, 10.6.25.145:48897, 10.6.25.31:54563: bad datanode 10.6.25.31:54563
> > > 2013-04-12 14:06:45,112 WARN org.apache.hadoop.mapred.Task: Parent died.  Exiting attempt_201304081613_0049_r_000009_0
> > >
> > >
> > > ----------------------------------------------------------
> > >
> > > Type 2 task log
> > >
> > > 14_1'
> > > 2013-04-11 15:24:33,168 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000609_1'
> > > 2013-04-11 15:24:39,172 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000599_0'
> > > 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Need another 5 map output(s) where 0 is already in progress
> > > 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-11 15:25:24,203 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000577_0'
> > > 2013-04-11 15:25:51,529 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-11 15:26:00,227 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000556_1'
> > > 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Need another 3 map output(s) where 0 is already in progress
> > > 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 2 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000576_0'
> > > 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000560_1'
> > > 2013-04-11 15:26:06,603 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 1 files left.
> > > 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 109 files left.
> > > 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Merging 109 sorted segments
> > > 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 109 segments left of total size: 23323822 bytes
> > > 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merged 109 segments, 23323822 bytes to disk to satisfy reduce memory limit
> > > 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2 files, 120503030 bytes from disk
> > > 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
> > > 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
> > > 2013-04-11 15:26:07,531 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 120503022 bytes
> > > 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect ack with firstBadLink as 10.6.25.32:53741
> > > 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_423162934996020555_28131
> > > 2013-04-11 15:28:34,123 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.6.25.32:53741
> > > 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect ack with firstBadLink as 10.6.25.31:54563
> > > 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-7269648817395135125_28148
> > > 2013-04-11 15:58:06,152 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.6.25.31:54563
> > > 2013-04-11 16:06:10,175 WARN org.apache.hadoop.mapred.Task: Parent died.  Exiting attempt_201304081613_0046_r_000010_0
> > >
> > >
> > >
> > > On Fri, Apr 12, 2013 at 2:25 PM, Cheolsoo Park <pi...@gmail.com> wrote:
> > >
> > > > Hi Mua,
> > > >
> > > > I guess you misunderstood me. The pig_****.log file is not a task log.
> > > >
> > > > You should look for task logs on the data nodes where your task trackers
> > > > ran. Here is some explanation of the various log files in Hadoop and
> > > > where to find them:
> > > >
> > > > http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/
> > > >
> > > > Thanks,
> > > > Cheolsoo
> > > >
> > > >
> > > > > On Fri, Apr 12, 2013 at 10:27 AM, Mua Ban <ba...@gmail.com> wrote:
> > > >
> > > > > Thank you very much for your reply.
> > > > >
> > > > > Below is the stack trace from the pig_****.log file:
> > > > >
> > > > > Can you please give me some suggestions?
> > > > >
> > > > > -Mua
> > > > > ------------------
> > > > > Backend error message
> > > > > ---------------------
> > > > > Task attempt_201304081613_0048_r_000001_0 failed to report status for 601 seconds. Killing!
> > > > >
> > > > > Pig Stack Trace
> > > > > ---------------
> > > > > ERROR 2997: Unable to recreate exception from backed error: Task attempt_201304081613_0048_r_000001_0 failed to report status for 601 seconds. Killing!
> > > > >
> > > > > org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: Task attempt_201304081613_0048_r_000001_0 failed to report status for 601 seconds. Killing!
> > > > >         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:217)
> > > > >         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:152)
> > > > >         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:383)
> > > > >         at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
> > > > >         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
> > > > >         at org.apache.pig.PigServer.execute(PigServer.java:1245)
> > > > >         at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
> > > > >         at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132)
> > > > >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
> > > > >         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
> > > > >         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
> > > > >         at org.apache.pig.Main.run(Main.java:555)
> > > > >         at org.apache.pig.Main.main(Main.java:111)
> > > > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
> > > > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
> > > > >         at java.lang.reflect.Method.invoke(Method.java:611)
> > > > >         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > > > >
> > > > > ================================================================================
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Apr 12, 2013 at 11:29 AM, Cheolsoo Park <piaozhexiu@gmail.com> wrote:
> > > > >
> > > > > > Did you look at task logs to see why those tasks failed? Since it's a
> > > > > > back-end error, the console output doesn't tell you much. Task logs
> > > > > > should have a stack trace that shows why it failed, and you can go
> > > > > > from there.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <ba...@gmail.com> wrote:
> > > > > >
> > > > > > > [original message snipped; quoted in full at the top of this thread]

Re: join operation fails on big data set

Posted by Mua Ban <ba...@gmail.com>.
Hi Johnny,

Thank you very much for your email. I am very new here. Please tell me
where to check the health of the data node (which log file should I look
at?), and how to set the logging level of log4j to DEBUG.

Thanks,
-Mua


On Fri, Apr 12, 2013 at 5:01 PM, Johnny Zhang <xi...@cloudera.com> wrote:

> [quoted thread snipped; all of these messages appear in full earlier in this thread]

Re: join operation fails on big data set

Posted by Johnny Zhang <xi...@cloudera.com>.
Seems like an HDFS issue; as you said, certain blocks cannot be retrieved from
certain DNs. Can you check the health of all DNs? And probably also bump the
log4j level to DEBUG.

Johnny


On Fri, Apr 12, 2013 at 12:06 PM, Mua Ban <ba...@gmail.com> wrote:

> Thank you very much Cheolsoo,
>
> I am running the script once more right now and I see 7 failed reducers at
> the moment on the job tracker GUI. I browse these failed reducers and I
> found the task logs. From these 7 failed reducers, some have type 1 task
> log, the rest have type 2 task log as I show below.
>
> They seem related to some connection issue among nodes in the cluster. Do
> you know any parameters I should configure to figure out the actual
> problem?
>
> Thank you,
> -Mua
>
> ---------------------------------------
> *Type 1 task log*
>
> 3-04-12 13:42:24,960 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 5 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:42:25,259 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:42:25,271 INFO org.apache.hadoop.mapred.ReduceTask:
> Initiating in-memory merge with 610 segments...
> 2013-04-12 13:42:25,273 INFO org.apache.hadoop.mapred.Merger: Merging 610
> sorted segments
> 2013-04-12 13:42:25,275 INFO org.apache.hadoop.mapred.Merger: Down to the
> last merge-pass, with 610 segments left of total size: 96922927 bytes
> 2013-04-12 13:42:27,348 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Merge of the 610 files in-memory
> complete. Local file is
> /hdfs/sp/filesystem/mapred/local/taskTracker/vul/jobcache/job_201304081613_0049/attempt_201304081613_0049_r_000009_0/output/map_6.out
> of size 96921713
> 2013-04-12 13:42:27,349 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Thread waiting: Thread for merging
> on-disk files
> 2013-04-12 13:42:30,263 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:42:35,267 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:42:38,145 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000584_0'
> 2013-04-12 13:42:44,150 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000557_0'
> 2013-04-12 13:42:55,283 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:43:05,164 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000604_0'
> 2013-04-12 13:43:06,036 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:43:11,169 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000597_1'
> 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0
> is already in progress
> 2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:44:21,042 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0
> is already in progress
> 2013-04-12 13:44:21,043 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:44:29,222 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000576_0'
> 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Need another 4 map output(s) where 0
> is already in progress
> 2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:46:01,334 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:46:06,341 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Need another 2 map output(s) where 0
> is already in progress
> 2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:46:41,301 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000616_1'
> 2013-04-12 13:46:41,351 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-12 13:46:42,301 INFO org.apache.hadoop.mapred.ReduceTask:
> GetMapEventsThread exiting
> 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
> getMapsEventsThread joined.
> 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: Closed
> ram manager
> 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
> Interleaved on-disk merge complete: 1 files left.
> 2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: In-memory
> merge complete: 11 files left.
> 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Merging 11
> sorted segments
> 2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Down to the
> last merge-pass, with 11 segments left of total size: 3152550 bytes
> 2013-04-12 13:46:42,393 INFO org.apache.hadoop.mapred.ReduceTask: Merged 11
> segments, 3152550 bytes to disk to satisfy reduce memory limit
> 2013-04-12 13:46:42,394 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2
> files, 100074247 bytes from disk
> 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0
> segments, 0 bytes from memory into reduce
> 2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.Merger: Merging 2
> sorted segments
> 2013-04-12 13:46:42,398 INFO org.apache.hadoop.mapred.Merger: Down to the
> last merge-pass, with 2 segments left of total size: 100074239 bytes
> 2013-04-12 13:57:45,872 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_-199210310173610155_28360java.net.SocketTimeoutException: 69000 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.6.25.33:47987 remote=/
> 10.6.25.33:49197]
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
> at java.io.DataInputStream.readFully(DataInputStream.java:189)
> at java.io.DataInputStream.readLong(DataInputStream.java:410)
> at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)
>
> 2013-04-12 14:00:00,777 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-199210310173610155_28360 bad datanode[0]
> 10.6.25.33:49197
> 2013-04-12 14:00:00,866 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-199210310173610155_28360 in pipeline
> 10.6.25.33:49197, 10.6.25.141:39369, 10.6.25.31:54563: bad datanode
> 10.6.25.33:49197
> 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> createBlockOutputStream 10.6.25.33:49197 java.io.IOException: Bad connect
> ack with firstBadLink as 10.6.25.32:53741
> 2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> block blk_-8936348770928346278_28365
> 2013-04-12 14:04:55,907 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
> datanode 10.6.25.32:53741
> 2013-04-12 14:06:07,789 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_8322203915584568195_28367java.io.IOException: Bad response 1 for block
> blk_8322203915584568195_28367 from datanode 10.6.25.31:54563
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)
>
> 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_8322203915584568195_28367 bad datanode[2]
> 10.6.25.31:54563
> 2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_8322203915584568195_28367 in pipeline
> 10.6.25.33:49197, 10.6.25.145:48897, 10.6.25.31:54563: bad datanode
> 10.6.25.31:54563
> 2013-04-12 14:06:45,112 WARN org.apache.hadoop.mapred.Task: Parent died.
>  Exiting attempt_201304081613_0049_r_000009_0
>
>
> ----------------------------------------------------------
>
> *Type 2 task log*
>
>
> 14_1'
> 2013-04-11 15:24:33,168 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000609_1'
> 2013-04-11 15:24:39,172 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000599_0'
> 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0046_r_000010_0 Need another 5 map output(s) where 0
> is already in progress
> 2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-11 15:25:24,203 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000577_0'
> 2013-04-11 15:25:51,529 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-11 15:26:00,227 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000556_1'
> 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0046_r_000010_0 Need another 3 map output(s) where 0
> is already in progress
> 2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0046_r_000010_0 Scheduled 2 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000576_0'
> 2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
> obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000560_1'
> 2013-04-11 15:26:06,603 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0
> dup hosts)
> 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> GetMapEventsThread exiting
> 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> getMapsEventsThread joined.
> 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: Closed
> ram manager
> 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
> Interleaved on-disk merge complete: 1 files left.
> 2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: In-memory
> merge complete: 109 files left.
> 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Merging 109
> sorted segments
> 2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Down to the
> last merge-pass, with 109 segments left of total size: 23323822 bytes
> 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merged
> 109 segments, 23323822 bytes to disk to satisfy reduce memory limit
> 2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2
> files, 120503030 bytes from disk
> 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0
> segments, 0 bytes from memory into reduce
> 2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.Merger: Merging 2
> sorted segments
> 2013-04-11 15:26:07,531 INFO org.apache.hadoop.mapred.Merger: Down to the
> last merge-pass, with 2 segments left of total size: 120503022 bytes
> 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect
> ack with firstBadLink as 10.6.25.32:53741
> 2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> block blk_423162934996020555_28131
> 2013-04-11 15:28:34,123 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
> datanode 10.6.25.32:53741
> 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
> createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect
> ack with firstBadLink as 10.6.25.31:54563
> 2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
> block blk_-7269648817395135125_28148
> 2013-04-11 15:58:06,152 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
> datanode 10.6.25.31:54563
> 2013-04-11 16:06:10,175 WARN org.apache.hadoop.mapred.Task: Parent died.
>  Exiting attempt_201304081613_0046_r_000010_0
>
>
>
> On Fri, Apr 12, 2013 at 2:25 PM, Cheolsoo Park <pi...@gmail.com> wrote:
>
> > Hi Mua,
> >
> > I guess you misunderstood me. The pig_****.log file is not a task log.
> >
> > You should look for task logs on data nodes where your task tracker ran.
> > Here is some explanation regarding various log files in Hadoop and where to
> > find them:
> >
> >
> >
> http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/
> >
> > Thanks,
> > Cheolsoo
> >
> >
> > On Fri, Apr 12, 2013 at 10:27 AM, Mua Ban <ba...@gmail.com> wrote:
> >
> > > Thank you very much for your reply.
> > >
> > > Below is the stack log in the pig_****.log file
> > >
> > > Can you please give me some suggestions?
> > >
> > > -Mua
> > > ------------------
> > > Backend error message
> > > ---------------------
> > > [backend error and Pig stack trace from the pig_****.log file, quoted in full in the messages below]
> > >
> > > > On Fri, Apr 12, 2013 at 11:29 AM, Cheolsoo Park <piaozhexiu@gmail.com>wrote:
> > >
> > > > Did you look at task logs to see why those tasks failed? Since it's a
> > > > back-end error, the console output doesn't tell you much. Task logs should
> > > > have a stack trace that shows why it failed, and you can go from there.
> > > >
> > > >
> > > >
> > > > > On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <ba...@gmail.com> wrote:
> > > > >
> > > > > [original message and console error report quoted in full in the messages below]

Re: join operation fails on big data set

Posted by Mua Ban <ba...@gmail.com>.
Thank you very much, Cheolsoo.

I am running the script once more right now, and at the moment I see 7 failed
reducers on the job tracker GUI. I browsed these failed reducers and found
their task logs. Of the 7 failed reducers, some have the type 1 task log and
the rest have the type 2 task log, as shown below.

They seem related to some connection issue among the nodes in the cluster. Do
you know of any parameters I should configure to help pin down the actual problem?
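
(To make the question concrete: I understand Pig can pass Hadoop properties
through to the job with SET, so I am picturing something like the sketch below
at the top of the script. The property names are standard Hadoop 1.x ones, but
the values are made-up examples, not recommendations:

  SET mapred.task.timeout 1800000;               -- kill after 30 min without progress reports, instead of the default 600s
  SET dfs.socket.timeout 180000;                 -- DFSClient read timeout, default is 60s
  SET dfs.datanode.socket.write.timeout 600000;  -- DFSClient write timeout

Is that the right kind of knob, or something else entirely?)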

Thank you,
-Mua

---------------------------------------
*Type 1 task log*

3-04-12 13:42:24,960 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 5 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:42:25,259 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:42:25,271 INFO org.apache.hadoop.mapred.ReduceTask:
Initiating in-memory merge with 610 segments...
2013-04-12 13:42:25,273 INFO org.apache.hadoop.mapred.Merger: Merging 610
sorted segments
2013-04-12 13:42:25,275 INFO org.apache.hadoop.mapred.Merger: Down to the
last merge-pass, with 610 segments left of total size: 96922927 bytes
2013-04-12 13:42:27,348 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Merge of the 610 files in-memory
complete. Local file is
/hdfs/sp/filesystem/mapred/local/taskTracker/vul/jobcache/job_201304081613_0049/attempt_201304081613_0049_r_000009_0/output/map_6.out
of size 96921713
2013-04-12 13:42:27,349 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Thread waiting: Thread for merging
on-disk files
2013-04-12 13:42:30,263 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:42:35,267 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:42:38,145 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000584_0'
2013-04-12 13:42:44,150 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000557_0'
2013-04-12 13:42:55,283 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:43:05,164 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000604_0'
2013-04-12 13:43:06,036 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:43:11,169 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000597_1'
2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0
is already in progress
2013-04-12 13:43:21,040 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:44:21,042 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Need another 5 map output(s) where 0
is already in progress
2013-04-12 13:44:21,043 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:44:29,222 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000576_0'
2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Need another 4 map output(s) where 0
is already in progress
2013-04-12 13:45:21,333 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:46:01,334 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:46:06,341 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Need another 2 map output(s) where 0
is already in progress
2013-04-12 13:46:21,350 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 0 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:46:41,301 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0049_m_000616_1'
2013-04-12 13:46:41,351 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0049_r_000009_0 Scheduled 2 outputs (0 slow hosts and0
dup hosts)
2013-04-12 13:46:42,301 INFO org.apache.hadoop.mapred.ReduceTask:
GetMapEventsThread exiting
2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
getMapsEventsThread joined.
2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: Closed
ram manager
2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask:
Interleaved on-disk merge complete: 1 files left.
2013-04-12 13:46:42,302 INFO org.apache.hadoop.mapred.ReduceTask: In-memory
merge complete: 11 files left.
2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Merging 11
sorted segments
2013-04-12 13:46:42,303 INFO org.apache.hadoop.mapred.Merger: Down to the
last merge-pass, with 11 segments left of total size: 3152550 bytes
2013-04-12 13:46:42,393 INFO org.apache.hadoop.mapred.ReduceTask: Merged 11
segments, 3152550 bytes to disk to satisfy reduce memory limit
2013-04-12 13:46:42,394 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2
files, 100074247 bytes from disk
2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0
segments, 0 bytes from memory into reduce
2013-04-12 13:46:42,395 INFO org.apache.hadoop.mapred.Merger: Merging 2
sorted segments
2013-04-12 13:46:42,398 INFO org.apache.hadoop.mapred.Merger: Down to the
last merge-pass, with 2 segments left of total size: 100074239 bytes
2013-04-12 13:57:45,872 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
blk_-199210310173610155_28360java.net.SocketTimeoutException: 69000 millis
timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.6.25.33:47987 remote=/
10.6.25.33:49197]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.DataInputStream.readFully(DataInputStream.java:189)
at java.io.DataInputStream.readLong(DataInputStream.java:410)
at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2967)

2013-04-12 14:00:00,777 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-199210310173610155_28360 bad datanode[0]
10.6.25.33:49197
2013-04-12 14:00:00,866 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-199210310173610155_28360 in pipeline
10.6.25.33:49197, 10.6.25.141:39369, 10.6.25.31:54563: bad datanode
10.6.25.33:49197
2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream 10.6.25.33:49197 java.io.IOException: Bad connect
ack with firstBadLink as 10.6.25.32:53741
2013-04-12 14:04:55,904 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_-8936348770928346278_28365
2013-04-12 14:04:55,907 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
datanode 10.6.25.32:53741
2013-04-12 14:06:07,789 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
blk_8322203915584568195_28367java.io.IOException: Bad response 1 for block
blk_8322203915584568195_28367 from datanode 10.6.25.31:54563
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)

2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_8322203915584568195_28367 bad datanode[2]
10.6.25.31:54563
2013-04-12 14:06:25,735 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_8322203915584568195_28367 in pipeline
10.6.25.33:49197, 10.6.25.145:48897, 10.6.25.31:54563: bad datanode
10.6.25.31:54563
2013-04-12 14:06:45,112 WARN org.apache.hadoop.mapred.Task: Parent died.
 Exiting attempt_201304081613_0049_r_000009_0


----------------------------------------------------------

*Type 2 task log*


14_1'
2013-04-11 15:24:33,168 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000609_1'
2013-04-11 15:24:39,172 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000599_0'
2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0046_r_000010_0 Need another 5 map output(s) where 0
is already in progress
2013-04-11 15:25:01,179 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-11 15:25:24,203 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000577_0'
2013-04-11 15:25:51,529 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-11 15:26:00,227 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000556_1'
2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0046_r_000010_0 Need another 3 map output(s) where 0
is already in progress
2013-04-11 15:26:01,558 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0046_r_000010_0 Scheduled 2 outputs (0 slow hosts and0
dup hosts)
2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000576_0'
2013-04-11 15:26:06,235 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring
obsolete output of KILLED map-task: 'attempt_201304081613_0046_m_000560_1'
2013-04-11 15:26:06,603 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201304081613_0046_r_000010_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
GetMapEventsThread exiting
2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
getMapsEventsThread joined.
2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: Closed
ram manager
2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask:
Interleaved on-disk merge complete: 1 files left.
2013-04-11 15:26:07,236 INFO org.apache.hadoop.mapred.ReduceTask: In-memory
merge complete: 109 files left.
2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Merging 109
sorted segments
2013-04-11 15:26:07,238 INFO org.apache.hadoop.mapred.Merger: Down to the
last merge-pass, with 109 segments left of total size: 23323822 bytes
2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merged
109 segments, 23323822 bytes to disk to satisfy reduce memory limit
2013-04-11 15:26:07,528 INFO org.apache.hadoop.mapred.ReduceTask: Merging 2
files, 120503030 bytes from disk
2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0
segments, 0 bytes from memory into reduce
2013-04-11 15:26:07,529 INFO org.apache.hadoop.mapred.Merger: Merging 2
sorted segments
2013-04-11 15:26:07,531 INFO org.apache.hadoop.mapred.Merger: Down to the
last merge-pass, with 2 segments left of total size: 120503022 bytes
2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect
ack with firstBadLink as 10.6.25.32:53741
2013-04-11 15:28:34,121 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_423162934996020555_28131
2013-04-11 15:28:34,123 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
datanode 10.6.25.32:53741
2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream 10.6.25.144:40971 java.io.IOException: Bad connect
ack with firstBadLink as 10.6.25.31:54563
2013-04-11 15:58:06,150 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_-7269648817395135125_28148
2013-04-11 15:58:06,152 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
datanode 10.6.25.31:54563
2013-04-11 16:06:10,175 WARN org.apache.hadoop.mapred.Task: Parent died.
 Exiting attempt_201304081613_0046_r_000010_0



On Fri, Apr 12, 2013 at 2:25 PM, Cheolsoo Park <pi...@gmail.com> wrote:

> Hi Mua,
>
> I guess you misunderstood me. The pig_****.log file is not a task log.
>
> You should look for task logs on data nodes where your task tracker ran.
> Here is some explanation regarding various log files in Hadoop and where to
> find them:
>
>
> http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/
>
> Thanks,
> Cheolsoo
>
>
> On Fri, Apr 12, 2013 at 10:27 AM, Mua Ban <ba...@gmail.com> wrote:
>
> > Thank you very much for your reply.
> >
> > Below is the stack log in the pig_****.log file
> >
> > Can you please give me some suggestions?
> >
> > -Mua
> > ------------------
> > Backend error message
> > ---------------------
> > [backend error and Pig stack trace from the pig_****.log file, quoted in full in the messages below]
> >
> >
> >
> >
> >
> > On Fri, Apr 12, 2013 at 11:29 AM, Cheolsoo Park <piaozhexiu@gmail.com>wrote:
> >
> > > Did you look at task logs to see why those tasks failed? Since it's a
> > > back-end error, the console output doesn't tell you much. Task logs should
> > > have a stack trace that shows why it failed, and you can go from there.
> > >
> > >
> > >
> > > On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <ba...@gmail.com> wrote:
> > >
> > > > [original message and console error report quoted in full in the messages below]

Re: join operation fails on big data set

Posted by Cheolsoo Park <pi...@gmail.com>.
Hi Mua,

I guess you misunderstood me. The pig_****.log file is not a task log.

You should look for task logs on data nodes where your task tracker ran.
Here is some explanation regarding various log files in Hadoop and where to
find them:

http://blog.cloudera.com/blog/2009/09/apache-hadoop-log-files-where-to-find-them-in-cdh-and-what-info-they-contain/
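
As a rough pointer (the exact layout varies by Hadoop version and distro), on
an MR1 TaskTracker node the per-attempt logs usually land under the
TaskTracker's log directory, e.g.:

  ${HADOOP_LOG_DIR}/userlogs/job_201304081613_0048/attempt_201304081613_0048_r_000001_0/
      stdout  stderr  syslog      # syslog is where the reducer's stack trace ends up

(older 1.x releases put the attempt directories directly under userlogs/).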

Thanks,
Cheolsoo


On Fri, Apr 12, 2013 at 10:27 AM, Mua Ban <ba...@gmail.com> wrote:

> Thank you very much for your reply.
>
> Below is the stack log in the pig_****.log file
>
> Can you please give me some suggestions?
>
> -Mua
> ------------------
> Backend error message
> ---------------------
> Task attempt_201304081613_0048_r_000001_0 failed to report status for 601
> seconds. Killing!
>
> Pig Stack Trace
> ---------------
> ERROR 2997: Unable to recreate exception from backed error: Task
> attempt_201304081613_0048_r_000001_0 failed to report status for 601
> seconds. Killing!
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to
> recreate exception from backed error: Task
> attempt_201304081613_0048_r_000001_0 failed to report status for 601
> seconds. Killing!
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:217)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:152)
>         at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:383)
>         at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
>         at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
>         at org.apache.pig.PigServer.execute(PigServer.java:1245)
>         at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
>         at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
>         at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>         at org.apache.pig.Main.run(Main.java:555)
>         at org.apache.pig.Main.main(Main.java:111)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
>         at java.lang.reflect.Method.invoke(Method.java:611)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> ================================================================================
>
>
>
>
>
> On Fri, Apr 12, 2013 at 11:29 AM, Cheolsoo Park <piaozhexiu@gmail.com>wrote:
>
> > Did you look at task logs to see why those tasks failed? Since it's a
> > back-end error, the console output doesn't tell you much. Task logs should
> > have a stack trace that shows why it failed, and you can go from there.
> >
> >
> >
> > On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <ba...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am very new to PIG/Hadoop, I just started writing my first PIG
> script a
> > > couple days ago. I ran into this problem.
> > >
> > > My cluster has 9 nodes. I have to join two data sets big and small,
> each
> > is
> > > collected for 4 weeks. I first take two subsets of my data set (which
> is
> > > for the first week of data), let's call them B1 and S1 for big and
> small
> > > data sets of the first week. The entire data sets of 4 weeks is B4 and
> > S4.
> > >
> > > I ran my script on my cluster to join B1 and S1 and everything is
> fine. I
> > > got my joined data. However, when I ran my script to join B4 and S4,
> the
> > > script failed. B4 is 39GB, S4 is 300MB. B4 is skewed, some id appears
> > more
> > > frequently than others. I tried both 'using skewed' and 'using
> > replicated'
> > > modes for the join operation (by appending them to the end of the below
> > > join clause), they both fail.
> > >
> > > Here is my script and i think it is very simple:
> > >
> > > *big = load 'bigdir/' using PigStorage(',') as (id:chararray,
> > > data:chararray);*
> > > *small = load 'smalldir/' using PigStorage(',') as
> > > (t1:double,t2:double,data:chararray,id:chararray);
> > > *
> > > *J = JOIN big by id LEFT OUTER, small by id;
> > > *
> > > *store J into 'outputdir' using PigStorage(',');
> > > *
> > >
> > > On the web ui of the tracker, I see that the job has 40 reducers (I
> guess
> > > since the total data is about 40GB, and each 1GB will need one reducer
> by
> > > default of PIG and hadoop setting, so this is normal). If I use
> 'parallel
> > > 80' in the join operation above, then I see 80 reducers, and the join
> > > operation still failed.
> > >
> > > I checked file  mapred-default.xml and found this:
> > > <name>mapred.reduce.tasks</name>
> > >   <value>1</value>
> > >
> > > If I set the value of parallel in join operation, it should overwrite
> > this,
> > > right?
> > >
> > >
> > > On the tracker GUI, I see that for different runs, the number of
> > completed
> > > reducers changes from 4 to 10 (out of 40 total reducers). The tracker
> GUI
> > > shows the reason for the failed reducers: "Task
> > > attempt_201304081613_0046_r_000006_0 failed to report status for 600
> > > seconds. Killing!"
> > >
> > > *Could you please help?*
> > > Thank you very much,
> > > -Mua
> > >
> > >
> > >
> >
> --------------------------------------------------------------------------------------------------------------
> > > Here is the error report from the console screen where I ran this
> script:
> > >
> > > job_201304081613_0032   616     0       230     12      32      0   0
> > > 0       big     MAP_ONLY
> > > job_201304081613_0033   705     1       21      6       6       234 2
> > > 34      234             SAMPLER
> > >
> > > Failed Jobs:
> > > JobId   Alias   Feature Message Outputs
> > > job_201304081613_0034   small   SKEWED_JOIN     Message: Job failed!
> > > Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount:
> 1.
> > > LastFailedTask: task_201304081613_0034_r_000012
> > >
> > > Input(s):
> > > Successfully read 364285458 records (39528533645 bytes) from:
> > > "hdfs://d0521b01:24990/user/abc/big/"
> > > Failed to read data from "hdfs://d0521b01:24990/user/abc/small/"
> > >
> > > Output(s):
> > >
> > > Counters:
> > > Total records written : 0
> > > Total bytes written : 0
> > > Spillable Memory Manager spill count : 0
> > > Total bags proactively spilled: 0
> > > Total records proactively spilled: 0
> > >
> > > Job DAG:
> > > job_201304081613_0032   ->      job_201304081613_0033,
> > > job_201304081613_0033   ->      job_201304081613_0034,
> > > job_201304081613_0034   ->      null,
> > > null
> > >
> > >
> > > 2013-04-10 20:11:23,815 [main] WARN
> > >
> > >
> >
>  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - Encountered Warning
> > > REDUCER_COUNT_LOW 1 time(s).
> > > 2013-04-10 20:11:23,815 [main] INFO
> > >
> > >
> >
>  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > > - Some jobs have faile
> > > d! Stop running all dependent jobs
> > > 2013-04-10 20:11:23,815 [main] ERROR
> > org.apache.pig.tools.grunt.GruntParser
> > > - ERROR 2997: Encountered IOException. java.io.IOException: Er
> > > ror Recovery for block blk_312487981794332936_26563 failed  because
> > > recovery from primary datanode 10.6.25.31:54563 failed 6 times.  Pipel
> > > ine was 10.6.25.31:54563. Aborting...
> > > Details at logfile:
> /homes/abc/pig-flatten/scripts/pig_1365627648226.log
> > > 2013-04-10 20:11:23,818 [main] ERROR
> > org.apache.pig.tools.grunt.GruntParser
> > > - ERROR 2244: Job failed, hadoop does not return any error mes
> > > sage
> > > Details at logfile:
> /homes/abc/pig-flatten/scripts/pig_1365627648226.log
> > >
> >
>

Re: join operation fails on big data set

Posted by Mua Ban <ba...@gmail.com>.
Thank you very much for your reply.

Below is the stack trace from the pig_****.log file.

Can you please give me some suggestions?

-Mua
------------------
Backend error message
---------------------
Task attempt_201304081613_0048_r_000001_0 failed to report status for 601
seconds. Killing!

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: Task
attempt_201304081613_0048_r_000001_0 failed to report status for 601
seconds. Killing!

org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to
recreate exception from backed error: Task
attempt_201304081613_0048_r_000001_0 failed to report status for 601
seconds. Killing!
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:217)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:152)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:383)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1270)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1255)
        at org.apache.pig.PigServer.execute(PigServer.java:1245)
        at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:132)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:193)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
        at org.apache.pig.Main.run(Main.java:555)
        at org.apache.pig.Main.main(Main.java:111)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
        at java.lang.reflect.Method.invoke(Method.java:611)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
================================================================================






Re: join operation fails on big data set

Posted by Cheolsoo Park <pi...@gmail.com>.
Did you look at task logs to see why those tasks failed? Since it's a
back-end error, the console output doesn't tell you much. Task logs should
have a stack trace that shows why it failed, and you can go from there.



On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <ba...@gmail.com> wrote:

> Hi,
>
> I am very new to Pig/Hadoop; I just started writing my first Pig script a
> couple of days ago, and I ran into this problem.
>
> My cluster has 9 nodes. I have to join two data sets, big and small, each
> collected over 4 weeks. I first took two subsets covering the first week of
> data; call them B1 and S1 for the big and small data sets of the first
> week. The full 4-week data sets are B4 and S4.
>
> I ran my script on my cluster to join B1 and S1, and everything was fine; I
> got my joined data. However, when I ran the same script to join B4 and S4,
> it failed. B4 is 39GB and S4 is 300MB. B4 is skewed: some ids appear much
> more frequently than others. I tried both 'using skewed' and 'using
> replicated' for the join (by appending them to the end of the join clause
> below), and both fail.
>
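A note on the 'replicated' attempt: Pig's fragment-replicate join runs
map-side and loads the entire right-hand (small) relation into memory in
every map task, so a 300MB small input can exceed the task heap. Of the
outer joins, only the left outer form is supported; a minimal sketch, using
the aliases from the script below:

    J = JOIN big BY id LEFT OUTER, small BY id USING 'replicated';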
> Here is my script; I think it is very simple:
>
> big = load 'bigdir/' using PigStorage(',') as (id:chararray,
>     data:chararray);
> small = load 'smalldir/' using PigStorage(',') as
>     (t1:double, t2:double, data:chararray, id:chararray);
> J = JOIN big by id LEFT OUTER, small by id;
> store J into 'outputdir' using PigStorage(',');
>
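Since B4 is skewed, the skewed join can also be combined with an explicit
reducer count on the same statement; a sketch of the join line only (the
value 80 is illustrative, not a recommendation):

    J = JOIN big BY id LEFT OUTER, small BY id USING 'skewed' PARALLEL 80;

The 'skewed' variant samples the join keys first (the SAMPLER job visible in
the console output below) and splits heavily populated keys across multiple
reducers.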
> On the web UI of the tracker, I see that the job has 40 reducers (I guess
> that since the total data is about 40GB, and each 1GB needs one reducer
> under the default Pig and Hadoop settings, this is normal). If I use
> 'parallel 80' in the join operation above, then I see 80 reducers, but the
> join operation still fails.
>
> I checked the file mapred-default.xml and found this:
> <name>mapred.reduce.tasks</name>
>   <value>1</value>
>
> If I set a PARALLEL value on the join operation, it should override this,
> right?
>
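For reference, a PARALLEL clause on an operator does override
mapred.reduce.tasks for that job's reduce phase. Pig also accepts a
script-wide default, which any per-operator PARALLEL still takes precedence
over; a minimal sketch (80 is illustrative):

    SET default_parallel 80;
    -- the explicit PARALLEL below overrides default_parallel for this join
    J = JOIN big BY id LEFT OUTER, small BY id PARALLEL 80;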
>
> On the tracker GUI, I see that across different runs the number of
> completed reducers varies from 4 to 10 (out of 40 total). The tracker GUI
> shows the reason for the failed reducers: "Task
> attempt_201304081613_0046_r_000006_0 failed to report status for 600
> seconds. Killing!"
>
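The 600 seconds here is the default value of mapred.task.timeout (600000 ms):
a task that reports no progress for that long is killed. If the reducers are
genuinely working through a huge skewed key rather than hanging, one common
workaround is to raise the timeout from inside the script; a sketch, assuming
the Hadoop 1.x property name used in this thread:

    -- 30 minutes, in milliseconds (the default is 600000)
    SET mapred.task.timeout 1800000;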
> Could you please help?
> Thank you very much,
> -Mua
>
>
> --------------------------------------------------------------------------------------------------------------
> Here is the error report from the console screen where I ran this script:
>
> job_201304081613_0032   616     0       230     12      32      0       0       0       big     MAP_ONLY
> job_201304081613_0033   705     1       21      6       6       234     234     234             SAMPLER
>
> Failed Jobs:
> JobId   Alias   Feature Message Outputs
> job_201304081613_0034   small   SKEWED_JOIN     Message: Job failed!
> Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1.
> LastFailedTask: task_201304081613_0034_r_000012
>
> Input(s):
> Successfully read 364285458 records (39528533645 bytes) from:
> "hdfs://d0521b01:24990/user/abc/big/"
> Failed to read data from "hdfs://d0521b01:24990/user/abc/small/"
>
> Output(s):
>
> Counters:
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
>
> Job DAG:
> job_201304081613_0032   ->      job_201304081613_0033,
> job_201304081613_0033   ->      job_201304081613_0034,
> job_201304081613_0034   ->      null,
> null
>
>
> 2013-04-10 20:11:23,815 [main] WARN
>  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Encountered Warning REDUCER_COUNT_LOW 1 time(s).
> 2013-04-10 20:11:23,815 [main] INFO
>  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - Some jobs have failed! Stop running all dependent jobs
> 2013-04-10 20:11:23,815 [main] ERROR org.apache.pig.tools.grunt.GruntParser
> - ERROR 2997: Encountered IOException. java.io.IOException: Error Recovery
> for block blk_312487981794332936_26563 failed because recovery from primary
> datanode 10.6.25.31:54563 failed 6 times. Pipeline was 10.6.25.31:54563.
> Aborting...
> Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
> 2013-04-10 20:11:23,818 [main] ERROR org.apache.pig.tools.grunt.GruntParser
> - ERROR 2244: Job failed, hadoop does not return any error message
> Details at logfile: /homes/abc/pig-flatten/scripts/pig_1365627648226.log
>