Posted to common-user@hadoop.apache.org by Shi Yu <sh...@uchicago.edu> on 2011/03/24 20:01:32 UTC

Program freezes at Map 99% Reduce 33%

I am running a Hadoop program processing terabyte-scale data. The code 
was tested successfully on a small sample (100 GB) and it worked. However, 
when trying it on the full problem, the program freezes forever at Map 
99% Reduce 33%. There is no error, and the userlog folder stays small 
(<30 MB); otherwise it would generate gigabytes of error logs.

I checked the mapper and reducer logs, and it seems the reducer is 
waiting for a map output that never arrives. What could be causing 
this? Most of the configuration is left at the defaults. I set 
"mapred.child.java.opts=-Xmx2000M" and 
"hadoop.job.history.user.location=none". The problem occurs on both 
0.19.2 and 0.20.2. Thanks!
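
(For reference, a minimal sketch of how those two properties might be 
set in a 0.20-era job driver; the DriverSketch class name and the 
elided path/submit code are illustrative, not from the original post:)

    import org.apache.hadoop.mapred.JobConf;

    public class DriverSketch {
      public static void main(String[] args) {
        JobConf conf = new JobConf(DriverSketch.class);
        // Give each child task JVM a 2000 MB heap.
        conf.set("mapred.child.java.opts", "-Xmx2000M");
        // Disable the per-user job history output.
        conf.set("hadoop.job.history.user.location", "none");
        // ... set input/output paths and submit via JobClient.runJob(conf) ...
      }
    }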

Example of Mapper logs:

2011-03-24 12:37:22,775 INFO org.apache.hadoop.mapred.Merger: Merging 3 
sorted segments
2011-03-24 12:37:22,776 INFO org.apache.hadoop.mapred.Merger: Down to 
the last merge-pass, with 3 segments left of total size: 461743 bytes
2011-03-24 12:37:22,885 INFO org.apache.hadoop.mapred.MapTask: Index: 
(11015008, 10030254, 607594)
2011-03-24 12:37:22,889 INFO org.apache.hadoop.mapred.TaskRunner: 
Task:attempt_201103231501_0007_m_000286_0 is done. And is in the process 
of commiting
2011-03-24 12:37:22,897 INFO org.apache.hadoop.mapred.TaskRunner: Task 
'attempt_201103231501_0007_m_000286_0' done.

Example of Reducer logs:

2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:50:18,484 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts 
and0 dup hosts)
2011-03-24 13:51:18,544 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201103231501_0007_r_000018_0 Need another 1 map output(s) where 
0 is already in progress
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201103231501_0007_r_000018_0: Got 0 new map-outputs
2011-03-24 13:51:18,545 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201103231501_0007_r_000018_0 Scheduled 0 outputs (0 slow hosts 
and0 dup hosts)


Shi

RE: Program freezes at Map 99% Reduce 33%

Posted by Ke...@thomsonreuters.com.
Good. Data skew shouldn't look stuck, though. Try sending status
updates so that at least you can tell the one mapper is still busy.
And yes, adding data to the key or including another field in it can
help reduce the skew.
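
(For illustration, a minimal sketch of such a status update using the
0.20 mapred API; the class name and status message are illustrative,
and the real map work is elided:)

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class StatusReportingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private long records = 0;

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // ... actual map work here ...
        if (++records % 100000 == 0) {
          // Both calls tell the TaskTracker the task is alive; the
          // status string shows up in the web UI, so a busy-but-slow
          // mapper is distinguishable from a hung one.
          reporter.setStatus("processed " + records + " records");
          reporter.progress();
        }
      }
    }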

Kevin



Re: Program freezes at Map 99% Reduce 33%

Posted by Shi Yu <sh...@uchicago.edu>.
Hi Kevin,

Thanks for the suggestion. I think I found the problem. My code is a 
chained map/reduce, and the previous iteration produced one 
.lzo_deflate output 40 times larger than the other files. That was 
because of one special "key" value, which occurs far more often than 
any other key. I used a self-defined partitioner:

    public int getPartition(TextPair key, Text value, int numPartitions) {
      // Mask the sign bit so the hash is non-negative, then pick a
      // reducer by modulo; every record sharing the same first field
      // therefore lands on the same reducer.
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

So presumably every occurrence of this key was sent to a single 
reducer, yielding that huge compressed file. And isn't a .lzo_deflate 
file unsplittable anyway? I removed the huge file and reran the 
current task, and it worked smoothly. I didn't notice this problem in 
the small test because the output was not compressed in that run.

Now I am trying to append a random string to that special key: the 
previous outputs are only intermediate input for the next task, so 
there is no reason to force them all onto a single reducer.
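
(A rough sketch of that salting idea; HOT_KEY and SALT_BUCKETS are
placeholder names, and the downstream job would have to strip or
ignore the suffix:)

    import java.util.Random;
    import org.apache.hadoop.io.Text;

    public class KeySalter {
      private static final String HOT_KEY = "...";  // the over-represented key (placeholder)
      private static final int SALT_BUCKETS = 20;   // spread the hot key over ~20 reducers
      private final Random rand = new Random();

      public Text salt(Text first) {
        if (HOT_KEY.equals(first.toString())) {
          // Append "#n" so the hash partitioner scatters this key;
          // the next job in the chain must strip the suffix back off.
          return new Text(first.toString() + "#" + rand.nextInt(SALT_BUCKETS));
        }
        return first;
      }
    }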

Shi




RE: Program freezes at Map 99% Reduce 33%

Posted by Ke...@thomsonreuters.com.
Shi,

This states "Of course, the framework discards the sub-directory of
unsuccessful task-attempts."
http://hadoop-karma.blogspot.com/2011/01/hadoop-cookbook-how-to-write.html

So yes, the missing directory is likely a failure.

If you can, narrow the problem down by looking at sections of your data
on a smaller number of mappers. Look at just the tail of your input and
a couple of mappers and then compare that to your program run without
Hadoop.
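
(A small sketch of that kind of standalone check, reusing the
tab-splitting assumption from earlier in the thread; TailCheck is an
illustrative name, and you would feed it the tail of the input, e.g.
the last few thousand lines:)

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class TailCheck {
      public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        long n = 0;
        while ((line = in.readLine()) != null) {
          n++;
          // Flag any line that would not split into at least two
          // tab-separated fields, i.e. a candidate incomplete record.
          if (line.split("\t").length < 2) {
            System.err.println("suspect record at line " + n + ": " + line);
          }
        }
        in.close();
      }
    }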

Kevin


Re: Program freezes at Map 99% Reduce 33%

Posted by Shi Yu <sh...@uchicago.edu>.
Hi Kevin,

Thanks for the reply. I can hardly imagine an incomplete record here. 
The mapper is very simple: it reads line by line as Strings, splits 
each line by tab, and outputs a TextPair for sort and secondary sort. 
If there were an incomplete record there should be an error, and the 
only place it could happen is the tab-splitting stage. I use the LZO 
codec to compress both the mapper output and the reducer output.
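
(For illustration, a sketch of such a mapper with a counter that
would surface incomplete records instead of hiding them; TextPair is
the poster's custom key class defined elsewhere, so its constructor
and the counter names here are assumptions:)

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TabSplitMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, TextPair, Text> {

      public void map(LongWritable key, Text value,
          OutputCollector<TextPair, Text> output, Reporter reporter)
          throws IOException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
          // An incomplete record shows up as a counter in the job UI
          // rather than as a silent failure.
          reporter.incrCounter("TabSplitMapper", "malformed-lines", 1);
          return;
        }
        // First field is the natural key, second the secondary-sort field.
        output.collect(new TextPair(fields[0], fields[1]), new Text(fields[1]));
      }
    }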

There is one issue which I think might be the reason. The log 
directory has the following structure; it seems I had 294 mappers. 
Notice that one directory is missing: 
"attempt_201103231501_0007_m_000292_0". Where is the output of mapper 
292? Was it a failed node or something?


All the logs have just been sitting there for about two hours (most 
timestamps are ~12:35, while the current time is 14:50; a few folders 
show later times only because I visited them around 14:45). So 
neither mappers nor reducers have generated any logs in the past two 
hours.

Shi

drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:36 attempt_201103231501_0007_m_000281_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000282_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000283_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:35 attempt_201103231501_0007_m_000284_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000285_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 13:59 attempt_201103231501_0007_m_000286_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000287_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:44 attempt_201103231501_0007_m_000288_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:37 attempt_201103231501_0007_m_000289_0
drwxr-xr-x 2 sheeyu users 85 2011-03-24 12:37 attempt_201103231501_0007_m_000289_1
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:22 attempt_201103231501_0007_m_000290_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:45 attempt_201103231501_0007_m_000291_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_m_000293_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000000_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000001_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000002_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000003_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000004_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000005_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000006_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000007_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:47 attempt_201103231501_0007_r_000008_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000009_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000010_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000011_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000012_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000013_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000014_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000015_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000016_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 12:07 attempt_201103231501_0007_r_000017_0
drwxr-xr-x 2 sheeyu users 61 2011-03-24 14:42 attempt_201103231501_0007_r_000018_0

Shi


RE: Program freezes at Map 99% Reduce 33%

Posted by Ke...@thomsonreuters.com.
Shi,

The key here is the mapper stuck at 99%. Nothing can move on until
all mappers complete.
Is it possible the data in your larger set has an incomplete record,
or some such, at the end?

Kevin
