Posted to common-dev@hadoop.apache.org by Thushara Wijeratna <th...@gmail.com> on 2009/05/29 21:53:58 UTC

child task choking on large (19K) input line

has anyone seen this? basically a child task is killing itself because a ping
to the parent didn't quite work - the reply from the parent was unexpected.

hadoop version: 0.19.0
userlogs on slave node:

2009-05-29 13:57:33,551 WARN org.apache.hadoop.mapred.TaskRunner: Parent
died.  Exiting attempt_200905281652_0013_m_000006_1
[root@domU-12-31-38-01-7C-92 attempt_200905281652_0013_m_000006_1]#

tellingly, the last input line processed right before this WARN is 19K. (i
log the full input line in the map function for debugging)

output on map-reduce task:

Task attempt_200905281652_0013_m_000006_2 failed to report status for 600
seconds. Killing!
09/05/29 14:08:01 INFO mapred.JobClient:  map 99% reduce 32%
09/05/29 14:18:05 INFO mapred.JobClient:  map 98% reduce 32%
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at
com.adxpose.data.mr.DailyHeatmapAggregator.run(DailyHeatmapAggregator.java:547)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at
com.adxpose.data.mr.DailyHeatmapAggregator.main(DailyHeatmapAggregator.java:553)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

i believe this is the code that kills the child:

in org.apache.hadoop.mapred.Task

protected void startCommunicationThread(final TaskUmbilicalProtocol umbilical) {

....

              if (sendProgress) {
                // we need to send progress update
                updateCounters();
                taskStatus.statusUpdate(getState(),
                                        taskProgress.get(),
                                        taskProgress.toString(),
                                        counters);
                taskFound = umbilical.statusUpdate(taskId, taskStatus);
                taskStatus.clearStatus();
              }
              else {
                // send ping
                taskFound = umbilical.ping(taskId);
              }

              // if Task Tracker is not aware of our task ID (probably
              // because it died and came back up), kill ourselves
              if (!taskFound) {
                LOG.warn("Parent died.  Exiting "+taskId);
                System.exit(66);
              }

Re: child task choking on large (19K) input line

Posted by Thushara Wijeratna <th...@gmail.com>.
done: https://issues.apache.org/jira/browse/HADOOP-6109
i attached the patch as a file (vs submitting) as that is what the doc
suggested.
could you please review it, and if all's well i'll submit it.

thanks,
thushara


Re: child task choking on large (19K) input line

Posted by Philip Zeyliger <ph...@cloudera.com>.
On Fri, Jun 5, 2009 at 1:41 PM, Thushara Wijeratna <th...@gmail.com> wrote:
> i fixed this on my system - Text.append degraded for large buffers (225M
> line could not be processed for over 20 min).
> the fix was to substitute a ByteArrayOutputStream and now the map/reduce
> runs in under 30s.
>
> should i submit a patch for this? what is the process?

Please do contribute patches!  The instructions are at
http://wiki.apache.org/hadoop/HowToContribute .  You'll end up
submitting a JIRA, writing a patch, including some tests (and, for
performance issues, some performance numbers and steps to reproduce).

-- Philip

Re: child task choking on large (19K) input line

Posted by Thushara Wijeratna <th...@gmail.com>.
i fixed this on my system - Text.append degraded for large buffers (a 225M
line still had not been processed after more than 20 min).
the fix was to substitute a ByteArrayOutputStream and now the map/reduce
runs in under 30s.
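
for anyone curious, roughly the kind of substitution i mean - a standalone
sketch of the idea rather than the actual LineReader change (made-up
class/method names; no '\r' handling; bytes read past the '\n' are dropped
instead of pushed back):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.hadoop.io.Text;

    public class LineReadSketch {
      // buffer the per-read chunks in a ByteArrayOutputStream (whose backing
      // array grows geometrically) and copy into the Text once per line,
      // instead of calling Text.append() for every 4096-byte chunk
      public static void readOneLine(InputStream in, Text str) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream(4096);
        byte[] buffer = new byte[4096];
        int n;
        outer:
        while ((n = in.read(buffer)) > 0) {
          for (int i = 0; i < n; i++) {
            if (buffer[i] == '\n') {
              line.write(buffer, 0, i);   // up to (not including) the newline
              break outer;
            }
          }
          line.write(buffer, 0, n);       // whole chunk, no newline seen yet
        }
        str.set(line.toByteArray());      // one copy into the Text at the end
      }
    }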

should i submit a patch for this? what is the process?

thanks,
thushara


Re: child task choking on large (19K) input line

Posted by Thushara Wijeratna <th...@gmail.com>.
ok guys - i did some more debugging on this, and it seems that reading a 19K
line from the hadoop file system (using FSDataInputStream class methods)
could take upwards of 20 minutes. the child task is timing out in the
LineReader.readLine() call - this basically reads the line one page (4096
bytes) at a time.

my hadoop instance for this test (with a 20 min timeout) is all on one host
(one name node, one data node) on a MacBook Pro (2.6 GHz Core 2 Duo, 4G RAM).

i included LOG statements inside LineReader.readLine() to get an idea of the
latencies:

      if (length >= 0) {
        str.append(buffer, startPosn, length);
        LOG.info("str.length= " + str.getLength() + " just wrote from " +
                 startPosn + " to " + length + " bytes");
      }


and this is the kind of output i get:
at the beginning:
2009-06-03 09:35:56,863 INFO org.apache.hadoop.util.LineReader: str.length= 4096 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,864 INFO org.apache.hadoop.util.LineReader: str.length= 8192 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,865 INFO org.apache.hadoop.util.LineReader: str.length= 12288 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,866 INFO org.apache.hadoop.util.LineReader: str.length= 16384 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,866 INFO org.apache.hadoop.util.LineReader: str.length= 20480 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,867 INFO org.apache.hadoop.util.LineReader: str.length= 24576 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,867 INFO org.apache.hadoop.util.LineReader: str.length= 28672 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,868 INFO org.apache.hadoop.util.LineReader: str.length= 32768 just wrote from 0 to 4096 bytes
2009-06-03 09:35:56,869 INFO org.apache.hadoop.util.LineReader: str.length= 36864 just wrote from 0 to 4096 bytes



in the end:
2009-06-03 09:46:02,918 INFO org.apache.hadoop.util.LineReader: str.length= 60141568 just wrote from 0 to 4096 bytes
2009-06-03 09:46:03,048 INFO org.apache.hadoop.util.LineReader: str.length= 60145664 just wrote from 0 to 4096 bytes
2009-06-03 09:46:03,118 INFO org.apache.hadoop.util.LineReader: str.length= 60149760 just wrote from 0 to 4096 bytes
2009-06-03 09:46:03,183 INFO org.apache.hadoop.util.LineReader: str.length= 60153856 just wrote from 0 to 4096 bytes
2009-06-03 09:46:03,252 INFO org.apache.hadoop.util.LineReader: str.length= 60157952 just wrote from 0 to 4096 bytes
2009-06-03 09:46:03,317 INFO org.apache.hadoop.util.LineReader: str.length= 60162048 just wrote from 0 to 4096 bytes
2009-06-03 09:46:03,456 INFO org.apache.hadoop.util.LineReader: str.length= 60166144 just wrote from 0 to 4096 bytes

notice the times degrade toward the end - this is the pattern: there is about
1 millisecond between the first consecutive reads, and by the end there is
more than 50ms between two consecutive reads.
so it is either the nature of the file system read, or Text.append has a
potential perf issue.
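
one quick way to separate the two suspects would be to take the file system
out of it entirely and just time Text.append on in-memory 4096-byte chunks -
a throwaway sketch (class name made up, numbers will obviously vary):

    import org.apache.hadoop.io.Text;

    public class TextAppendTiming {
      public static void main(String[] args) {
        byte[] chunk = new byte[4096];        // same chunk size readLine uses
        Text str = new Text();
        long start = System.currentTimeMillis();
        long lastReport = start;
        // append ~64MB in 4096-byte chunks, roughly what readLine() does
        // for one very long line
        for (int i = 1; i <= 16 * 1024; i++) {
          str.append(chunk, 0, chunk.length);
          if (i % 1024 == 0) {                // report every 4MB appended
            long now = System.currentTimeMillis();
            System.out.println("length=" + str.getLength()
                + " last 4MB took " + (now - lastReport) + " ms"
                + " (total " + (now - start) + " ms)");
            lastReport = now;
          }
        }
      }
    }

if the per-4MB times climb here the same way with no HDFS read involved, that
points at Text.append; if they stay flat, the read path is the suspect.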

is this kind of latency expected? if i upgrade to 0.20, would i see a
difference? how can i help debug this further?

thanks,
thushara

