Posted to common-user@hadoop.apache.org by "David J. O'Dell" <do...@videoegg.com> on 2009/01/28 17:25:53 UTC

sudden instability in 0.18.2

We've been running 0.18.2 for over a month on an 8-node cluster.
Last week we added 4 more nodes to the cluster and have experienced 2
tasktracker failures since then.
The namenodes are running fine, but every job submitted dies with this
error on the tasktrackers.

2009-01-28 08:07:55,556 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction: attempt_200901280756_0012_m_000074_2
2009-01-28 08:07:55,682 WARN org.apache.hadoop.mapred.TaskRunner:
attempt_200901280756_0012_m_000074_2 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

I tried running the tasktrackers in debug mode but the entries above are
all that show up in the logs.
As of now my cluster is down.

-- 
David O'Dell
Director, Operations
e: dodell@videoegg.com
t:  (415) 738-5152
180 Townsend St., Third Floor
San Francisco, CA 94107

Re: sudden instability in 0.18.2

Posted by Aaron Kimball <aa...@cloudera.com>.

Wow. How many subdirectories were there? How many jobs do you run a day?

- Aaron

On Wed, Jan 28, 2009 at 12:13 PM, David J. O'Dell <do...@videoegg.com> wrote:

> It was failing on all the nodes, both new and old.
> The problem was that there were too many subdirectories under
> $HADOOP_HOME/logs/userlogs.
> The fix was just to delete the subdirs and change this setting from 24
> hours (the default) to 2 hours:
> mapred.userlog.retain.hours
>
> Would have been nice if there was an error message that pointed to this.

Re: sudden instability in 0.18.2

Posted by "David J. O'Dell" <do...@videoegg.com>.

It was failing on all the nodes, both new and old.
The problem was that there were too many subdirectories under
$HADOOP_HOME/logs/userlogs.
The fix was just to delete the subdirs and change this setting from 24
hours (the default) to 2 hours:
mapred.userlog.retain.hours

Would have been nice if there was an error message that pointed to this.
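
[Editor's sketch] Since no error message pointed at the userlogs directory, a small check run on each tasktracker could surface the problem earlier. The Python below is a minimal sketch, not part of Hadoop: the function name stale_userlog_dirs and the use of directory mtime as the age signal are assumptions made for illustration. The thread does not say why too many subdirectories broke task launch; a per-directory subdirectory limit in the local filesystem is one plausible cause, but that is a guess.

```python
import os
import time


def stale_userlog_dirs(userlogs_dir, retain_hours, now=None):
    """Return (total_subdirs, stale_paths) for a userlogs directory.

    A subdirectory is treated as stale when its mtime is older than
    retain_hours. Using mtime is an assumption of this sketch.
    """
    now = time.time() if now is None else now
    cutoff = now - retain_hours * 3600
    total = 0
    stale = []
    with os.scandir(userlogs_dir) as entries:
        for entry in entries:
            # Only count real subdirectories, not files or symlinks.
            if not entry.is_dir(follow_symlinks=False):
                continue
            total += 1
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return total, sorted(stale)


if __name__ == "__main__":
    hadoop_home = os.environ.get("HADOOP_HOME")
    if hadoop_home:
        userlogs = os.path.join(hadoop_home, "logs", "userlogs")
        if os.path.isdir(userlogs):
            total, stale = stale_userlog_dirs(userlogs, retain_hours=2)
            print("%d subdirectories, %d older than 2 hours" % (total, len(stale)))
```

The sketch only reports; whether to delete the stale directories by hand or rely on mapred.userlog.retain.hours is left to the operator.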



-- 
David O'Dell
Director, Operations
e: dodell@videoegg.com
t:  (415) 738-5152
180 Townsend St., Third Floor
San Francisco, CA 94107

Re: sudden instability in 0.18.2

Posted by Aaron Kimball <aa...@cloudera.com>.

Hi David,

If your tasks are failing on only the new nodes, it's likely that you're
missing a library or something on those machines. See this Hadoop tutorial
http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html about
"distributing debug scripts." These will allow you to capture stdout/err and
the syslog from tasks that fail.

- Aaron
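
[Editor's sketch] A debug script of the kind mentioned above can be very small. The Python below is a sketch under assumptions: it assumes the framework invokes the script with the paths of the failed task's stdout, stderr, syslog and jobconf files as positional arguments, in that order, and that the script is registered through a job property such as mapred.map.task.debug.script. Check both against the documentation for your Hadoop version. The helper names tail and summarize are made up for this example.

```python
import sys
from collections import deque


def tail(path, max_lines=20):
    """Return up to the last max_lines lines of a text file.

    Returns an empty list if the file cannot be read.
    """
    try:
        with open(path, "r", errors="replace") as handle:
            return [line.rstrip("\n") for line in deque(handle, maxlen=max_lines)]
    except OSError:
        return []


def summarize(labelled_paths, max_lines=20):
    """Build one text report with a header and a tail for each file."""
    sections = []
    for label, path in labelled_paths:
        sections.append("==== %s (%s) ====" % (label, path))
        sections.extend(tail(path, max_lines))
    return "\n".join(sections)


if __name__ == "__main__":
    # Assumed argument order; verify for your Hadoop version.
    labels = ["stdout", "stderr", "syslog", "jobconf"]
    print(summarize(list(zip(labels, sys.argv[1:]))))
```

Printing only the tail keeps the report short; raise max_lines if the interesting error appears earlier in the task logs.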


Re: sudden instability in 0.18.2

Posted by Sagar Naik <sn...@attributor.com>.

Please check which nodes have these failures.

I guess the new tasktrackers/machines are not configured correctly.
As a result, the map tasks there will die, and the remaining map tasks
will be sucked onto these machines.


-Sagar
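
[Editor's sketch] One way to check which nodes have these failures is to count the "Child Error" warnings in each tasktracker's log. The Python below is a minimal sketch: the regular expression is based only on the log lines quoted in this thread, and the function names and node names are hypothetical. It assumes you have already collected each node's tasktracker log text, for example by copying the log files to one machine.

```python
import re

# Matches the WARN line quoted in the thread. \s+ also matches a newline,
# so it works whether or not the line was wrapped by the mail client.
CHILD_ERROR = re.compile(
    r"WARN org\.apache\.hadoop\.mapred\.TaskRunner:\s+(attempt_\S+)\s+Child Error"
)


def failed_attempts(log_text):
    """Return the attempt IDs that logged a Child Error, in order."""
    return CHILD_ERROR.findall(log_text)


def failures_by_node(logs_by_node):
    """Map each node name to its number of Child Error warnings."""
    return {node: len(failed_attempts(text)) for node, text in logs_by_node.items()}
```

If the counts are concentrated on the newly added machines, a configuration difference on those machines is a reasonable suspect; if they are spread across all nodes, as turned out to be the case in this thread, look for a cause common to every tasktracker.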
