Posted to user@hadoop.apache.org by Krishna Rao <kr...@gmail.com> on 2014/03/27 10:22:48 UTC

Job froze for hours because of an unresponsive disk on one of the task trackers

Hi,

we have a daily Hive script that usually takes a few hours to run. The
other day I noticed one of the jobs was taking in excess of a few hours.
Digging into it I saw that there were 3 attempts to launch a job on a
single node:

Task Id                           Start Time  Finish Time  Error
task_201312241250_46714_r_000048                           Error launching task
task_201312241250_46714_r_000049                           Error launching task
task_201312241250_46714_r_000050                           Error launching task

I later found out that this node had a dodgy/unresponsive disk (still being
tested right now).

We've seen tasks fail in the past, but they were re-submitted to another node
and succeeded. So, shouldn't this task have been kicked off on another node
after the first failure? Is there anything I could be missing in terms of
configuration that should be set?

We're using CDH4.4.0.

Cheers,

Krishna

Re: Job froze for hours because of an unresponsive disk on one of the task trackers

Posted by Krishna Rao <kr...@gmail.com>.

I noticed, but none of the jobs ended up being re-submitted! And all 3 of
those jobs failed on the same node. All we know is that the disk on that
node became unresponsive.


On 27 March 2014 09:33, Dieter De Witte <dr...@gmail.com> wrote:

> The ids of the tasks are different so the node got killed after failing on
> 3 different(!) reduce tasks. The reduce task 48 will probably have been
> resubmitted to another node.
>
>
> 2014-03-27 10:22 GMT+01:00 Krishna Rao <kr...@gmail.com>:
>
> Hi,
>>
>> we have a daily Hive script that usually takes a few hours to run. The
>> other day I notice one of the jobs was taking in excess of a few hours.
>> Digging into it I saw that there were 3 attempts to launch a job on a
>> single node:
>>
>> Task Id Start Time Finish Time
>> Error
>> task_201312241250_46714_r_000048 Error launching task
>> task_201312241250_46714_r_000049 Error launching task
>> task_201312241250_46714_r_000050 Error launching task
>>
>> I later found out that this node had a dodgy/unresponsive disk (still
>> being tested right now).
>>
>> We've seen tasks fail in the past, but re-submitted to another node and
>> succeeding. So, shouldn't this task have been kicked off on another node
>> after the first failure? Is there anything I could be missing in terms of
>> configuration that should be set?
>>
>> We're using CDH4.4.0.
>>
>> Cheers,
>>
>> Krishna
>>
>
>

Re: Job froze for hours because of an unresponsive disk on one of the task trackers

Posted by Dieter De Witte <dr...@gmail.com>.

The ids of the tasks are different so the node got killed after failing on
3 different(!) reduce tasks. The reduce task 48 will probably have been
resubmitted to another node.
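
The point about the ids can be checked mechanically. The sketch below assumes the usual classic MapReduce (MRv1) layout, task_<jobtracker start>_<job number>_<m|r>_<task number>, in which retries of one task share a task id and differ only in their attempt ids. The helper name parse_task_id is hypothetical and is not part of any Hadoop API.

```python
import re

# Assumed MRv1 layout: task_<jobtracker start>_<job number>_<m|r>_<task number>
TASK_ID = re.compile(r"^task_(\d+)_(\d+)_([mr])_(\d+)$")


def parse_task_id(task_id):
    """Split a task id into its job id, task type and task number (hypothetical helper)."""
    match = TASK_ID.match(task_id)
    if match is None:
        raise ValueError(f"not a task id: {task_id!r}")
    jt_start, job_number, kind, task_number = match.groups()
    return {
        "job": f"job_{jt_start}_{job_number}",
        "type": "reduce" if kind == "r" else "map",
        "task": int(task_number),
    }


# The three ids reported in the original message.
failed = [
    "task_201312241250_46714_r_000048",
    "task_201312241250_46714_r_000049",
    "task_201312241250_46714_r_000050",
]

parsed = [parse_task_id(t) for t in failed]
distinct_tasks = {(p["job"], p["type"], p["task"]) for p in parsed}

# Three distinct reduce tasks of one job, not three retries of a single task.
print(len(distinct_tasks), sorted(p["task"] for p in parsed))
```

If these had been retries of a single task, the task number would repeat and only the attempt suffix of the attempt ids would change.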


2014-03-27 10:22 GMT+01:00 Krishna Rao <kr...@gmail.com>:

> Hi,
>
> we have a daily Hive script that usually takes a few hours to run. The
> other day I notice one of the jobs was taking in excess of a few hours.
> Digging into it I saw that there were 3 attempts to launch a job on a
> single node:
>
> Task Id Start Time Finish Time
> Error
> task_201312241250_46714_r_000048 Error launching task
> task_201312241250_46714_r_000049 Error launching task
> task_201312241250_46714_r_000050 Error launching task
>
> I later found out that this node had a dodgy/unresponsive disk (still
> being tested right now).
>
> We've seen tasks fail in the past, but re-submitted to another node and
> succeeding. So, shouldn't this task have been kicked off on another node
> after the first failure? Is there anything I could be missing in terms of
> configuration that should be set?
>
> We're using CDH4.4.0.
>
> Cheers,
>
> Krishna
>
