Posted to mapreduce-user@hadoop.apache.org by Sergey <se...@gmail.com> on 2015/11/11 16:37:03 UTC

Re-execution of map task

Hi experts!

I see strange behaviour of Hadoop during the execution of my tasks.
It re-runs a task attempt that has already completed with SUCCEEDED status
(see the log below about attempt_1447029285980_0001_m_000012_0).

I don't know why, but this task repeats as attempt numbers 0, 1, 2, 3, 4 and
then 2000.

The same happens with some other tasks.
I also see on screen, after a task has finished, that the map progress
sometimes decreases...

I don't use preemption or speculative execution, and I don't see any
exceptions or time-outs in the YARN log
(except the last line "Container killed on request. Exit code is 143").

How can I find the reason?

I use version 2.6.0 in the Azure cloud (HDInsight).


2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
attempt_1447029285980_0001_m_000012_0 is : 1.0
2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement from
attempt_1447029285980_0001_m_000012_0
2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from RUNNING
to SUCCESS_CONTAINER_CLEANUP
2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
container_e04_1447029285980_0001_01_002951 taskAttempt
attempt_1447029285980_0001_m_000012_0
2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING
attempt_1447029285980_0001_m_000012_0
2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy:
Opening proxy : 10.0.0.8:30050
2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
attempt attempt_1447029285980_0001_m_000012_0
2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to SUCCEEDED
2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 4
2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1 AssignedMaps:8
AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0
HostLocal:0 RackLocal:16
2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
completed container container_e04_1447029285980_0001_01_002951
2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated
containers 1
2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to
reduce
2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
report from attempt_1447029285980_0001_m_000012_0: Container killed by the
ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

Re: Re-execution of map task

Posted by Varun Saxena <vs...@gmail.com>.
The reason is that the RM (ResourceManager) thinks the NodeManager
(10.0.0.5:30050 in your case) is in an unhealthy state (either a heartbeat
from the NodeManager has not reached the ResourceManager for some time, a
heartbeat has arrived out of order which might indicate a restart, etc.) and
reports this to the ApplicationMaster.
This is the reason for log messages like "TaskAttempt killed because it ran
on unusable node".
When an unusable node is reported, killing an already successful map task is
the expected behavior.
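
If it helps, these are roughly the yarn-site.xml settings that govern when
the RM declares a NodeManager dead or unhealthy. The property names and
defaults below are from my recollection of yarn-default.xml, so please
verify them against your version:

  <!-- How long the RM waits without a heartbeat before marking an NM as LOST
       (default 600000 ms = 10 minutes) -->
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>

  <!-- How often NodeManagers heartbeat to the RM (default 1000 ms) -->
  <property>
    <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
    <value>1000</value>
  </property>

  <!-- A node is also marked unhealthy when its local disks fill up past this
       threshold (default 90%) -->
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>90.0</value>
  </property>

Full local disks are a very common reason for nodes flapping between healthy
and unhealthy, so the disk threshold is worth checking first.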

Can you check whether this NodeManager went down?
Or do you see anything unusual in the active ResourceManager's logs for this
node (10.0.0.5:30050)? Something like deactivating node, unhealthy, etc.
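
For example, something along these lines; the exact command options and log
file names may differ slightly in your version/distribution:

  # List all nodes, including UNHEALTHY/LOST ones, with state and health report
  yarn node -list -all

  # Detailed status (including the health report string) for one node
  yarn node -status 10.0.0.5:30050

  # Look for state transitions for that node in the active RM's daemon log
  grep -iE "10\.0\.0\.5.*(unhealthy|expire|deactivat|lost)" yarn-*-resourcemanager-*.log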

Regards,
Varun Saxena.


On Thu, Nov 12, 2015 at 12:32 AM, Sergey <se...@gmail.com> wrote:

> Hi Varun,
> thank you!
>
> But the main question is the reason...
>
> I did check log files with yarn logs and local logs on nodes.
> And don't see reasons why Hadoop can think that nodes gone unusable.
>
>
> Regards,
> Sergey.
>
>
>
>
> 2015-11-11 21:23 GMT+03:00 Varun Saxena <vs...@gmail.com>:
>
>> Hi Sergey,
>>
>> This indicates that one or more of your Node Managers' may have gone
>> down. RM indicates this to AM on allocate response.
>> If a map task ran on such a node, its output is considered unusable even
>> though the map task has been marked as success previously.
>> Such a map task is then KILLED and a new attempt is launched.
>>
>> Regards,
>> Varun Saxena.
>>
>> On Wed, Nov 11, 2015 at 11:44 PM, Sergey <se...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> yes, there are several "failed" map, because of 600 sec time-out.
>>>
>>> I also found a lot messages like this in the log:
>>>
>>> 2015-11-09 22:00:35,882 INFO [AsyncDispatcher event handler]
>>> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed
>>> because it ran on unusable node 10.0.0.5:30050.
>>> AttemptId:attempt_1447029285980_0001_m_000043_1
>>>
>>> Different nodes got unusable status very often.
>>>
>>> Do you know something about possible reason? Maybe changing some
>>> time-out params in communication between nodes could help?
>>>
>>> As I already said I work in the cloud on Azure HDInsight.
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2015-11-11 20:33 GMT+03:00 Namikaze Minato <ll...@gmail.com>:
>>>
>>>> Hi.
>>>>
>>>> Do you also have "failed" map attempts?
>>>> Killed map attempts won't help us understand why your job is failing.
>>>>
>>>> Regards,
>>>> LLoyd
>>>>
>>>>
>>>> On 11 November 2015 at 16:37, Sergey <se...@gmail.com> wrote:
>>>> >
>>>> > Hi experts!
>>>> >
>>>> > I see strange behaviour of Hadoop while execution of my tasks.
>>>> > It re-runs task attempt which has completed with SUCCEEDED status
>>>> > (see the log below about attempt_1447029285980_0001_m_000012_0).
>>>> >
>>>> > I don't know why but this tasks repeats in attempts numbers 0,1,2,3,4
>>>> and
>>>> > than 2000.
>>>> >
>>>> > The same story with some other tasks..
>>>> > A also see on screen after execution of task that some times map
>>>> progress is
>>>> > decreasing...
>>>> >
>>>> > I don't use preemption, speculative execution and don't see any
>>>> exceptions,
>>>> > time-outs in yarn log
>>>> > (except last line "Container killed on request. Exit code is 143").
>>>> >
>>>> > How to catch the reason?
>>>> >
>>>> > I use version 2.6.0 in Azure cloud (HDInsight)
>>>> >
>>>> >
>>>> > 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
>>>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of
>>>> TaskAttempt
>>>> > attempt_1447029285980_0001_m_000012_0 is : 1.0
>>>> > 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
>>>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done
>>>> acknowledgement from
>>>> > attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>>>> RUNNING
>>>> > to SUCCESS_CONTAINER_CLEANUP
>>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>>>> > Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
>>>> > container_e04_1447029285980_0001_01_002951 taskAttempt
>>>> > attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>>>> KILLING
>>>> > attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>>> >
>>>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy:
>>>> > Opening proxy : 10.0.0.8:30050
>>>> > 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>>>> > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
>>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded
>>>> with
>>>> > attempt attempt_1447029285980_0001_m_000012_0
>>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
>>>> > task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to
>>>> SUCCEEDED
>>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed
>>>> Tasks: 4
>>>> > 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
>>>> > Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1
>>>> AssignedMaps:8
>>>> > AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0
>>>> > HostLocal:0 RackLocal:16
>>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
>>>> > completed container container_e04_1447029285980_0001_01_002951
>>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got
>>>> allocated
>>>> > containers 1
>>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned
>>>> to
>>>> > reduce
>>>> > 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
>>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>>> Diagnostics
>>>> > report from attempt_1447029285980_0001_m_000012_0: Container killed
>>>> by the
>>>> > ApplicationMaster.
>>>> > Container killed on request. Exit code is 143
>>>> > Container exited with a non-zero exit code 143
>>>>
>>>
>>>
>>
>

Re: Re-execution of map task

Posted by Sergey <se...@gmail.com>.
Hi Varun,
thank you!

But the main question is still what the cause is...

I did check the log files with yarn logs and the local logs on the nodes,
and I don't see any reason why Hadoop would consider the nodes unusable.
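
For reference, what I looked at was essentially the following (plus the
ResourceManager and NodeManager daemon logs on the nodes themselves):

  # aggregated container logs for the job
  yarn logs -applicationId application_1447029285980_0001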


Regards,
Sergey.




2015-11-11 21:23 GMT+03:00 Varun Saxena <vs...@gmail.com>:

> Hi Sergey,
>
> This indicates that one or more of your Node Managers' may have gone down.
> RM indicates this to AM on allocate response.
> If a map task ran on such a node, its output is considered unusable even
> though the map task has been marked as success previously.
> Such a map task is then KILLED and a new attempt is launched.
>
> Regards,
> Varun Saxena.
>
> On Wed, Nov 11, 2015 at 11:44 PM, Sergey <se...@gmail.com> wrote:
>
>> Hi,
>>
>> yes, there are several "failed" map, because of 600 sec time-out.
>>
>> I also found a lot messages like this in the log:
>>
>> 2015-11-09 22:00:35,882 INFO [AsyncDispatcher event handler]
>> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed
>> because it ran on unusable node 10.0.0.5:30050.
>> AttemptId:attempt_1447029285980_0001_m_000043_1
>>
>> Different nodes got unusable status very often.
>>
>> Do you know something about possible reason? Maybe changing some time-out
>> params in communication between nodes could help?
>>
>> As I already said I work in the cloud on Azure HDInsight.
>>
>>
>>
>>
>>
>>
>> 2015-11-11 20:33 GMT+03:00 Namikaze Minato <ll...@gmail.com>:
>>
>>> Hi.
>>>
>>> Do you also have "failed" map attempts?
>>> Killed map attempts won't help us understand why your job is failing.
>>>
>>> Regards,
>>> LLoyd
>>>
>>>
>>> On 11 November 2015 at 16:37, Sergey <se...@gmail.com> wrote:
>>> >
>>> > Hi experts!
>>> >
>>> > I see strange behaviour of Hadoop while execution of my tasks.
>>> > It re-runs task attempt which has completed with SUCCEEDED status
>>> > (see the log below about attempt_1447029285980_0001_m_000012_0).
>>> >
>>> > I don't know why but this tasks repeats in attempts numbers 0,1,2,3,4
>>> and
>>> > than 2000.
>>> >
>>> > The same story with some other tasks..
>>> > A also see on screen after execution of task that some times map
>>> progress is
>>> > decreasing...
>>> >
>>> > I don't use preemption, speculative execution and don't see any
>>> exceptions,
>>> > time-outs in yarn log
>>> > (except last line "Container killed on request. Exit code is 143").
>>> >
>>> > How to catch the reason?
>>> >
>>> > I use version 2.6.0 in Azure cloud (HDInsight)
>>> >
>>> >
>>> > 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
>>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of
>>> TaskAttempt
>>> > attempt_1447029285980_0001_m_000012_0 is : 1.0
>>> > 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
>>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement
>>> from
>>> > attempt_1447029285980_0001_m_000012_0
>>> > 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>>> RUNNING
>>> > to SUCCESS_CONTAINER_CLEANUP
>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>>> > Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
>>> > container_e04_1447029285980_0001_01_002951 taskAttempt
>>> > attempt_1447029285980_0001_m_000012_0
>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>>> KILLING
>>> > attempt_1447029285980_0001_m_000012_0
>>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>>> >
>>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy:
>>> > Opening proxy : 10.0.0.8:30050
>>> > 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>>> > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded
>>> with
>>> > attempt attempt_1447029285980_0001_m_000012_0
>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
>>> > task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to
>>> SUCCEEDED
>>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed
>>> Tasks: 4
>>> > 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
>>> > Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1
>>> AssignedMaps:8
>>> > AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0
>>> > HostLocal:0 RackLocal:16
>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
>>> > completed container container_e04_1447029285980_0001_01_002951
>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got
>>> allocated
>>> > containers 1
>>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to
>>> > reduce
>>> > 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
>>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>>> Diagnostics
>>> > report from attempt_1447029285980_0001_m_000012_0: Container killed by
>>> the
>>> > ApplicationMaster.
>>> > Container killed on request. Exit code is 143
>>> > Container exited with a non-zero exit code 143
>>>
>>
>>
>

Re: Re-execution of map task

Posted by Varun Saxena <vs...@gmail.com>.
Hi Sergey,

This indicates that one or more of your NodeManagers may have gone down.
The RM indicates this to the AM in the allocate response.
If a map task ran on such a node, its output (which lives on that node's
local disks rather than in HDFS) is considered unusable even though the map
task was previously marked as successful.
Such a map task is then KILLED and a new attempt is launched.

Regards,
Varun Saxena.

On Wed, Nov 11, 2015 at 11:44 PM, Sergey <se...@gmail.com> wrote:

> Hi,
>
> yes, there are several "failed" map, because of 600 sec time-out.
>
> I also found a lot messages like this in the log:
>
> 2015-11-09 22:00:35,882 INFO [AsyncDispatcher event handler]
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed
> because it ran on unusable node 10.0.0.5:30050.
> AttemptId:attempt_1447029285980_0001_m_000043_1
>
> Different nodes got unusable status very often.
>
> Do you know something about possible reason? Maybe changing some time-out
> params in communication between nodes could help?
>
> As I already said I work in the cloud on Azure HDInsight.
>
>
>
>
>
>
> 2015-11-11 20:33 GMT+03:00 Namikaze Minato <ll...@gmail.com>:
>
>> Hi.
>>
>> Do you also have "failed" map attempts?
>> Killed map attempts won't help us understand why your job is failing.
>>
>> Regards,
>> LLoyd
>>
>>
>> On 11 November 2015 at 16:37, Sergey <se...@gmail.com> wrote:
>> >
>> > Hi experts!
>> >
>> > I see strange behaviour of Hadoop while execution of my tasks.
>> > It re-runs task attempt which has completed with SUCCEEDED status
>> > (see the log below about attempt_1447029285980_0001_m_000012_0).
>> >
>> > I don't know why but this tasks repeats in attempts numbers 0,1,2,3,4
>> and
>> > than 2000.
>> >
>> > The same story with some other tasks..
>> > A also see on screen after execution of task that some times map
>> progress is
>> > decreasing...
>> >
>> > I don't use preemption, speculative execution and don't see any
>> exceptions,
>> > time-outs in yarn log
>> > (except last line "Container killed on request. Exit code is 143").
>> >
>> > How to catch the reason?
>> >
>> > I use version 2.6.0 in Azure cloud (HDInsight)
>> >
>> >
>> > 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of
>> TaskAttempt
>> > attempt_1447029285980_0001_m_000012_0 is : 1.0
>> > 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
>> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement
>> from
>> > attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>> RUNNING
>> > to SUCCESS_CONTAINER_CLEANUP
>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>> > Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
>> > container_e04_1447029285980_0001_01_002951 taskAttempt
>> > attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
>> KILLING
>> > attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
>> > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy:
>> > Opening proxy : 10.0.0.8:30050
>> > 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
>> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
>> > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded
>> with
>> > attempt attempt_1447029285980_0001_m_000012_0
>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
>> > task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to
>> SUCCEEDED
>> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed
>> Tasks: 4
>> > 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
>> > Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1
>> AssignedMaps:8
>> > AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0
>> > HostLocal:0 RackLocal:16
>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
>> > completed container container_e04_1447029285980_0001_01_002951
>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got
>> allocated
>> > containers 1
>> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
>> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to
>> > reduce
>> > 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
>> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
>> > report from attempt_1447029285980_0001_m_000012_0: Container killed by
>> the
>> > ApplicationMaster.
>> > Container killed on request. Exit code is 143
>> > Container exited with a non-zero exit code 143
>>
>
>

Re: Re-execution of map task

Posted by Sergey <se...@gmail.com>.
Hi,

yes, there are several "failed" map attempts, caused by the 600-second time-out.

I also found a lot of messages like this in the log:

2015-11-09 22:00:35,882 INFO [AsyncDispatcher event handler]
org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: TaskAttempt killed
because it ran on unusable node 10.0.0.5:30050.
AttemptId:attempt_1447029285980_0001_m_000043_1

Different nodes get marked unusable very often.

Do you know a possible reason? Would changing some time-out parameters for
the communication between nodes help?

As I already said, I work in the cloud on Azure HDInsight.
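
For reference, these are the time-out settings I am thinking of tuning first.
This is only a sketch using the stock property names from mapred-default.xml
and yarn-default.xml; I am assuming HDInsight lets me override them in
mapred-site.xml / yarn-site.xml:

  <!-- mapred-site.xml: a task attempt is killed if it neither reads input,
       writes output, nor updates its status for this long
       (default 600000 ms, i.e. the 600 s time-out above) -->
  <property>
    <name>mapreduce.task.timeout</name>
    <value>600000</value>
  </property>

  <!-- yarn-site.xml: how often each NodeManager heartbeats to the RM
       (default 1000 ms) -->
  <property>
    <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
    <value>1000</value>
  </property>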

2015-11-11 20:33 GMT+03:00 Namikaze Minato <ll...@gmail.com>:

> Hi.
>
> Do you also have "failed" map attempts?
> Killed map attempts won't help us understand why your job is failing.
>
> Regards,
> LLoyd
>
> On 11 November 2015 at 16:37, Sergey <se...@gmail.com> wrote:
> >
> > Hi experts!
> >
> > I see strange behaviour of Hadoop while execution of my tasks.
> > It re-runs task attempt which has completed with SUCCEEDED status
> > (see the log below about attempt_1447029285980_0001_m_000012_0).
> >
> > I don't know why but this tasks repeats in attempts numbers 0,1,2,3,4 and
> > than 2000.
> >
> > The same story with some other tasks..
> > A also see on screen after execution of task that some times map
> progress is
> > decreasing...
> >
> > I don't use preemption, speculative execution and don't see any
> exceptions,
> > time-outs in yarn log
> > (except last line "Container killed on request. Exit code is 143").
> >
> > How to catch the reason?
> >
> > I use version 2.6.0 in Azure cloud (HDInsight)
> >
> >
> > 2015-11-09 19:57:45,584 INFO [IPC Server handler 17 on 53153]
> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt
> > attempt_1447029285980_0001_m_000012_0 is : 1.0
> > 2015-11-09 19:57:45,592 INFO [IPC Server handler 12 on 53153]
> > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Done acknowledgement
> from
> > attempt_1447029285980_0001_m_000012_0
> > 2015-11-09 19:57:45,592 INFO [AsyncDispatcher event handler]
> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
> RUNNING
> > to SUCCESS_CONTAINER_CLEANUP
> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
> > Processing the event EventType: CONTAINER_REMOTE_CLEANUP for container
> > container_e04_1447029285980_0001_01_002951 taskAttempt
> > attempt_1447029285980_0001_m_000012_0
> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
> > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
> KILLING
> > attempt_1447029285980_0001_m_000012_0
> > 2015-11-09 19:57:45,593 INFO [ContainerLauncher #4]
> > org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy:
> > Opening proxy : 10.0.0.8:30050
> > 2015-11-09 19:57:45,906 INFO [AsyncDispatcher event handler]
> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl:
> > attempt_1447029285980_0001_m_000012_0 TaskAttempt Transitioned from
> > SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl: Task succeeded with
> > attempt attempt_1447029285980_0001_m_000012_0
> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskImpl:
> > task_1447029285980_0001_m_000012 Task Transitioned from RUNNING to
> SUCCEEDED
> > 2015-11-09 19:57:45,907 INFO [AsyncDispatcher event handler]
> > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed
> Tasks: 4
> > 2015-11-09 19:57:46,553 INFO [RMCommunicator Allocator]
> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before
> > Scheduling: PendingReds:0 ScheduledMaps:35 ScheduledReds:1 AssignedMaps:8
> > AssignedReds:0 CompletedMaps:4 CompletedReds:0 ContAlloc:16 ContRel:0
> > HostLocal:0 RackLocal:16
> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Received
> > completed container container_e04_1447029285980_0001_01_002951
> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated
> > containers 1
> > 2015-11-09 19:57:48,575 INFO [RMCommunicator Allocator]
> > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to
> > reduce
> > 2015-11-09 19:57:48,575 INFO [AsyncDispatcher event handler]
> > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics
> > report from attempt_1447029285980_0001_m_000012_0: Container killed by
> the
> > ApplicationMaster.
> > Container killed on request. Exit code is 143
> > Container exited with a non-zero exit code 143
>
