Posted to user@hadoop.apache.org by Krishna Kishore Bonagiri <wr...@gmail.com> on 2014/03/04 15:53:09 UTC

Node manager or Resource Manager crash

Hi,
  I am running an application on a 2-node cluster. It tries to acquire
all of the containers available on one of the nodes, and the remaining
containers from the other node in the cluster. When I run this application
continuously in a loop, either the NM or the RM gets killed at a random
point, and there is no corresponding message in the log files.

One of the times the NM got killed today, the tail of its log looked
like this:

2014-03-04 02:42:44,386 DEBUG
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
isredeng:52867 sending out status for 16 containers
2014-03-04 02:42:44,386 DEBUG
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
health-status : true,


And at the time of NM's crash, the RM's log has the following entries:

2014-03-04 02:42:40,371 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
isredeng:52867 of type STATUS_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher:
Dispatching the event
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
NODE_UPDATE
2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
Responder: responding to
org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
2014-03-04 02:42:40,371 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
nodeUpdate: isredeng:52867 clusterResources:
<memory:16384, vCores:16>
2014-03-04 02:42:40,371 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Node being looked for scheduling isredeng:52867
availableResource: <memory:0, vCores:-8>
2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151


Note: the name of the node on which the NM got killed is isredeng. Do the
above messages indicate anything about why it was killed?
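One detail in the RM log above stands out: availableResource is
<memory:0, vCores:-8>. With the CapacityScheduler's default resource
calculator, only memory is enforced when placing containers, so granting
more vCores than a node advertises drives the available count negative.
A minimal sketch of the arithmetic (the helper below is illustrative,
not a YARN API):

```python
# Illustrative helper (not a YARN API): available vCores on a node is
# capacity minus everything currently allocated. If the scheduler only
# enforces memory, vCore allocations can exceed capacity and go negative.
def available_vcores(capacity, allocated):
    return capacity - sum(allocated)

# A 16-vCore node (as in the log: <memory:16384, vCores:16>) that has
# granted three 8-vCore containers:
print(available_vcores(16, [8, 8, 8]))  # -8
```

A negative value is not itself a crash, but it does mean the node is
oversubscribed on CPU, which makes an external kill (e.g. by the OS)
more plausible.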

Thanks,
Kishore

Re: Node manager or Resource Manager crash

Posted by Krishna Kishore Bonagiri <wr...@gmail.com>.
Vinod,

  One more observation I can share: every time the NM or RM gets killed,
I see the following kind of messages in the NM's log:

2014-03-05 05:33:23,824 DEBUG
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
health-status : true,
2014-03-05 05:33:23,824 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(2132631259) connection to isredeng/9.70.137.184:8031 from kbonagir sending
#5391
2014-03-05 05:33:23,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(2132631259) connection to isredeng/9.70.137.184:8031 from kbonagir got
value #5391
2014-03-05 05:33:23,826 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
Call: nodeHeartbeat took 2ms


Does that give any clue? Is something going wrong while the node's health
status is being fetched?

Thanks,
Kishore



On Tue, Mar 4, 2014 at 10:51 PM, Vinod Kumar Vavilapalli <vinodkv@apache.org
> wrote:

> I remember you asking this question before. Check if your OS' OOM killer
> is killing it.
>
> +Vinod
>
> On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <
> write2kishore@gmail.com> wrote:
>
> Hi,
>   I am running an application on a 2-node cluster, which tries to acquire
> all the containers that are available on one of those nodes and remaining
> containers from the other node in the cluster. When I run this application
> continuously in a loop, one of the NM or RM is getting killed at a random
> point. There is no corresponding message in the log files.
>
> One of the times that NM had got killed today, the tail of the it's log is
> like this:
>
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
> health-status : true,
>
>
> And at the time of NM's crash, the RM's log has the following entries:
>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
> isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
> NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
> Responder: responding to
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> nodeUpdate: isredeng:52867 clusterResources:
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Node being looked for scheduling isredeng:52867
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151
>
>
> Note: the name of the node on which NM has got killed is isredeng, does it
> indicate anything from the above message as to why it got killed?
>
> Thanks,
> Kishore
>
>
>
>
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.

Re: Node manager or Resource Manager crash

Posted by Krishna Kishore Bonagiri <wr...@gmail.com>.
Yes Vinod, I asked this question some time back, and I have now come back
to resolving the issue.

I checked whether the OOM killer is responsible, but it is not. I watched
the free swap space on my box while the test was running, and it does not
seem to be the issue. I also verified whether the OOM score of any of these
processes was climbing, since that is when the OOM killer kills them, but
the scores are not going up either.
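For reference, the numbers the kernel's OOM killer uses to pick victims
are visible per process under /proc. A small sketch of reading them on
Linux (substitute the NM or RM pid for os.getpid()):

```python
# Sketch: read a process's OOM score and badness adjustment from /proc.
# These are the values the Linux OOM killer consults when choosing a
# victim; a persistently high oom_score marks a likely target.
import os

def oom_stats(pid):
    """Return /proc/<pid>/oom_score and oom_score_adj as ints."""
    stats = {}
    for name in ("oom_score", "oom_score_adj"):
        with open("/proc/%d/%s" % (pid, name)) as f:
            stats[name] = int(f.read().strip())
    return stats

# Example: inspect the current process (substitute the NM/RM pid).
print(oom_stats(os.getpid()))
```

An actual OOM kill is also logged by the kernel itself, which is the more
definitive check than watching scores.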

Thanks,
Kishore


On Tue, Mar 4, 2014 at 10:51 PM, Vinod Kumar Vavilapalli <vinodkv@apache.org
> wrote:

> I remember you asking this question before. Check if your OS' OOM killer
> is killing it.
>
> +Vinod
>
> On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <
> write2kishore@gmail.com> wrote:
>
> Hi,
>   I am running an application on a 2-node cluster, which tries to acquire
> all the containers that are available on one of those nodes and remaining
> containers from the other node in the cluster. When I run this application
> continuously in a loop, one of the NM or RM is getting killed at a random
> point. There is no corresponding message in the log files.
>
> One of the times that NM had got killed today, the tail of the it's log is
> like this:
>
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl:
> isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's
> health-status : true,
>
>
> And at the time of NM's crash, the RM's log has the following entries:
>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing
> isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType:
> NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server
> Responder: responding to
> org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> nodeUpdate: isredeng:52867 clusterResources:
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
> Node being looked for scheduling isredeng:52867
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151
>
>
> Note: the name of the node on which NM has got killed is isredeng, does it
> indicate anything from the above message as to why it got killed?
>
> Thanks,
> Kishore
>
>
>
>
>

Re: Node manager or Resource Manager crash

Posted by Vinod Kumar Vavilapalli <vi...@apache.org>.
I remember you asking this question before. Check if your OS' OOM killer is killing it.

+Vinod
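This suggestion can be checked directly from the kernel ring buffer: an
OOM kill always leaves a "Killed process" line there. A defensive sketch
(assumes a Linux host; dmesg may need elevated privileges, in which case
the list simply comes back empty):

```python
# Sketch: scan the kernel ring buffer for OOM-killer messages.
# If dmesg is missing or unreadable, return an empty list rather
# than failing, so this is safe to run speculatively.
import subprocess

def oom_kill_lines():
    try:
        out = subprocess.run(["dmesg"], capture_output=True,
                             text=True).stdout
    except (FileNotFoundError, OSError):
        out = ""
    return [line for line in out.splitlines()
            if "out of memory" in line.lower()
            or "killed process" in line.lower()]
```

If this returns nothing around the crash timestamps, the OOM killer can
reasonably be ruled out and attention shifts to other external signals.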

On Mar 4, 2014, at 6:53 AM, Krishna Kishore Bonagiri <wr...@gmail.com> wrote:

> Hi,
>   I am running an application on a 2-node cluster, which tries to acquire all the containers that are available on one of those nodes and remaining containers from the other node in the cluster. When I run this application continuously in a loop, one of the NM or RM is getting killed at a random point. There is no corresponding message in the log files.
> 
> One of the times the NM got killed today, the tail of its log looked like this:
> 
> 2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: isredeng:52867 sending out status for 16 containers
> 2014-03-04 02:42:44,386 DEBUG org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node's health-status : true,
> 
> 
> And at the time of NM's crash, the RM's log has the following entries:
> 
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Processing isredeng:52867 of type STATUS_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.event.AsyncDispatcher: Dispatching the event 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.ipc.Server: IPC Server Responder: responding to org.apache.hadoop.yarn.server.api.ResourceTrackerPB.nodeHeartbeat from 
> 9.70.137.184:33696 Call#14060 Retry#0 Wrote 40 bytes.
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: nodeUpdate: isredeng:52867 clusterResources: 
> <memory:16384, vCores:16>
> 2014-03-04 02:42:40,371 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Node being looked for scheduling isredeng:52867 
> availableResource: <memory:0, vCores:-8>
> 2014-03-04 02:42:40,393 DEBUG org.apache.hadoop.ipc.Server:  got #151
> 
> 
> Note: the node on which the NM was killed is isredeng. Does the above message indicate why it was killed?
> 
> Thanks,
> Kishore
> 
> 
> 


