Posted to user@hadoop.apache.org by Dmitry Sivachenko <tr...@gmail.com> on 2015/09/23 00:27:29 UTC

node remains unused after reboot

Hello!

I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
In the middle of this process one server reboots.

After the reboot, the nodemanager starts successfully and registers with the resource manager:
2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests

In the YARN web interface I see this host as active, but VCores used remains zero (see screenshot).
But the map job mentioned is still running and has about 12000 pending tasks.

Why does this host not receive tasks to run?

PS: I recently upgraded from 2.4.1 and did not notice this problem with 2.4.1: new tasks were spawned immediately after a reboot.

Thanks!

Re: node remains unused after reboot

Posted by Dmitry Sivachenko <tr...@gmail.com>.

> On 23 Sep 2015, at 22:08, Naganarasimha Garla <na...@gmail.com> wrote:
> 
> Sorry for the late reply; I wanted to give you some search strings for blacklisting, hence the slight delay.
> As Varun mentioned, this looks more like an application blacklisting case. mapreduce.job.maxtaskfailures.per.tracker is 3 by default, so in the scenario you describe the node has most likely been blacklisted.
> You can search the INFO logs for the string "Blacklisted host <host>", which is logged by the RMContainerRequestor class.
> 


Thanks for the hint!
Yes, I see from the logs that this node was blacklisted.
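
A minimal sketch of why a rebooted node can stay idle for the rest of a job, assuming a simplified model of per-job blacklisting: the application master counts task failures per host and stops requesting containers on a host once the count reaches mapreduce.job.maxtaskfailures.per.tracker. The class and method names below (NodeBlacklist, record_failure, can_schedule_on) and the host names are hypothetical; this is not the MapReduce application master's actual code.

```python
from collections import defaultdict


class NodeBlacklist:
    """Toy model of per-job node blacklisting (illustrative only)."""

    def __init__(self, max_failures_per_node=3):
        # Mirrors the default of mapreduce.job.maxtaskfailures.per.tracker
        # mentioned in the thread.
        self.max_failures_per_node = max_failures_per_node
        self.failures = defaultdict(int)
        self.blacklisted = set()

    def record_failure(self, host):
        # Count one failed task attempt on this host and blacklist the
        # host once the threshold is reached.
        self.failures[host] += 1
        if self.failures[host] >= self.max_failures_per_node:
            self.blacklisted.add(host)

    def can_schedule_on(self, host):
        # A blacklisted host gets no more tasks from this job.
        return host not in self.blacklisted


blacklist = NodeBlacklist(max_failures_per_node=3)
# Assumption: the reboot killed at least three task attempts on the host.
for _ in range(3):
    blacklist.record_failure("node42.example.com")

print(blacklist.can_schedule_on("node42.example.com"))
print(blacklist.can_schedule_on("node43.example.com"))
```

In this model nothing clears the entry when the NodeManager re-registers, which is consistent with what was seen in the thread: the host only received tasks again once the next job started.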

Re: node remains unused after reboot

Posted by Naganarasimha Garla <na...@gmail.com>.

Sorry for the late reply; I wanted to give you some search strings for
blacklisting, hence the slight delay.
As Varun mentioned, this looks more like an application blacklisting case.
mapreduce.job.maxtaskfailures.per.tracker is 3 by default, so in the
scenario you describe the node has most likely been blacklisted.
You can search the INFO logs for the string "*Blacklisted host <host>*",
which is logged by the RMContainerRequestor class.
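
The log search suggested above could be sketched as follows. This assumes the application master log has already been fetched as text lines; the function name find_blacklisted_hosts is hypothetical, and the sample lines are made up for illustration (only the "Blacklisted host <host>" message text comes from this thread).

```python
import re

# Matches the message text quoted in this thread and captures the host name.
BLACKLIST_PATTERN = re.compile(r"Blacklisted host (\S+)")


def find_blacklisted_hosts(log_lines):
    """Return the hosts named in 'Blacklisted host' messages, in first-seen order."""
    hosts = []
    for line in log_lines:
        match = BLACKLIST_PATTERN.search(line)
        if match and match.group(1) not in hosts:
            hosts.append(match.group(1))
    return hosts


# Made-up sample lines; a real AM log will have a different prefix layout.
sample_log = [
    "2015-09-23 00:58:01,100 INFO Blacklisted host node42.example.com",
    "2015-09-23 00:58:02,200 INFO some unrelated message",
    "2015-09-23 00:59:10,300 INFO Blacklisted host node42.example.com",
]

print(find_blacklisted_hosts(sample_log))
```

To run it against a real log file, the lines can be read with open(path) and passed in directly, since the function only needs an iterable of strings.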

*What do these mean?*
As per the defect in *YARN-3990*, if many events are clogged in the queue
(seen in the logs as *Size of event-queue is 14000*), events may be
processed late, and container assignment is delayed as a result. From the
description you shared, that does not seem to be the case here. But how
many finished applications were there? More nodes and more
applications (finished or running) can cause this.
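
As a rough way to compare a ResourceManager log against the YARN-3990 symptom, the reported queue sizes can be pulled out of the "Size of event-queue is N" messages. The sketch below uses the log lines quoted in this thread as input; the function name max_event_queue_size is hypothetical.

```python
import re

# Captures the number reported by the AsyncDispatcher message.
QUEUE_SIZE_PATTERN = re.compile(r"Size of event-queue is (\d+)")


def max_event_queue_size(log_lines):
    """Return the largest event-queue size reported, or 0 if none is found."""
    sizes = []
    for line in log_lines:
        match = QUEUE_SIZE_PATTERN.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return max(sizes, default=0)


# Lines from the cluster discussed in this thread.
this_cluster = [
    "2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000",
    "2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000",
]

# Lines given in the thread as an example of the YARN-3990 symptom.
yarn_3990_example = [
    "2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235",
    "2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235",
]

print(max_event_queue_size(this_cluster))
print(max_event_queue_size(yarn_3990_example))
```

The queue on this cluster peaked well below the sizes shown for YARN-3990, which fits the conclusion above that event clogging is probably not the cause here.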

+ Naga

On Wed, Sep 23, 2015 at 11:39 PM, Varun Vasudev <vv...@apache.org> wrote:

> Hi Dmitry,
>
> Did you check the MR AM logs to see if the node was blacklisted for too
> many container failures?
>
> -Varun
>
>
>
> On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <tr...@gmail.com> wrote:
>
> >
> >> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga) <garlanaganarasimha@huawei.com> wrote:
> >>
> >> Hi Dmitry,
> >> This seems to be an interesting case; I would like some more clarification:
> >> 1. How many NMs are there? Is it a heterogeneous cluster, or do all the nodes have the same resource capacity? With 3000 cores and identical configuration I would expect around 100 nodes; am I correct?
> >
> >
> >I have 1 NN (and 1 SNN).
> >To be precise, I have 113 32-core machines assigned to run jobs
> (113*32=3616 total VCores)
> >
> >
> >> 2. How many applications are running and how many have finished (basically, are available in the RM)? By 35000 do you mean finished and running applications?
> >
> >There was 1 application running at that time (with 35000 map tasks)
> >
> >
> >> 3. Are tasks getting assigned after some time? Also, is it only this host that is not getting containers assigned, or do other hosts not get any either?
> >
> >
> >This machine was excluded from running tasks for that job.  It got tasks assigned after almost 1.5 hours, when the first job (during which the machine failed) had finished and the next job was started; see the timestamps:
> >
> >
> >
> >2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl
> (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying
> ContainerManager to unblock new container-requests
> >2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007]
> ipc.Server (Server.java:saslProcess(1316)) - Auth successful for
> appattempt_1441808341485_1975_000001 (auth:SIMPLE)
> >
> >
> >Previous job (during which that node rebooted) did not run more tasks on
> this host.
> >
> >
> >>
> >> I suspect this issue might be similar to YARN-3990, hence the above questions. You can also check the RM logs and tell me whether you see logs similar to those below:
> >> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size
> of event-queue is 14000 | AsyncDispatcher.java:235
> >> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size
> of event-queue is 15000 | AsyncDispatcher.java:235
> >
> >
> >There were 2 of these:
> >2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler]
> event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of
> event-queue is 1000
> >2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler]
> event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of
> event-queue is 1000
> >
> >
> >What do these mean?
> >
> >
> >>
> >>
> >> Regards,
> >> + Naga
> >>
> >>
> >> From: Dmitry Sivachenko [trtrmitya@gmail.com]
> >> Sent: Wednesday, September 23, 2015 03:57
> >> To: user@hadoop.apache.org
> >> Subject: node remains unused after reboot
> >>
> >> Hello!
> >>
> >> I am using hadoop-2.7.1. I have a large map job running (total cores
> available on the cluster about 3000, total tasks 35000).
> >> In the middle of this process one server reboots.
> >>
> >> After the reboot, the nodemanager starts successfully and registers with the resource manager:
> >> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl
> (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying
> ContainerManager to unblock new container-requests
> >>
> >> In YARN web-interface I see this host as active, but VCores used
> remains zero (see screenshot).
> >> But the map job mentioned is still running and has about 12000 pending tasks.
> >>
> >> Why does this host not receive tasks to run?
> >>
> >> PS: I recently upgraded from 2.4.1 and I did not notice such a problem
> with 2.4.1: new tasks were spawning immediately after reboot.
> >>
> >> Thanks!
> >>
> >>
> >>
> >>
> >> <Screen Shot 2015-09-23 at 1.22.10.png>
> >
>
>

Re: node remains unused after reboot

Posted by Varun Vasudev <vv...@apache.org>.

Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many container failures?

-Varun



On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <tr...@gmail.com> wrote:

>
>> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga) <ga...@huawei.com> wrote:
>> 
>> Hi Dmitry,
>> This seems to be an interesting case; I would like some more clarification:
>> 1. How many NMs are there? Is it a heterogeneous cluster, or do all the nodes have the same resource capacity? With 3000 cores and identical configuration I would expect around 100 nodes; am I correct?
>
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 total VCores)
>
>
>> 2. How many applications are running and how many have finished (basically, are available in the RM)? By 35000 do you mean finished and running applications?
>
>There was 1 application running at that time (with 35000 map tasks)
>
>
>> 3. Are tasks getting assigned after some time? Also, is it only this host that is not getting containers assigned, or do other hosts not get any either?
>
>
>This machine was excluded from running tasks for that job.  It got tasks assigned after almost 1.5 hours, when the first job (during which the machine failed) had finished and the next job was started; see the timestamps:
>
>
>
>2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>
>Previous job (during which that node rebooted) did not run more tasks on this host.
>
>
>> 
>> I suspect this issue might be similar to YARN-3990, hence the above questions. You can also check the RM logs and tell me whether you see logs similar to those below:
>> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
>
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>
>
>What do these mean?
>
>
>> 
>> 
>> Regards,
>> + Naga
>> 
>> 
>> From: Dmitry Sivachenko [trtrmitya@gmail.com]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: user@hadoop.apache.org
>> Subject: node remains unused after reboot
>> 
>> Hello!
>> 
>> I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
>> In the middle of this process one server reboots.
>> 
>> After the reboot, the nodemanager starts successfully and registers with the resource manager:
>> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>> 
>> In YARN web-interface I see this host as active, but VCores used remains zero (see screenshot).
>> But the map job mentioned is still running and has about 12000 pending tasks.
>> 
>> Why does this host not receive tasks to run?
>> 
>> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 2.4.1: new tasks were spawning immediately after reboot.
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>> <Screen Shot 2015-09-23 at 1.22.10.png>
>
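
The "almost 1.5 hours" of idle time mentioned in the quoted reply can be checked from the two timestamps given there: the NodeManager registration line and the first log line for the next job's application attempt. A small sketch, assuming the usual "yyyy-MM-dd HH:mm:ss,SSS" log timestamp layout; the helper name parse_log_time is hypothetical.

```python
from datetime import datetime

# Hadoop-style log timestamp: date, time, then comma-separated milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp):
    """Parse a log timestamp such as '2015-09-23 01:06:24,656'."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


# Timestamps copied from the log lines quoted in the thread.
registered = parse_log_time("2015-09-23 01:06:24,656")
next_job_seen = parse_log_time("2015-09-23 02:29:33,301")

idle = next_job_seen - registered
idle_minutes = idle.total_seconds() / 60

print(idle)
print(round(idle_minutes))
```

The gap comes out at about 83 minutes, so the node sat unused for roughly an hour and twenty minutes after it had registered.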

Re: node remains unused after reboot

Posted by Varun Vasudev <vv...@apache.org>.

Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many container failures?

-Varun



On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <tr...@gmail.com> wrote:

>
>> On 23 сент. 2015 г., at 7:02, Naganarasimha G R (Naga) <ga...@huawei.com> wrote:
>> 
>> Hi Dmitry,
>> Seems to be an interesting case, would like some more clarifications in this regard :
>> 1. How many NM's ? Is it a hetergenous cluster or all the nodes have same resource capacity ? by 3000 cores if same config then expecting around 100 nodes, am i correct ?
>
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 total VCores)
>
>
>> 2. How many applications are running and how many have got finished (basically available in RM) ? By 35000 you mean finished and running applications ?
>
>There were 1 application running at that time (with 35000 map tasks)
>
>
>> 3. Weather after some time, tasks are getting assigned ? Also is it only this host not getting assigned or no other host also gets any containers assigned ?
>
>
>This machine were excluded from running tasks for that job.  It got tasks assigned after almost 1.5 hours when first job (during which machine failed) was finished and next job was started, see timestampts:
>
>
>
>2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>
>Previous job (during which that node rebooted) did not run more tasks on this host.
>
>
>> 
>> I suspect this issue might be similar to YARN-3990, hence the above questions. Further you can check the RM logs and inform weather you see some similar logs as below
>> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
>
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>
>
>What does these mean?
>
>
>> 
>> 
>> Regards,
>> + Naga
>> 
>> 
>> From: Dmitry Sivachenko [trtrmitya@gmail.com]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: user@hadoop.apache.org
>> Subject: node remains unused after reboot
>> 
>> Hello!
>> 
>> I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
>> In the middle of this process one server reboots.
>> 
>> After reboot, nodemanager starts successfully end registers with resource manager:
>> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>> 
>> In YARN web-interface I see this host as active, but VCores used remains zero (see screenshot).
>> But the map job mentioned is still running and have about 12000 pending tasks.
>> 
>> Why this host does not receive tasks to run?
>> 
>> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 2.4.1: new tasks were spawning immediately after reboot.
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>> <Screen Shot 2015-09-23 at 1.22.10.png>
>

Re: node remains unused after reboot

Posted by Varun Vasudev <vv...@apache.org>.

Hi Dmitry,

Did you check the MR AM logs to see if the node was blacklisted for too many container failures?

-Varun



On 9/23/15, 12:26 PM, "Dmitry Sivachenko" <tr...@gmail.com> wrote:

>
>> On 23 сент. 2015 г., at 7:02, Naganarasimha G R (Naga) <ga...@huawei.com> wrote:
>> 
>> Hi Dmitry,
>> Seems to be an interesting case, would like some more clarifications in this regard :
>> 1. How many NM's ? Is it a hetergenous cluster or all the nodes have same resource capacity ? by 3000 cores if same config then expecting around 100 nodes, am i correct ?
>
>
>I have 1 NN (and 1 SNN).
>To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 total VCores)
>
>
>> 2. How many applications are running and how many have finished (basically, how many are available in the RM)? By 35000, do you mean finished and running applications?
>
>There was 1 application running at that time (with 35000 map tasks).
>
>
>> 3. Are tasks getting assigned after some time? Also, is it only this host that is not getting containers assigned, or are other hosts not getting any either?
>
>
>This machine was excluded from running tasks for that job.  It got tasks assigned after almost 1.5 hours, when the first job (during which the machine failed) had finished and the next job was started; see the timestamps:
>
>
>
>2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
>
>
>The previous job (during which that node rebooted) did not run any more tasks on this host.
>
>
>> 
>> I suspect this issue might be similar to YARN-3990, hence the above questions. Further, you can check the RM logs and report whether you see logs similar to the ones below:
>> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
>> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
>
>
>There were 2 of these:
>2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
>
>
>What do these mean?
>
>
>> 
>> 
>> Regards,
>> + Naga
>> 
>> 
>> From: Dmitry Sivachenko [trtrmitya@gmail.com]
>> Sent: Wednesday, September 23, 2015 03:57
>> To: user@hadoop.apache.org
>> Subject: node remains unused after reboot
>> 
>> Hello!
>> 
>> I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
>> In the middle of this process one server reboots.
>> 
>> After the reboot, the nodemanager starts successfully and registers with the resource manager:
>> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
>> 
>> In the YARN web interface I see this host as active, but VCores used remains zero (see screenshot).
>> But the map job mentioned is still running and has about 12000 pending tasks.
>> 
>> Why does this host not receive tasks to run?
>> 
>> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 2.4.1: new tasks were spawning immediately after reboot.
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>> <Screen Shot 2015-09-23 at 1.22.10.png>
>

Re: node remains unused after reboot

Posted by Dmitry Sivachenko <tr...@gmail.com>.

> On 23 Sep 2015, at 7:02, Naganarasimha G R (Naga) <ga...@huawei.com> wrote:
> 
> Hi Dmitry,
> This seems to be an interesting case; I would like some more clarification:
> 1. How many NMs are there? Is it a heterogeneous cluster, or do all the nodes have the same resource capacity? With 3000 cores and identical configs I would expect around 100 nodes, am I correct?


I have 1 NN (and 1 SNN).
To be precise, I have 113 32-core machines assigned to run jobs (113*32=3616 total VCores)


> 2. How many applications are running and how many have finished (basically, how many are available in the RM)? By 35000, do you mean finished and running applications?

There was 1 application running at that time (with 35000 map tasks).


> 3. Are tasks getting assigned after some time? Also, is it only this host that is not getting containers assigned, or are other hosts not getting any either?


This machine was excluded from running tasks for that job.  It got tasks assigned after almost 1.5 hours, when the first job (during which the machine failed) had finished and the next job was started; see the timestamps:



2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server (Server.java:saslProcess(1316)) - Auth successful for appattempt_1441808341485_1975_000001 (auth:SIMPLE)
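
The gap between the two log lines above can be computed directly from their timestamps. This is a small sketch; the helper name and the assumption that every line starts with a 23-character "YYYY-MM-DD HH:MM:SS,mmm" timestamp are illustrative and based only on the lines quoted here.

```python
from datetime import datetime

# Timestamp layout of the quoted log lines: "YYYY-MM-DD HH:MM:SS,mmm".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_timestamp(line: str) -> datetime:
    """Parse the leading 23-character timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


# Beginnings of the two lines quoted above.
registered = parse_log_timestamp(
    "2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl")
first_auth = parse_log_timestamp(
    "2015-09-23 02:29:33,301 INFO  [Socket Reader #1 for port 10007] ipc.Server")

idle = first_auth - registered
print(idle)  # 1:23:08.645000
```

So the node sat idle for roughly 1 hour 23 minutes between re-registering and authenticating the next application attempt.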


The previous job (during which that node rebooted) did not run any more tasks on this host.


> 
> I suspect this issue might be similar to YARN-3990, hence the above questions. Further, you can check the RM logs and report whether you see logs similar to the ones below:
> 2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
> 2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235


There were 2 of these:
2015-09-23 00:54:39,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000
2015-09-23 01:06:24,623 INFO  [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:handle(235)) - Size of event-queue is 1000


What do these mean?


> 
> 
> Regards,
> + Naga
> 
> 
> From: Dmitry Sivachenko [trtrmitya@gmail.com]
> Sent: Wednesday, September 23, 2015 03:57
> To: user@hadoop.apache.org
> Subject: node remains unused after reboot
> 
> Hello!
> 
> I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
> In the middle of this process one server reboots.
> 
> After the reboot, the nodemanager starts successfully and registers with the resource manager:
> 2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests
> 
> In the YARN web interface I see this host as active, but VCores used remains zero (see screenshot).
> But the map job mentioned is still running and has about 12000 pending tasks.
> 
> Why does this host not receive tasks to run?
> 
> PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 2.4.1: new tasks were spawning immediately after reboot.
> 
> Thanks!
> 
> 
> 
> 
> <Screen Shot 2015-09-23 at 1.22.10.png>

RE: node remains unused after reboot

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.

Hi Dmitry,
This seems to be an interesting case; I would like some more clarification:
1. How many NMs are there? Is it a heterogeneous cluster, or do all the nodes have the same resource capacity? With 3000 cores and identical configs I would expect around 100 nodes, am I correct?
2. How many applications are running and how many have finished (basically, how many are available in the RM)? By 35000, do you mean finished and running applications?
3. Are tasks getting assigned after some time? Also, is it only this host that is not getting containers assigned, or are other hosts not getting any either?

I suspect this issue might be similar to YARN-3990, hence the above questions. Further, you can check the RM logs and report whether you see logs similar to the ones below:

2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235
2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235
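
To look for such lines without reading the whole RM log by eye, a short script can pull out the reported queue sizes. This is only a sketch: it matches the message text "Size of event-queue is N", which is common to both log layouts quoted in this thread, and the function name is made up for illustration.

```python
import re
from typing import Iterable, Optional

# Message text shared by both log layouts quoted in this thread.
QUEUE_SIZE_RE = re.compile(r"Size of event-queue is (\d+)")


def max_event_queue_size(lines: Iterable[str]) -> Optional[int]:
    """Return the largest event-queue size reported, or None if none is logged."""
    sizes = [int(m.group(1)) for m in map(QUEUE_SIZE_RE.search, lines) if m]
    return max(sizes) if sizes else None


# The two example lines quoted above.
sample = [
    "2015-07-29 19:39:03,416 | INFO  | AsyncDispatcher event handler | Size of event-queue is 14000 | AsyncDispatcher.java:235",
    "2015-07-29 19:39:03,417 | INFO  | AsyncDispatcher event handler | Size of event-queue is 15000 | AsyncDispatcher.java:235",
]
print(max_event_queue_size(sample))  # 15000
```

A value that keeps climbing across consecutive lines, as in the example above, would point to a growing dispatcher backlog; a few isolated small values would not.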

Regards,
+ Naga


________________________________
From: Dmitry Sivachenko [trtrmitya@gmail.com]
Sent: Wednesday, September 23, 2015 03:57
To: user@hadoop.apache.org
Subject: node remains unused after reboot

Hello!

I am using hadoop-2.7.1. I have a large map job running (total cores available on the cluster about 3000, total tasks 35000).
In the middle of this process one server reboots.

After the reboot, the nodemanager starts successfully and registers with the resource manager:
2015-09-23 01:06:24,656 INFO  [main] nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:registerWithRM(311)) - Notifying ContainerManager to unblock new container-requests

In the YARN web interface I see this host as active, but VCores used remains zero (see screenshot).
But the map job mentioned is still running and has about 12000 pending tasks.

Why does this host not receive tasks to run?

PS: I recently upgraded from 2.4.1 and I did not notice such a problem with 2.4.1: new tasks were spawning immediately after reboot.

Thanks!




[cid:D5DB63EB-D60D-4301-8A5A-4C8FFE970F71@yandex.ru]
