Posted to dev@myriad.apache.org by "Sarjeet Singh (JIRA)" <ji...@apache.org> on 2015/08/25 02:58:47 UTC

[jira] [Updated] (MYRIAD-128) Issue with Flex down, Pending NMs stuck in staging and don't get to active task.

     [ https://issues.apache.org/jira/browse/MYRIAD-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sarjeet Singh updated MYRIAD-128:
---------------------------------
    Attachment: Screen Shot 2015-08-24 at 5.51.38 PM.png

Myriad UI screenshot

> Issue with Flex down, Pending NMs stuck in staging and don't get to active task.
> --------------------------------------------------------------------------------
>
>                 Key: MYRIAD-128
>                 URL: https://issues.apache.org/jira/browse/MYRIAD-128
>             Project: Myriad
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: Myriad 0.1.0
>            Reporter: Sarjeet Singh
>         Attachments: Screen Shot 2015-08-24 at 5.51.38 PM.png
>
>
> Seeing an issue when flexing NMs from the Myriad UI. On flexing down an active NM, pending NMs don't go to the active state (they are not showing in 'Active Tasks'), and no active NM is shown on the Myriad UI, although an NM is running on the node (verified with jps):
> mapr     20528 20526  1 17:23 ?        00:00:26 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.85.x86_64/bin/java -Dproc_nodemanager -Xmx1000m -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log -Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dnodemanager.resource.io-spindles=4.0 -Dyarn.resourcemanager.hostname=testrm.marathon.mesos -Dyarn.nodemanager.container-executor.class=org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor -Dnodemanager.resource.cpu-vcores=0 -Dnodemanager.resource.memory-mb=0 -Dmyriad.yarn.nodemanager.address=0.0.0.0:31000 -Dmyriad.yarn.nodemanager.localizer.address=0.0.0.0:31001 -Dmyriad.yarn.nodemanager.webapp.address=0.0.0.0:31002 -Dmyriad.mapreduce.shuffle.port=0.0.0.0:31003 -Dhadoop.login=maprsasl -Dhttps.protocols=TLSv1.2 -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Dzookeeper.sasl.clientconfig=Client_simple -Dzookeeper.saslprovider=com.mapr.security.simplesasl.SimpleSaslProvider -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dyarn.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log -Dyarn.home.dir=/opt/mapr/hadoop/hadoop-2.7.0 -Dhadoop.home.dir=/opt/mapr/hadoop/hadoop-2.7.0 -Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -classpath 
/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/hdfs/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/*:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/lib/*:/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/nm-config/log4j.properties:/opt/mapr/lib/JPam-1.1.jar org.apache.hadoop.yarn.server.nodemanager.NodeManager
> From the Myriad UI:
> Active Tasks: (empty)
> Killable Tasks: (empty)
> Pending Tasks: (empty)
> Staging Tasks:
> nm.large.123badb1-57d8-4bd2-aa2e-de9fc1898c7f
> nm.medium.f2c4126c-4cb2-46af-a1e0-690034b914b8
> nm.medium.a9e9fd84-350a-48bc-bcd2-8712ecdc8c66
> nm.medium.663f9c6e-f28e-4395-8540-70c306eb04c5
> nm.medium.93f7cc91-9263-48a7-821e-3b0ffbe70e66
> This is still the state even after waiting about 30 minutes following the flex-down of the NM.
> I tried this on a single-node cluster, but it looks like the problem could happen in any case.
> I started the RM from Marathon and got the RM and Myriad up and running. When the RM launched, a CGS (medium profile) NM was launched along with it and shown as an 'Active Task' in the Myriad UI. Then I launched some large-profile and zero-profile NMs, which now show under 'Pending Tasks' since the (CGS default) NM is already running on the single-node cluster.
> Then I tried flexing down an NM from the Myriad UI. The active NM was flexed down, and all pending NMs started moving to staging tasks, where they remained stuck for a long time. Even after waiting more than 30 minutes, I don't see any active NM task, and all of the pending NM tasks are shown under 'Staging Tasks' only. (See the screenshot.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Updated] (MYRIAD-128) Issue with Flex down, Pending NMs stuck in staging and don't get to active task.

Posted by Swapnil Daingade <sw...@gmail.com>.
Hi Darin,

I have multiple important fixes that I made as part of the HA work.
I feel it will be a lot of work and very time-consuming to test each fix
independently, get it reviewed, and then rebase a big change like HA on top.

I am working on rebasing the HA change (I have to anyway) and should be able
to update the PR in an hour or two.
Most of the work has already been reviewed by multiple people; the only work
that still needs review is the fixes.
I would say let's see if we can get it all in. If it really becomes a
problem, I'll send out a separate pull request for each fix.

Regards
Swapnil



Re: [jira] [Updated] (MYRIAD-128) Issue with Flex down, Pending NMs stuck in staging and don't get to active task.

Posted by Darin Johnson <db...@gmail.com>.
If you have a fix, let's do a separate PR for it and then rebase the HA PR
on top of it. This will make it easier to reason about the code in each PR
and get the bug fix in quicker.

Re: [jira] [Updated] (MYRIAD-128) Issue with Flex down, Pending NMs stuck in staging and don't get to active task.

Posted by Jim Klucar <kl...@gmail.com>.
I haven't written any code at all, just dug in and read the code a bit, so go
right ahead.


Re: [jira] [Updated] (MYRIAD-128) Issue with Flex down, Pending NMs stuck in staging and don't get to active task.

Posted by Swapnil Daingade <sw...@gmail.com>.
Hi Jim,

I was also working on this as part of the review comments I received
for the Myriad HA changes.
Are you too far along in fixing this? If not, I can send out an updated
pull request including it by end of day today.

Regards
Swapnil



Re: [jira] [Updated] (MYRIAD-128) Issue with Flex down, Pending NMs stuck in staging and don't get to active task.

Posted by Jim Klucar <kl...@gmail.com>.
I took a brief look at this and have an idea about what could be going on.
Basically, the SchedulerState class isn't thread-safe. There is a lot of
adding and removing of tasks from the various sets (pending, staging, etc.),
none of which is thread-safe. Short of adding synchronization and locks,
perhaps we could use a concurrent hash map keyed by taskId, with a new enum
representing each task's state.
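
A minimal sketch of that idea, assuming a standalone class (the names
`SchedulerStateSketch` and `TaskState` are illustrative here, not Myriad's
actual API): one ConcurrentHashMap from task id to a state enum replaces the
separate pending/staging/active sets, and state transitions become atomic
compare-and-set operations.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SchedulerStateSketch {

    enum TaskState { PENDING, STAGING, ACTIVE, KILLABLE }

    // Single source of truth: task id -> state, instead of several sets
    // that must be mutated in lockstep from multiple threads.
    private final Map<String, TaskState> tasks = new ConcurrentHashMap<>();

    public void makeTaskPending(String taskId) {
        tasks.put(taskId, TaskState.PENDING);
    }

    // Atomic compare-and-set transition: only succeeds if the task is still
    // PENDING, so a concurrent flex-down that already removed the task
    // cannot accidentally be resurrected into STAGING.
    public boolean makeTaskStaging(String taskId) {
        return tasks.replace(taskId, TaskState.PENDING, TaskState.STAGING);
    }

    public boolean makeTaskActive(String taskId) {
        return tasks.replace(taskId, TaskState.STAGING, TaskState.ACTIVE);
    }

    public void removeTask(String taskId) {
        tasks.remove(taskId);
    }

    // Weakly consistent snapshot for UI listings like "Staging Tasks".
    public Set<String> getTasksIn(TaskState state) {
        Set<String> out = new HashSet<>();
        tasks.forEach((id, s) -> { if (s == state) out.add(id); });
        return out;
    }
}
```

The compare-and-set transitions would also surface bugs like the one reported
here: an attempted PENDING -> STAGING move for a task that was concurrently
flexed down simply returns false instead of leaving a stray entry behind.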

On Mon, Aug 24, 2015 at 8:58 PM, Sarjeet Singh (JIRA) <ji...@apache.org>
wrote:
