Posted to dev@slider.apache.org by Frank Kemmer <fr...@1und1.de> on 2017/08/01 08:16:20 UTC

Re: Presto on Yarn using Slider

Does slider only work with the capacity scheduler? We are using the fair scheduler …

--

On 31.07.17 at 18:58, "Billie Rinaldi" <bi...@gmail.com> wrote:

    It looks like it's not very obvious from the INFO log, so maybe changing
    the RM to DEBUG would help. Below is what my RM log looks like between an
    app transitioning to RUNNING and having its first container allocated, and
    I don't see a specific log line about a request being received. Still,
    being able to start an app the first time and not being able to restart it
    is strange. I would check whether the app processes are actually getting
    stopped (maybe if they are still using resources for some reason, that
    would interfere with new allocations on the same hosts). Another thing I am
    wondering is whether the host names that Slider is using for requests match
    the host names the RM is using for NMs. If Slider requested a host that the
    RM didn't know about, that could cause containers not to be allocated.
    
    2017-07-28 15:43:36,219 INFO rmapp.RMAppImpl:
    application_1501255339386_0002 State change from ACCEPTED to RUNNING on
    event = ATTEMPT_REGISTERED
    2017-07-28 15:43:36,865 INFO allocator.AbstractContainerAllocator:
    assignedContainer application attempt=appattempt_1501255339386_0002_000001
    container=null
    queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@4b10a7b0
    clusterResource=<memory:8192, vCores:8> type=OFF_SWITCH requestedPartition=
    2017-07-28 15:43:36,866 INFO capacity.ParentQueue: assignedContainer
    queue=root usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:1024,
    vCores:1> cluster=<memory:8192, vCores:8>
    2017-07-28 15:43:36,866 INFO rmcontainer.RMContainerImpl:
    container_e02_1501255339386_0002_01_000002 Container Transitioned from NEW
    to ALLOCATED
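
    If you want to compare the two, one quick check (assuming the yarn CLI is
    available on a gateway host; the slider.log path and grep pattern below
    are just examples) is to list the NMs registered with the RM and compare
    them with the hosts the Slider AM is requesting:

        # NodeManagers the RM knows about, including their states
        yarn node -list -all

        # hosts the Slider AM asked containers for (pattern from its own log)
        grep "Submitting request for container on" slider.log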
    
    
    On Mon, Jul 31, 2017 at 9:18 AM, Frank Kemmer <fr...@1und1.de> wrote:
    
    > Hi Billie,
    >
    > do you have any ideas how I can find out why the yarn resource manager
    > is denying the requests … do I need to set the yarn log to debug mode to
    > see such requests? In my opinion a denied request should result in a
    > warning or at least in an info … in the yarn log I can see some denied
    > reservations, but never for requests from the slider AM …
    >
    > I will ask our IT to set the debug flag for yarn and try to find out why
    > yarn says no to the placement. My feeling is that yarn does not allow
    > placements on hosts … but then I am really lost … ;)
    >
    > > On 31.07.2017 at 17:59, Billie Rinaldi <billie.rinaldi@gmail.com> wrote:
    > >
    > > Hi Frank,
    > >
    > >> My understanding here: Our yarn cannot satisfy anti-affinity ... despite
    > > cloudera saying they made backports from even hadoop 3.0.0 ...
    > > Anti-affinity is not implemented in YARN at this time; it's implemented
    > > entirely in Slider using its role history, so this should not be an
    > issue.
    > >
    > >> they are all using a common path on the host
    > > This may be a problem. Slider is not designed to support apps that store
    > > data locally; app components might be brought up on any server. (See
    > > http://slider.incubator.apache.org/docs/slider_specs/
    > application_needs.html
    > > for more information on Slider's expectations about apps.) You could get
    > > around this somewhat by using strict placement (policy 1), so that
    > > components will be restarted on the same nodes they were started
    > > previously. But if your component also needs anti-affinity, Slider does
    > not
    > > have a way to combine anti-affinity and strict placement (first spread
    > the
    > > components out, then after that only start them on the same nodes where
    > > they were started previously). I have speculated whether it would be
    > > possible to configure an app for anti-affinity for the first start, then
    > > change it to strict, but I have not attempted this.
    > >
    > >> To me this looks nice, but I never see a placement request at our yarn
    > > resource manager node. It looks like slider does decide on its own, that
    > it
    > > cannot place the requests.
    > > This is strange. You should see the container request in the RM. Slider
    > > does not decide on its own; the "requests unsatisfiable" message only
    > means
    > > that a request has been sent to the RM and the RM has not allocated a
    > > container.
    > >
    > >
    > > On Fri, Jul 28, 2017 at 7:43 AM, Frank Kemmer <fr...@1und1.de>
    > wrote:
    > >
    > >> Greetings to all ... and sorry for the long post ...
    > >>
    > >> I am trying to deploy presto on our hadoop cluster via slider. Slider
    > >> really looks very
    > >> promising for this task but somehow I am experiencing some glitches.
    > >>
    > >> I am using the following components:
    > >>
    > >>   - java version "1.8.0_131" (oracle jdk)
    > >>   - Cloudera CDH 5.11.1
    > >>   - slider-0.90.0
    > >>   - presto-yarn https://github.com/prestodb/presto-yarn
    > >>
    > >> I can do:
    > >>
    > >>   1. slider package install --name PRESTO --package
    > >> ./presto-yarn-package-1.6-SNAPSHOT-0.167.zip
    > >>   2. slider create presto --template ./appConfig.json --resources
    > >> ./resources.json
    > >>
    > >> The last statement sometimes succeeds in bringing up presto and
    > >> everything looks fine,
    > >> exports are right and I can access the presto coordinator by the
    > exported
    > >> host_port.
    > >>
    > >> Then when I stop the slider cluster and start it again, the slider AM
    > >> comes up, and says
    > >> that the placement requests are unsatisfiable by the cluster.
    > >>
    > >> I experimented with changing the placement policy
    > >> "yarn.component.placement.policy".
    > >>
    > >> Setting: yarn.component.placement.policy": "1"
    > >>
    > >> Result: This works only when there is no history file, and even then
    > >> only sometimes, when
    > >> the requested containers are on different hosts. Sometimes slider can
    > >> place the containers
    > >> on different hosts and everything is fine, sometimes not; then it fails
    > >> with unstable
    > >> application, since it tries to start two or more presto components on
    > one
    > >> host ...
    > >>
    > >> My understanding here: presto would need anti-affinity, even between
    > >> the different component types, i.e. roles COORDINATOR and WORKER are
    > >> never allowed to run on the same host ... they are all using a common
    > >> path on the host, so only one presto component can run on one host ...
    > >> but yarn is not able to guarantee that ...
    > >>
    > >> Setting: yarn.component.placement.policy": "4"
    > >>
    > >> Result: The slider AM starts up and says:
    > >>
    > >>   Diagnostics: 2 anti-affinity components have with requests
    > >> unsatisfiable by cluster
    > >>
    > >> My understanding here: Our yarn cannot satisfy anti-affinity ... despite
    > >> cloudera saying
    > >> they made backports from even hadoop 3.0.0 ... I don't know how to
    > check
    > >> that ...
    > >>
    > >> Then I had an idea, went back to
    > >>
    > >> Setting: yarn.component.placement.policy": "1"
    > >>
    > >> This fails first. Then I edited the history.json file to place each
    > >> component on a different host and
    > >> hoped that would fix it. But even here the slider AM says that the
    > >> requests are unsatisfiable
    > >> by the cluster.
    > >>
    > >> I switched on debug for slider and yarn and found the following in the
    > >> slider.log of the slider AM:
    > >>
    > >> 2017-07-26 18:20:57,560 [AmExecutor-006] INFO
    > appmaster.SliderAppMaster -
    > >> Registered service under /users/pentaho/services/org-
    > apache-slider/presto;
    > >> absolute path /registry/users/pentaho/services/org-apache-slider/presto
    > >> 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor -
    > >> Completed org.apache.slider.server.appmaster.actions.
    > >> ActionRegisterServiceInstance@35b17c06 name='
    > >> ActionRegisterServiceInstance', delay=0, attrs=0, sequenceNumber=5}
    > >> 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor -
    > >> Executing org.apache.slider.server.appmaster.actions.
    > >> ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0,
    > >> attrs=4, sequenceNumber=6}
    > >> 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG
    > appmaster.SliderAppMaster -
    > >> in executeNodeReview(flexCluster)
    > >> 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG state.AppState - in
    > >> reviewRequestAndReleaseNodes()
    > >> 2017-07-26 18:20:57,566 [AmExecutor-006] INFO  state.AppState -
    > Reviewing
    > >> RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1,
    > >> actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0,
    > >> completed=0, totalRequested=0, preempted=0, nodeFailed=0,
    > failedRecently=0,
    > >> limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>,
    > >> isAntiAffinePlacement=false, failureMessage='',
    > providerRole=ProviderRole{name='COORDINATOR',
    > >> group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3,
    > >> placementTimeoutSeconds=300, labelExpression='null'},
    > failedContainers=[]} :
    > >> 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.AppState - Expected
    > >> 1, Delta: 1
    > >> 2017-07-26 18:20:57,566 [AmExecutor-006] INFO  state.AppState -
    > >> COORDINATOR: Asking for 1 more nodes(s) for a total of 1
    > >> 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.RoleHistory - There
    > >> are 1 node(s) to consider for COORDINATOR
    > >> 2017-07-26 18:20:57,567 [AmExecutor-006] INFO  state.OutstandingRequest
    > -
    > >> Submitting request for container on hadoop-worknode04.our.net
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState -
    > Container
    > >> ask is Capability[<memory:64512, vCores:1>]Priority[1] and label = null
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState -
    > operations
    > >> scheduled: 1; updated role: RoleStatus{name='COORDINATOR',
    > >> group=COORDINATOR, key=1, desired=1, actual=0, requested=1, releasing=0,
    > >> failed=0, startFailed=0, started=0, completed=0, totalRequested=1,
    > >> preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0,
    > >> resourceRequirements=<memory:64512, vCores:1>,
    > >> isAntiAffinePlacement=false, failureMessage='',
    > providerRole=ProviderRole{name='COORDINATOR',
    > >> group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3,
    > >> placementTimeoutSeconds=300, labelExpression='null'},
    > failedContainers=[]}
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState -
    > Reviewing
    > >> RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0,
    > >> requested=0, releasing=0, failed=0, startFailed=0, started=0,
    > completed=0,
    > >> totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0,
    > >> limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>,
    > >> isAntiAffinePlacement=false, failureMessage='',
    > providerRole=ProviderRole{name='WORKER',
    > >> group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3,
    > >> placementTimeoutSeconds=300, labelExpression='null'},
    > failedContainers=[]} :
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - Expected
    > >> 1, Delta: 1
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - WORKER:
    > >> Asking for 1 more nodes(s) for a total of 1
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.RoleHistory - There
    > >> are 4 node(s) to consider for WORKER
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.OutstandingRequest
    > -
    > >> Submitting request for container on hadoop-worknode01.our.net
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState -
    > Container
    > >> ask is Capability[<memory:64512, vCores:1>]Priority[2] and label = null
    > >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState -
    > operations
    > >> scheduled: 1; updated role: RoleStatus{name='WORKER', group=WORKER,
    > key=2,
    > >> desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0,
    > >> started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0,
    > >> failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512,
    > >> vCores:1>, isAntiAffinePlacement=false, failureMessage='',
    > >> providerRole=ProviderRole{name='WORKER', group=WORKER, id=2,
    > >> placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300,
    > >> labelExpression='null'}, failedContainers=[]}
    > >> 2017-07-26 18:20:57,570 [AmExecutor-006] INFO  util.RackResolver -
    > >> Resolved hadoop-worknode04.our.net to /default-rack
    > >> 2017-07-26 18:20:57,570 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > Added
    > >> priority=1
    > >> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > >> addResourceRequest: applicationId= priority=1 resourceName=hadoop-
    > >> worknode04.our.net numContainers=1 #asks=1
    > >> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > >> addResourceRequest: applicationId= priority=1 resourceName=/default-rack
    > >> numContainers=1 #asks=2
    > >> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > >> addResourceRequest: applicationId= priority=1 resourceName=*
    > >> numContainers=1 #asks=3
    > >> 2017-07-26 18:20:57,574 [AmExecutor-006] INFO  util.RackResolver -
    > >> Resolved hadoop-worknode01.our.net to /default-rack
    > >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > Added
    > >> priority=2
    > >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > >> addResourceRequest: applicationId= priority=2 resourceName=hadoop-
    > >> worknode01.our.net numContainers=1 #asks=4
    > >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > >> addResourceRequest: applicationId= priority=2 resourceName=/default-rack
    > >> numContainers=1 #asks=5
    > >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
    > >> addResourceRequest: applicationId= priority=2 resourceName=*
    > >> numContainers=1 #asks=6
    > >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG actions.QueueExecutor -
    > >> Completed org.apache.slider.server.appmaster.actions.
    > >> ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0,
    > >> attrs=4, sequenceNumber=6}
    > >> 2017-07-26 18:21:22,803 [1708490318@qtp-1547994163-0] INFO
    > >> state.AppState - app state clusterNodes {slider-appmaster={container_
    > >> e26_1500976506429_0025_01_000001=container_e26_
    > 1500976506429_0025_01_000001:
    > >> 3
    > >> state: 3
    > >> role: slider-appmaster
    > >> host: hadoop-worknode04.our.net
    > >> hostURL: http://hadoop-worknode04.our.net:52967
    > >> }}
    > >>
    > >> To me this looks nice, but I never see a placement request at our yarn
    > >> resource manager node.
    > >> It looks like slider does decide on its own, that it cannot place the
    > >> requests. But I cannot
    > >> see why it cannot place the requests or why the requests cannot be
    > >> satisfied.
    > >>
    > >> And now I am out of any ideas ... do you have any suggestions how I can
    > >> find out why the
    > >> slider AM does not place the requests?
    > >>
    > >> Any ideas are welcome.
    > >>
    > >> By the way, the cluster has enough resources as the scheduler page
    > shows:
    > >>
    > >>   Instantaneous Fair Share:    <memory:1441792, vCores:224>
    > >>
    > >> Thanks for any help in advance.
    >
    >
    


Re: Presto on Yarn using Slider

Posted by Frank Kemmer <fr...@1und1.de>.
Solved: Here is the solution ...

Thanks Billie … you set me on the right track for looking into the problem … at least the diagnostics given by slider, saying that the resource requests are unsatisfiable, helped a lot.

The cloudera config setup made by the CM is wrong and resolves all hosts to ‘/default-rack’, while the rack actually configured for the hadoop nodes was ‘/default’. So for every resource request the Slider App Master sent the wrong rack name, because CM had configured it incorrectly.

The problem is that the Cloudera Manager (CM) uses tons of Hadoop configuration directories and a slightly different way to resolve hosts to racks ... AND IT NEVER USES THE HADOOP_CONF_DIR you set ...

When you look at the standard topology settings in core-site.xml of the client configuration that the CM provides, you see the following:

net.topology.impl: org.apache.hadoop.net.NetworkTopology
net.topology.node.switch.mapping.impl: org.apache.hadoop.net.ScriptBasedMapping
net.topology.script.file.name: 
net.topology.script.number.args: 100
net.topology.table.file.name:

You can easily see that they configure ScriptBasedMapping but don’t provide a script … in this case the rack always resolves to ‘/default-rack’, as explained in the hadoop docs.
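
For reference, a minimal topology script can be as simple as the following (the path and the rack name are just examples of what we ended up using; Hadoop's ScriptBasedMapping passes one or more host names or IPs as arguments and expects one rack name per argument on stdout):

#!/bin/bash
# /etc/hadoop/conf/topology.sh (example path)
# Print one rack name per host name/IP passed on the command line.
for node in "$@"; do
  echo "/default"
done

Calling it by hand should then resolve the hosts to the expected rack:

$ /etc/hadoop/conf/topology.sh hadoop-worknode01.our.net
/default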

Okay, we asked the IT department to configure the correct topology script, and now it starts to get horrible ...

We downloaded the new client configuration from the CM. Everything looked fine: the topology script was defined, and calling the script resolved hostnames to the correct rack '/default'.

But the Slider App Master still said the requests were unsatisfiable and still resolved to '/default-rack'. We couldn't believe our eyes, since the topology settings looked good.

Then I found this in the Slider App Master logs:

2017-08-02 14:20:05,066 [main] INFO  appmaster.SliderAppMaster - System env HADOOP_CONF_DIR=/run/cloudera-scm-agent/process/2390-yarn-NODEMANAGER

What is that? Of course this damn CM knows better which configuration to inject ... and of course nobody is allowed to look at that config, at least we were not allowed ... 
So we asked the IT department to check the topology settings in core-site.xml in that directory, and they were STILL wrong ... and no matter what you configured in CM, it didn't change.
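
For reference, a quick way to check what actually ends up in such a process directory (assuming you get read access; the directory name is just the one from the log line above) is something like:

grep -A 2 "net.topology" /run/cloudera-scm-agent/process/2390-yarn-NODEMANAGER/core-site.xml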

Finally we found a way to set the right topology properties via CM:

Goto "YARN Service Advanced Configuration Snippet (Safety Valve) for core-site.xml" in Cloudera Manager and insert the topology settings there. Cloudera thinks topology settings are only for HDFS and NOT for Yarn ... oh my god ... after this change we finally had the correct topology settings in this damn CM shadow hadoop configuration directory ... AND slider resolved the correct rack and worked as intended and could place its containers on the requested resources ...

I think it is a pity that the Yarn ResourceManager logs don't show rejected resource requests, even in DEBUG mode, and give no reason why a request was rejected ... in my opinion the Yarn ResourceManager should log denied resource requests as a WARNING or at least as an INFO ... I think denied requests are of much more interest and importance than accepted requests.

Sorry guys for all the inconvenience and thanks for your help. I hope you get slider integrated into hadoop as a standard tool ... cheerio

--

On 01.08.17 at 15:47, "Billie Rinaldi" <bi...@gmail.com> wrote:

    I don't think it would matter which scheduler you are using.
 


Re: Presto on Yarn using Slider

Posted by Billie Rinaldi <bi...@gmail.com>.
I don't think it would matter which scheduler you are using.

On Tue, Aug 1, 2017 at 1:16 AM, Frank Kemmer <fr...@1und1.de> wrote:

> Does slider only work with the capacity scheduler? We are using the fair
> scheduler …