Posted to dev@slider.apache.org by Manoj Samel <ma...@gmail.com> on 2016/07/25 19:25:24 UTC

Slider AM fails to run when RM in HA setup fails over

Setup

- Hadoop 2.6 with RM HA, Kerberos enabled
- Slider 0.80
- In my slider-client.xml, I have added all RM HA properties, including the
ones mentioned in http://markmail.org/message/wnhpp2zn6ixo65e3.
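
For context, the client-side RM HA properties being referred to look roughly
like the fragment below in slider-client.xml (a sketch only - host names are
placeholders, not the actual cluster values):

```xml
<!-- Sketch of client-side RM HA settings; host names are placeholders. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1-host.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2-host.example.com</value>
</property>
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
```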

Following is the issue

* rm1 is active, rm2 is standby
* deploy and start slider application, it runs fine
* restart rm1, rm2 is now active.
* The slider-am now goes from RUNNING to "ACCEPTED" state. It stays there
until rm1 is made active again.

In the slider-am log, it tries to connect to RM2 and the connection fails
due to org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]. See the detailed log below.

It seems it has some token (delegation token?) for RM1 but tries to use the
same one for RM2 and fails. Am I missing some configuration?

Thanks,



2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
 client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
 security.UserGroupInformation - PriviledgedActionException as:abc@XYZ
(auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException:
Client cannot authenticate via:[TOKEN]
2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client -
Exception encountered while connecting to the server :
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]
2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
 security.UserGroupInformation - PriviledgedActionException
as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]
2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
 retry.RetryInvocationHandler - Exception while invoking allocate of class
ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over
attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM
HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
        at org.apache.hadoop.ipc.Client.call(Client.java:1476)
        at org.apache.hadoop.ipc.Client.call(Client.java:1403)
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
        at com.sun.proxy.$Proxy23.allocate(Unknown Source)
        at
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
        at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
        at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy24.allocate(Unknown Source)
        at
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
        at
org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]
        at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
        at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
        at
org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
        at org.apache.hadoop.ipc.Client.call(Client.java:1442)
        ... 12 more
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN]
        at
org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
        at
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
        at
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
        at
org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
        ... 15 more
2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
 client.ConfiguredRMFailoverProxyProvider - Failing over to rm1

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Hello,

I have uploaded requested logs, configurations and my observations on the
logs etc. to https://issues.apache.org/jira/browse/SLIDER-1158.

I would greatly appreciate it if someone could take a look at the Slider
ticket and provide pointers on what could be leading to the observed
behavior.

Thanks in advance,

Manoj

On Thu, Jul 28, 2016 at 7:01 PM, Manoj Samel <ma...@gmail.com>
wrote:

> Hi Gour,
>
> I added properties in /etc/hadoop/conf/yarn-site.xml and emptied the
> /data/slider/conf/slider-client.xml and restarted both RMs.
>
>    - hadoop.registry.zk.quorum
>    - hadoop.registry.zk.root
>    - slider.yarn.queue
>
> Now there are no issues in creating or destroying cluster. This helps as
> it keeps all configs in one location - thanks for the update.
>
>  I am still hitting the original issue - starting the application with RM1
> active and then failing over from RM1 to RM2 leads to the slider-AM getting
> Client cannot authenticate via:[TOKEN] errors.
>
> I will upload the config files soon ...
>
> Thanks,
>
> On Thu, Jul 28, 2016 at 5:28 PM, Manoj Samel <ma...@gmail.com>
> wrote:
>
>> Thanks. I will test with the updated config and then upload the latest
>> ones ...
>>
>> Thanks,
>>
>> Manoj
>>
>> On Thu, Jul 28, 2016 at 5:21 PM, Gour Saha <gs...@hortonworks.com> wrote:
>>
>>> slider.zookeeper.quorum is deprecated and should not be used.
>>> hadoop.registry.zk.quorum is used instead and is typically defined in
>>> yarn-site.xml. So is hadoop.registry.zk.root.
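
As a rough sketch, the registry entries being described would sit in
yarn-site.xml like this (the quorum hosts and root path below are
placeholders, not values from this cluster):

```xml
<!-- Sketch: ZK registry settings in yarn-site.xml; values are placeholders. -->
<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>hadoop.registry.zk.root</name>
  <value>/registry</value>
</property>
```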
>>>
>>> It is not encouraged to specify slider.yarn.queue at the cluster config
>>> level. Ideally it is best to specify the queue during the application
>>> submission. So you can use --queue option with slider create cmd. You can
>>> also set on the command line using -D slider.yarn.queue=<> during the
>>> create call. If indeed all slider apps should go to one and only one
>>> queue, then this prop can be specified in any one of the existing site
>>> xml
>>> files under /etc/hadoop/conf.
>>>
>>> -Gour
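
As a concrete sketch of the two submission-time options mentioned above
(the application and queue names here are made up, and this is not runnable
without a Slider install):

```shell
# Pass the queue with the create command's own option:
slider create myapp --queue myqueue

# Or equivalently as a property definition on the command line:
slider create myapp -D slider.yarn.queue=myqueue
```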
>>>
>>> On 7/28/16, 4:43 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>>
>>> >Following slider specific properties are at present added in
>>> >/data/slider/conf/slider-client.xml. If you think they should be picked
>>> up
>>> >from HADOOP_CONF_DIR (/etc/hadoop/conf) file, which file in
>>> >HADOOP_CONF_DIR
>>> >should these be added ?
>>> >
>>> >   - slider.zookeeper.quorum
>>> >   - hadoop.registry.zk.quorum
>>> >   - hadoop.registry.zk.root
>>> >   - slider.yarn.queue
>>> >
>>> >
>>> >On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha <gs...@hortonworks.com>
>>> wrote:
>>> >
>>> >> That is strange, since it is indeed not required to contain anything in
>>> >> slider-client.xml (except <configuration></configuration>) if
>>> >> HADOOP_CONF_DIR has everything that Slider needs. This probably gives an
>>> >> indication that there might be some issue with the cluster configuration
>>> >> based on files solely under HADOOP_CONF_DIR to begin with.
>>> >>
>>> >> I suggest you upload all the config files to the jira to help debug this
>>> >> further.
>>> >>
>>> >> -Gour
>>> >>
>>> >> On 7/28/16, 4:27 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>> >>
>>> >> >Thanks Gour for prompt reply
>>> >> >
>>> >> >BTW - Creating an empty slider-client.xml (with just
>>> >> ><configuration></configuration>) does not work. The AM starts but
>>> >> >fails to create any components and shows errors like
>>> >> >
>>> >> >2016-07-28 23:18:46,018
>>> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>>> >> > zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
>>> >> >closing socket connection and attempting reconnect
>>> >> >java.net.ConnectException: Connection refused
>>> >> >        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> >> >        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>> >> >        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>> >> >        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>> >> >
>>> >> >Also, command "slider destroy <app>" fails with zookeeper errors ...
>>> >> >
>>> >> >I had to keep a minimal slider-client.xml. It does not have any RM
>>> info
>>> >> >etc. but does contain slider ZK related properties like
>>> >> >"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
>>> >> >"hadoop.registry.zk.root". I haven't yet distilled the absolute
>>> minimal
>>> >> >set
>>> >> >of properties required, but this should suffice for now. All RM /
>>> HDFS
>>> >> >properties will be read from HADOOP_CONF_DIR files.
>>> >> >
>>> >> >Let me know if this could cause any issues.
>>> >> >
>>> >> >On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com>
>>> >>wrote:
>>> >> >
>>> >> >> No need to copy any files. Pointing HADOOP_CONF_DIR to
>>> >>/etc/hadoop/conf
>>> >> >>is
>>> >> >> good.
>>> >> >>
>>> >> >> -Gour
>>> >> >>
>>> >> >> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> >Follow up question regarding Gour's comment in earlier thread -
>>> >> >> >
>>> >> >> >Slider is installed on one of the hadoop nodes. The SLIDER_HOME/conf
>>> >> >> >directory (say /data/slider/conf) is different from HADOOP_CONF_DIR
>>> >> >> >(/etc/hadoop/conf). Is it required/recommended that files in
>>> >> >> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and that the
>>> >> >> >slider-env.sh script set HADOOP_CONF_DIR to /data/slider/conf ?
>>> >> >> >
>>> >> >> >Or can slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf,
>>> >> >> >without copying the files ?
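
For concreteness, the second option would be just a one-line setting in
slider-env.sh (a sketch; the path is the one discussed in this thread):

```shell
# slider-env.sh: reuse the cluster's Hadoop configuration in place,
# instead of copying it under SLIDER_HOME/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
```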
>>> >> >> >
>>> >> >> >Using slider 0.80 for now, but would like to know the recommendation
>>> >> >> >for this and future versions as well.
>>> >> >> >
>>> >> >> >Thanks in advance,
>>> >> >> >
>>> >> >> >Manoj
>>> >> >> >
>>> >> >> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel
>>> >><manojsameltech@gmail.com
>>> >> >
>>> >> >> >wrote:
>>> >> >> >
>>> >> >> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with
>>> logs
>>> >> and
>>> >> >> my
>>> >> >> >> analysis of logs.
>>> >> >> >>
>>> >> >> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha
>>> >><gs...@hortonworks.com>
>>> >> >> >>wrote:
>>> >> >> >>
>>> >> >> >>> Please file a JIRA and upload the logs to it.
>>> >> >> >>>
>>> >> >> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com>
>>> >> >>wrote:
>>> >> >> >>>
>>> >> >> >>> >Hi Gour,
>>> >> >> >>> >
>>> >> >> >>> >Can you please reach me using your own email-id? I will then
>>> >>send
>>> >> >> >>>logs to
>>> >> >> >>> >you, along with my analysis - I don't want to send logs on
>>> >>public
>>> >> >>list
>>> >> >> >>> >
>>> >> >> >>> >Thanks,
>>> >> >> >>> >
>>> >> >> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha
>>> >><gs...@hortonworks.com>
>>> >> >> >>> wrote:
>>> >> >> >>> >
>>> >> >> >>> >> Ok, so this node is not a gateway. It is part of the cluster,
>>> >> >> >>> >> which means you don't need slider-client.xml at all. Just have
>>> >> >> >>> >> HADOOP_CONF_DIR pointing to /etc/hadoop/conf in slider-env.sh
>>> >> >> >>> >> and that should be it.
>>> >> >> >>> >>
>>> >> >> >>> >> So the above simplifies your config setup. It will not solve
>>> >> >> >>> >> either of the 2 problems you are facing.
>>> >> >> >>> >>
>>> >> >> >>> >> Now coming to the 2 issues you are facing, you have to provide
>>> >> >> >>> >> additional logs for us to understand better. Let's start with -
>>> >> >> >>> >> 1. RM logs (specifically between the time when rm1->rm2
>>> >> >> >>> >> failover is simulated)
>>> >> >> >>> >> 2. Slider App logs
>>> >> >> >>> >>
>>> >> >> >>> >> -Gour
>>> >> >> >>> >>
>>> >> >> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <
>>> manojsameltech@gmail.com>
>>> >> >> wrote:
>>> >> >> >>> >>
>>> >> >> >>> >> >   1. Not clear about your question on "gateway" node. The
>>> >> >> >>> >> >   node running slider is part of the hadoop cluster and there
>>> >> >> >>> >> >   are other services like Oozie that run on this node that
>>> >> >> >>> >> >   utilize hdfs and yarn. So if your question is whether the
>>> >> >> >>> >> >   node is otherwise working for HDFS and Yarn configuration,
>>> >> >> >>> >> >   it is working
>>> >> >> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
>>> >> >> >>> >> >   /etc/hadoop/conf) to the directory containing
>>> >> >> >>> >> >   slider-client.xml (say /data/latest/conf)
>>> >> >> >>> >> >   3. In an earlier email, I had made a mistake where the
>>> >> >> >>> >> >   slider-env.sh file HADOOP_CONF_DIR was pointing to the
>>> >> >> >>> >> >   original directory /etc/hadoop/conf. I edited it to point
>>> >> >> >>> >> >   to the same directory containing slider-client.xml &
>>> >> >> >>> >> >   slider-env.sh i.e. /data/latest/conf
>>> >> >> >>> >> >   4. I emptied slider-client.xml. It just had the
>>> >> >> >>> >> >   <configuration></configuration>. The creation of spas
>>> >> >> >>> >> >   worked but the Slider AM still shows the same issue. i.e.
>>> >> >> >>> >> >   when RM1 goes from active to standby, slider AM goes from
>>> >> >> >>> >> >   RUNNING to ACCEPTED state with the same error about TOKEN.
>>> >> >> >>> >> >   Also NOTE that when slider-client.xml is empty, the
>>> >> >> >>> >> >   "slider destroy xxx" command still fails with Zookeeper
>>> >> >> >>> >> >   connection errors.
>>> >> >> >>> >> >   5. I then added the same parameters (as my last email -
>>> >> >> >>> >> >   except HADOOP_CONF_DIR) to slider-client.xml and ran. This
>>> >> >> >>> >> >   time slider-env.sh has HADOOP_CONF_DIR pointing to
>>> >> >> >>> >> >   /data/latest/conf and slider-client.xml does not have
>>> >> >> >>> >> >   HADOOP_CONF_DIR. The same issue exists (but "slider
>>> >> >> >>> >> >   destroy" does not fail)
>>> >> >> >>> >> >   6. Could you explain what you expect to pick up from the
>>> >> >> >>> >> >   Hadoop configurations that will help with the RM Token ? If
>>> >> >> >>> >> >   slider has a token from RM1, and it switches to RM2, it is
>>> >> >> >>> >> >   not clear what slider does to get a delegation token for
>>> >> >> >>> >> >   RM2 communication ?
>>> >> >> >>> >> >   7. It is worth repeating again that the issue happens only
>>> >> >> >>> >> >   when RM1 was active when the slider app was created and
>>> >> >> >>> >> >   then RM1 becomes standby. If RM2 was active when the slider
>>> >> >> >>> >> >   app was created, then the slider AM keeps running for any
>>> >> >> >>> >> >   number of switches between RM2 and RM1 back and forth ...
>>> >> >> >>> >> >
>>> >> >> >>> >> >
>>> >> >> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha
>>> >> >><gs...@hortonworks.com>
>>> >> >> >>> >>wrote:
>>> >> >> >>> >> >
>>> >> >> >>> >> >> The node you are running slider from, is that a gateway
>>> >> >> >>> >> >> node? Sorry for not being explicit. I meant copy everything
>>> >> >> >>> >> >> under /etc/hadoop/conf from your cluster into some temp
>>> >> >> >>> >> >> directory (say /tmp/hadoop_conf) in your gateway node or
>>> >> >> >>> >> >> local or whichever node you are running slider from. Then
>>> >> >> >>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything
>>> >> >> >>> >> >> out from slider-client.xml.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel"
>>> >><ma...@gmail.com>
>>> >> >> >>> wrote:
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> >Hi Gour,
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Thanks for your prompt reply.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >FYI, the issue happens when I create the slider app when
>>> >> >> >>> >> >> >rm1 is active and rm1 then fails over to rm2. As soon as
>>> >> >> >>> >> >> >rm2 becomes active, the slider AM goes from RUNNING to
>>> >> >> >>> >> >> >ACCEPTED state with the above error.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >For your suggestion, I did the following
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site
>>> >> >> >>> >> >> >from HADOOP_CONF_DIR to the slider conf directory.
>>> >> >> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
>>> >> >> >>> >> >> >3) I removed all properties from slider-client.xml EXCEPT
>>> >> >> >>> >> >> >the following
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >   - HADOOP_CONF_DIR
>>> >> >> >>> >> >> >   - slider.yarn.queue
>>> >> >> >>> >> >> >   - slider.zookeeper.quorum
>>> >> >> >>> >> >> >   - hadoop.registry.zk.quorum
>>> >> >> >>> >> >> >   - hadoop.registry.zk.root
>>> >> >> >>> >> >> >   - hadoop.security.authorization
>>> >> >> >>> >> >> >   - hadoop.security.authentication
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Then I made rm1 active, installed and created the slider
>>> >> >> >>> >> >> >app, and restarted rm1 (to make rm2 active). The slider-am
>>> >> >> >>> >> >> >again went from RUNNING to ACCEPTED state.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Let me know if you want me to try further changes.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >If I make the slider-client.xml completely empty per your
>>> >> >> >>> >> >> >suggestion, only the slider AM comes up but it fails to
>>> >> >> >>> >> >> >start components. The AM log shows errors trying to connect
>>> >> >> >>> >> >> >to zookeeper like below.
>>> >> >> >>> >> >> >2016-07-25 23:07:41,532
>>> >> >> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>>> >> >> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null,
>>> >> >> >>> >> >> >unexpected error, closing socket connection and attempting
>>> >> >> >>> >> >> >reconnect
>>> >> >> >>> >> >> >java.net.ConnectException: Connection refused
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Hence I kept minimal info in slider-client.xml
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >FYI This is slider version 0.80
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Thanks,
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Manoj
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
>>> >> >> >>><gs...@hortonworks.com>
>>> >> >> >>> >> >>wrote:
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >> If possible, can you copy the entire content of the
>>> >> >> >>> >> >> >> directory /etc/hadoop/conf and then set HADOOP_CONF_DIR
>>> >> >> >>> >> >> >> in slider-env.sh to it. Keep slider-client.xml empty.
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
>>> >> >> >>> >> >> >> failovers, do you see the same behaviors?
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> -Gour
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel"
>>> >> >><ma...@gmail.com>
>>> >> >> >>> >>wrote:
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> >Another observation (whatever it is worth)
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >If the slider app is created and started when rm2 was
>>> >> >> >>> >> >> >> >active, then it seems to survive switches between rm2
>>> >> >> >>> >> >> >> >and rm1 (and back). I.e
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >* rm2 is active
>>> >> >> >>> >> >> >> >* create and start slider application
>>> >> >> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
>>> >> >> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps
>>> running
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >So, it seems if it starts with rm1 active, then the AM
>>> >> >> >>> >> >> >> >goes to "ACCEPTED" state when RM fails over to rm2. If
>>> >> >> >>> >> >> >> >it starts with rm2 active, then it runs fine with any
>>> >> >> >>> >> >> >> >switches between rm1 and rm2.
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >Any feedback ?
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >Thanks,
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >Manoj
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>>> >> >> >>> >> >> >><ma...@gmail.com>
>>> >> >> >>> >> >> >> >wrote:
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >> [original message quoted in full; trimmed here as it
>>> >> >> >>> >> >> >> >> duplicates the first message in this thread]
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(
>>> >>>>>>>>>>>>>>>Cl
>>> >> >>>>>>>>>>>>>ie
>>> >> >> >>>>>>>>>>>nt.
>>> >> >> >>> >>>>>>>>ja
>>> >> >> >>> >> >>>>>>va
>>> >> >> >>> >> >> >>>>:5
>>> >> >> >>> >> >> >> >>55)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:37
>>> >>>>>>>>>0)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >>
>>> >> >> >>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >>
>>> >> >> >>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>java.security.AccessController.doPrivileged(Native
>>> >> >> >>> >> >>Method)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>javax.security.auth.Subject.doAs(Subject.java:422)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
>>> >>>>>>>>>>>>>>>up
>>> >> >>>>>>>>>>>>>In
>>> >> >> >>>>>>>>>>>for
>>> >> >> >>> >>>>>>>>ma
>>> >> >> >>> >> >>>>>>ti
>>> >> >> >>> >> >> >>>>on
>>> >> >> >>> >> >> >> >>.java:1671)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.
>>> >>>>>>>>>>>>>ja
>>> >> >>>>>>>>>>>va
>>> >> >> >>>>>>>>>:72
>>> >> >> >>> >>>>>>0)
>>> >> >> >>> >> >> >> >>         ... 15 more
>>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread]
>>> >>INFO
>>> >> >> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing
>>> >> >>over to
>>> >> >> >>> rm1
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>>
>>>
>>
>

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Hi,

I have uploaded the config files; hope these shed light on the TICKET
authentication issue.
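
In case it is useful without opening the attachments, the registry-related
part of what I moved into yarn-site.xml looks like the sketch below; the
hostnames and the root path are placeholders for my real values:

```xml
<!-- yarn-site.xml: registry settings moved out of slider-client.xml
     (hostnames and root path below are placeholders) -->
<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>hadoop.registry.zk.root</name>
  <value>/registry</value>
</property>
```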

As a side note - it seems commands like "slider list <app>
--containers" are now ** significantly ** slower (compared to when
slider-client.xml was not empty and had a few properties). The commands
sometimes take a minute on the same cluster where they used to take a few
seconds. Also, the first command executed after some inactivity takes a
long time, while the same command repeated immediately returns quickly.
The same is observed when the slider AM restarts (e.g. due to an upgrade).
This slowness was not present when slider-client.xml had config parameters
like the registry zookeepers and the RM address. Why would there be such a
difference for the first execution when all config is read from the
HADOOP_CONF_DIR files?
Following is the output of "slider list <xxx> --containers" executed twice.
Note the first run took almost a minute; the second was almost instantaneous.

[root@... ~]# slider list foo --containers
2016-07-29 23:30:35,197 [main] INFO  tools.SliderUtils - JVM initialized
into secure mode with kerberos realm xxx
2016-07-29 23:31:22,035 [main] INFO
 client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
2016-07-29 23:31:22,162 [main] INFO  util.ExitUtil - Exiting with status 0
foo                               RUNNING  application_1469834604094_0001
           http://xxx:23188/proxy/application_1469834604094_0001/
......
[root@... ~]# slider list foo --containers
2016-07-29 23:32:34,816 [main] INFO  tools.SliderUtils - JVM initialized
into secure mode with kerberos realm xxx
2016-07-29 23:32:35,775 [main] INFO
 client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
2016-07-29 23:32:35,896 [main] INFO  util.ExitUtil - Exiting with status 0
foo                               RUNNING  application_1469834604094_0001
           http://xxx:23188/proxy/application_1469834604094_0001/
..

Thanks,

On Thu, Jul 28, 2016 at 7:01 PM, Manoj Samel <ma...@gmail.com>
wrote:

> Hi Gour,
>
> I added properties in /etc/hadoop/conf/yarn-site.xml and emptied the
> /data/slider/conf/slider-client.xml and restarted both RMs.
>
>    - hadoop.registry.zk.quorum
>    - hadoop.registry.zk.root
>    - slider.yarn.queue
>
> Now there are no issues in creating or destroying cluster. This helps as
> it keeps all configs in one location - thanks for the update.
>
>  I am still hitting the original issue - Starting application with RM1
> active and then RM1 to RM2 fail over leads to slider-AM getting Client
> cannot authenticate via:[TOKEN] errors.
>
> I will upload the config files soon ...
>
> Thanks,
>
> On Thu, Jul 28, 2016 at 5:28 PM, Manoj Samel <ma...@gmail.com>
> wrote:
>
>> Thanks. I will test with the updated config and then upload the latest
>> ones ...
>>
>> Thanks,
>>
>> Manoj
>>
>> On Thu, Jul 28, 2016 at 5:21 PM, Gour Saha <gs...@hortonworks.com> wrote:
>>
>>> slider.zookeeper.quorum is deprecated and should not be used.
>>> hadoop.registry.zk.quorum is used instead and is typically defined in
>>> yarn-site.xml. So is hadoop.registry.zk.root.
>>>
>>> It is not encouraged to specify slider.yarn.queue at the cluster config
>>> level. Ideally it is best to specify the queue during the application
>>> submission. So you can use --queue option with slider create cmd. You can
>>> also set on the command line using -D slider.yarn.queue=<> during the
>>> create call. If indeed all slider apps should go to one and only one
>>> queue, then this prop can be specified in any one of the existing site
>>> xml
>>> files under /etc/hadoop/conf.
>>>
>>> -Gour
>>>
>>> On 7/28/16, 4:43 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>>
>>> >The following slider-specific properties are at present added in
>>> >/data/slider/conf/slider-client.xml. If you think they should be picked
>>> >up from the HADOOP_CONF_DIR (/etc/hadoop/conf) files, to which file in
>>> >HADOOP_CONF_DIR should these be added?
>>> >
>>> >   - slider.zookeeper.quorum
>>> >   - hadoop.registry.zk.quorum
>>> >   - hadoop.registry.zk.root
>>> >   - slider.yarn.queue
>>> >
>>> >
>>> >On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha <gs...@hortonworks.com>
>>> wrote:
>>> >
>>> >> That is strange, since it is indeed not required to contain anything
>>> in
>>> >> slider-client.xml (except <configuration></configuration>) if
>>> >> HADOOP_CONF_DIR has everything that Slider needs. This probably gives
>>> an
>>> >> indication that there might be some issue with cluster configuration
>>> >>based
>>> >> on files solely under HADOOP_CONF_DIR to begin with.
>>> >>
>>> >> Suggest you to upload all the config files to the jira to help debug
>>> >>this
>>> >> further.
>>> >>
>>> >> -Gour
>>> >>
>>> >> On 7/28/16, 4:27 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>> >>
>>> >> >Thanks Gour for prompt reply
>>> >> >
>>> >> >BTW - Creating an empty slider-client.xml (with just
>>> >> ><configuration></configuration>) does not work. The AM starts but
>>> >> >fails to create any components and shows errors like
>>> >> >
>>> >> >2016-07-28 23:18:46,018
>>> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>>> >> > zookeeper.ClientCnxn - Session 0x0 for server null, unexpected
>>> error,
>>> >> >closing socket connection and attempting reconnect
>>> >> >java.net.ConnectException: Connection refused
>>> >> >        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> >> >        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>> >> >        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>>> >> >        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>>> >> >
>>> >> >Also, command "slider destroy <app>" fails with zookeeper errors ...
>>> >> >
>>> >> >I had to keep a minimal slider-client.xml. It does not have any RM
>>> info
>>> >> >etc. but does contain slider ZK related properties like
>>> >> >"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
>>> >> >"hadoop.registry.zk.root". I haven't yet distilled the absolute
>>> minimal
>>> >> >set
>>> >> >of properties required, but this should suffice for now. All RM /
>>> HDFS
>>> >> >properties will be read from HADOOP_CONF_DIR files.
>>> >> >
>>> >> >Let me know if this could cause any issues.
>>> >> >
>>> >> >On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com>
>>> >>wrote:
>>> >> >
>>> >> >> No need to copy any files. Pointing HADOOP_CONF_DIR to
>>> >>/etc/hadoop/conf
>>> >> >>is
>>> >> >> good.
>>> >> >>
>>> >> >> -Gour
>>> >> >>
>>> >> >> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> >Follow up question regarding Gour's comment in earlier thread -
>>> >> >> >
>>> >> >> >Slider is installed on one of the hadoop nodes. SLIDER_HOME/conf
>>> >> >>directory
>>> >> >> >(say /data/slider/conf) is different than HADOOP_CONF_DIR
>>> >> >> >(/etc/hadoop/conf). Is it required/recommended that files in
>>> >> >> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and slider-env.sh
>>> >>script
>>> >> >> >sets
>>> >> >> >HADOOP_CONF_DIR to /data/slider/conf ?
>>> >> >> >
>>> >> >> >Or can the slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf ,
>>> >> >>without
>>> >> >> >copying the files ?
>>> >> >> >
>>> >> >> >Using slider .80 for now, but would like to know recommendation
>>> for
>>> >> >>this
>>> >> >> >and future versions as well.
>>> >> >> >
>>> >> >> >Thanks in advance,
>>> >> >> >
>>> >> >> >Manoj
>>> >> >> >
>>> >> >> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel
>>> >><manojsameltech@gmail.com
>>> >> >
>>> >> >> >wrote:
>>> >> >> >
>>> >> >> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with
>>> logs
>>> >> and
>>> >> >> my
>>> >> >> >> analysis of logs.
>>> >> >> >>
>>> >> >> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha
>>> >><gs...@hortonworks.com>
>>> >> >> >>wrote:
>>> >> >> >>
>>> >> >> >>> Please file a JIRA and upload the logs to it.
>>> >> >> >>>
>>> >> >> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com>
>>> >> >>wrote:
>>> >> >> >>>
>>> >> >> >>> >Hi Gour,
>>> >> >> >>> >
>>> >> >> >>> >Can you please reach me using your own email-id? I will then
>>> >>send
>>> >> >> >>>logs to
>>> >> >> >>> >you, along with my analysis - I don't want to send logs on
>>> >>public
>>> >> >>list
>>> >> >> >>> >
>>> >> >> >>> >Thanks,
>>> >> >> >>> >
>>> >> >> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha
>>> >><gs...@hortonworks.com>
>>> >> >> >>> wrote:
>>> >> >> >>> >
>>> >> >> >>> >> Ok, so this node is not a gateway. It is part of the
>>> cluster,
>>> >> >>which
>>> >> >> >>> >>means
>>> >> >> >>> >> you don't need slider-client.xml at all. Just have
>>> >> >>HADOOP_CONF_DIR
>>> >> >> >>> >> pointing to /etc/hadoop/conf in slider-env.sh and that
>>> should
>>> >>be
>>> >> >>it.
>>> >> >> >>> >>
>>> >> >> >>> >> So the above simplifies your config setup. It will not solve
>>> >> >>either
>>> >> >> >>>of
>>> >> >> >>> >>the
>>> >> >> >>> >> 2 problems you are facing.
>>> >> >> >>> >>
>>> >> >> >>> >> Now coming to the 2 issues you are facing, you have to
>>> provide
>>> >> >> >>> >>additional
>>> >> >> >>> >> logs for us to understand better. Let's start with  -
>>> >> >> >>> >> 1. RM logs (specifically between the time when rm1->rm2
>>> >>failover
>>> >> >>is
>>> >> >> >>> >> simulated)
>>> >> >> >>> >> 2. Slider App logs
>>> >> >> >>> >>
>>> >> >> >>> >> -Gour
>>> >> >> >>> >>
>>> >> >> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <
>>> manojsameltech@gmail.com>
>>> >> >> wrote:
>>> >> >> >>> >>
>>> >> >> >>> >> >   1. Not clear about your question on "gateway" node. The
>>> >>node
>>> >> >> >>> running
>>> >> >> >>> >> >   slider is part of the hadoop cluster and there are other
>>> >> >> >>>services
>>> >> >> >>> >>like
>>> >> >> >>> >> >   Oozie that run on this node that utilizes hdfs and yarn.
>>> >>So
>>> >> >>if
>>> >> >> >>>your
>>> >> >> >>> >> >   question is whether the node is otherwise working for
>>> HDFS
>>> >> >>and
>>> >> >> >>>Yarn
>>> >> >> >>> >> >   configuration, it is working
>>> >> >> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
>>> >> >> >>>/etc/hadoop/conf)
>>> >> >> >>> to
>>> >> >> >>> >> >the
>>> >> >> >>> >> >   directory containing slider-client.xml (say
>>> >> >>/data/latest/conf)
>>> >> >> >>> >> >   3. In earlier email, I had done a mistake where
>>> >>slider-env.sh
>>> >> >> >>>file
>>> >> >> >>> >> >HADOOP_CONF_DIR
>>> >> >> >>> >> >   was pointing to original directory /etc/hadoop/conf. I
>>> >>edited
>>> >> >> >>>it to
>>> >> >> >>> >> >   point to same directory containing slider-client.xml &
>>> >> >> >>> slider-env.sh
>>> >> >> >>> >> >i.e.
>>> >> >> >>> >> >   /data/latest/conf
>>> >> >> >>> >> >   4. I emptied slider-client.xml. It just had the
>>> >> >> >>> >> ><configuration></configuration>.
>>> >> >> >>> >> >   The creation of apps worked but the Slider AM still
>>> shows
>>> >>the
>>> >> >> >>>same
>>> >> >> >>> >> >issue.
>>> >> >> >>> >> >   i.e. when RM1 goes from active to standby, slider AM
>>> goes
>>> >> >>from
>>> >> >> >>> >>RUNNING
>>> >> >> >>> >> >to
>>> >> >> >>> >> >   ACCPTED state with same error about TOKEN. Also NOTE
>>> that
>>> >> >>when
>>> >> >> >>> >> >   slider-client.xml is empty, the "slider destroy xxx"
>>> >>command
>>> >> >> >>>still
>>> >> >> >>> >> >fails
>>> >> >> >>> >> >   with Zookeeper connection errors.
>>> >> >> >>> >> >   5. I then added same parameters (as my last email -
>>> except
>>> >> >> >>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
>>> >> >> >>> >>slider-env.sh
>>> >> >> >>> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
>>> >> >> >>> >>slider-client.xml
>>> >> >> >>> >> >   does not have HADOOP_CONF_DIR. The same issue exists
>>> (but
>>> >> >> >>>"slider
>>> >> >> >>> >> >   destroy" does not fails)
>>> >> >> >>> >> >   6. Could you explain what do you expect to pick up from
>>> >> >>Hadoop
>>> >> >> >>> >> >   configurations that will help you in RM Token ? If
>>> slider
>>> >>has
>>> >> >> >>>token
>>> >> >> >>> >> >from
>>> >> >> >>> >> >   RM1, and it switches to RM2, not clear what slider does
>>> to
>>> >> >>get
>>> >> >> >>> >> >delegation
>>> >> >> >>> >> >   token for RM2 communication ?
>>> >> >> >>> >> >   7. It is worth repeating again that issue happens only
>>> >>when
>>> >> >>RM1
>>> >> >> >>>was
>>> >> >> >>> >> >   active when slider app was created and then RM1 becomes
>>> >> >> >>>standby. If
>>> >> >> >>> >> >RM2 was
>>> >> >> >>> >> >   active when slider app was created, then slider AM keeps
>>> >> >>running
>>> >> >> >>> for
>>> >> >> >>> >> >any
>>> >> >> >>> >> >   number of switches between RM2 and RM1 back and forth
>>> ...
>>> >> >> >>> >> >
>>> >> >> >>> >> >
>>> >> >> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha
>>> >> >><gs...@hortonworks.com>
>>> >> >> >>> >>wrote:
>>> >> >> >>> >> >
>>> >> >> >>> >> >> The node you are running slider from, is that a gateway
>>> >>node?
>>> >> >> >>>Sorry
>>> >> >> >>> >>for
>>> >> >> >>> >> >> not being explicit. I meant copy everything under
>>> >> >> >>>/etc/hadoop/conf
>>> >> >> >>> >>from
>>> >> >> >>> >> >> your cluster into some temp directory (say
>>> >>/tmp/hadoop_conf)
>>> >> >>in
>>> >> >> >>>your
>>> >> >> >>> >> >> gateway node or local or whichever node you are running
>>> >>slider
>>> >> >> >>>from.
>>> >> >> >>> >> >>Then
>>> >> >> >>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear
>>> >>everything
>>> >> >>out
>>> >> >> >>> from
>>> >> >> >>> >> >> slider-client.xml.
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel"
>>> >><ma...@gmail.com>
>>> >> >> >>> wrote:
>>> >> >> >>> >> >>
>>> >> >> >>> >> >> >Hi Gour,
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Thanks for your prompt reply.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >FYI, issue happens when I create slider app when rm1 is
>>> >> >>active
>>> >> >> >>>and
>>> >> >> >>> >>when
>>> >> >> >>> >> >> >rm1
>>> >> >> >>> >> >> >fails over to rm2. As soon as rm2 becomes active; the
>>> >>slider
>>> >> >>AM
>>> >> >> >>> goes
>>> >> >> >>> >> >>from
>>> >> >> >>> >> >> >RUNNING to ACCEPTED state with above error.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >For your suggestion, I did following
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and
>>> mapred-site
>>> >> >>from
>>> >> >> >>> >> >> >HADOOP_CONF_DIR
>>> >> >> >>> >> >> >to slider conf directory.
>>> >> >> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
>>> >> >> >>> >> >> >3) I removed all properties from slider-client.xml
>>> EXCEPT
>>> >> >> >>>following
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >   - HADOOP_CONF_DIR
>>> >> >> >>> >> >> >   - slider.yarn.queue
>>> >> >> >>> >> >> >   - slider.zookeeper.quorum
>>> >> >> >>> >> >> >   - hadoop.registry.zk.quorum
>>> >> >> >>> >> >> >   - hadoop.registry.zk.root
>>> >> >> >>> >> >> >   - hadoop.security.authorization
>>> >> >> >>> >> >> >   - hadoop.security.authentication
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Then I made rm1 active, installed and created slider app
>>> >>and
>>> >> >> >>> >>restarted
>>> >> >> >>> >> >>rm1
>>> >> >> >>> >> >> >(to make rm2) active. The slider-am again went from
>>> >>RUNNING
>>> >> >>to
>>> >> >> >>> >>ACCEPTED
>>> >> >> >>> >> >> >state.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Let me know if you want me to try further changes.
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >If I make the slider-client.xml completely empty per
>>> your
>>> >> >> >>> >>suggestion,
>>> >> >> >>> >> >>only
>>> >> >> >>> >> >> >slider AM comes up but it
>>> >> >> >>> >> >> >fails to start components. The AM log shows errors
>>> trying
>>> >>to
>>> >> >> >>> >>connect to
>>> >> >> >>> >> >> >zookeeper like below.
>>> >> >> >>> >> >> >2016-07-25 23:07:41,532
>>> >> >> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)]
>>> >>WARN
>>> >> >> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null,
>>> >> >>unexpected
>>> >> >> >>> >>error,
>>> >> >> >>> >> >> >closing socket connection and attempting reconnect
>>> >> >> >>> >> >> >java.net.ConnectException: Connection refused
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Hence I kept minimal info in slider-client.xml
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >FYI This is slider version 0.80
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Thanks,
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >Manoj
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
>>> >> >> >>><gs...@hortonworks.com>
>>> >> >> >>> >> >>wrote:
>>> >> >> >>> >> >> >
>>> >> >> >>> >> >> >> If possible, can you copy the entire content of the
>>> >> >>directory
>>> >> >> >>> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in
>>> >> >> >>>slider-env.sh to
>>> >> >> >>> >>it.
>>> >> >> >>> >> >> >>Keep
>>> >> >> >>> >> >> >> slider-client.xml empty.
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
>>> >> >> >>>failovers,
>>> >> >> >>> >>do
>>> >> >> >>> >> >>you
>>> >> >> >>> >> >> >> see the same behaviors?
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> -Gour
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel"
>>> >> >><ma...@gmail.com>
>>> >> >> >>> >>wrote:
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >> >Another observation (whatever it is worth)
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >If slider app is created and started when rm2 was
>>> >>active,
>>> >> >> >>>then
>>> >> >> >>> it
>>> >> >> >>> >> >> >>seems to
>>> >> >> >>> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >* rm2 is active
>>> >> >> >>> >> >> >> >* create and start slider application
>>> >> >> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
>>> >> >> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps
>>> running
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >So, it seems if it starts with rm1 active, then the
>>> AM
>>> >> >>goes
>>> >> >> >>>to
>>> >> >> >>> >> >> >>"ACCEPTED"
>>> >> >> >>> >> >> >> >state when RM fails to rm2. If it starts with rm2
>>> >>active,
>>> >> >> >>>then
>>> >> >> >>> it
>>> >> >> >>> >> >>runs
>>> >> >> >>> >> >> >> >fine
>>> >> >> >>> >> >> >> >with any switches between rm1 and rm2.
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >Any feedback ?
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >Thanks,
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >Manoj
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>>> >> >> >>> >> >> >><ma...@gmail.com>
>>> >> >> >>> >> >> >> >wrote:
>>> >> >> >>> >> >> >> >
>>> >> >> >>> >> >> >> >> Setup
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
>>> >> >> >>> >> >> >> >> - Slider 0.80
>>> >> >> >>> >> >> >> >> - In my slider-client.xml, I have added all RM HA
>>> >> >> >>>properties,
>>> >> >> >>> >> >> >>including
>>> >> >> >>> >> >> >> >> the ones mentioned in
>>> >> >> >>> >> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >> Following is the issue
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >> * rm1 is active, rm2 is standby
>>> >> >> >>> >> >> >> >> * deploy and start slider application, it runs fine
>>> >> >> >>> >> >> >> >> * restart rm1, rm2 is now active.
>>> >> >> >>> >> >> >> >> * The slider-am now goes from running into
>>> "ACCEPTED"
>>> >> >> >>>mode. It
>>> >> >> >>> >> >>stays
>>> >> >> >>> >> >> >> >>there
>>> >> >> >>> >> >> >> >> till rm1 is made active again.
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >> In the slider-am log, it tries to connect to RM2
>>> and
>>> >> >> >>> connection
>>> >> >> >>> >> >>fails
>>> >> >> >>> >> >> >> >>due
>>> >> >> >>> >> >> >> >> to
>>> org.apache.hadoop.security.AccessControlException:
>>> >> >> >>>Client
>>> >> >> >>> >> >>cannot
>>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]. See detailed log below
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >>  It seems it has some token (delegation token?) for
>>> >>RM1
>>> >> >>but
>>> >> >> >>> >>tries
>>> >> >> >>> >> >>to
>>> >> >> >>> >> >> >>use
>>> >> >> >>> >> >> >> >> same(?) for RM2 and fails. Am I missing some
>>> >> >>configuration
>>> >> >> >>>???
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >> Thanks,
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread]
>>> >>INFO
>>> >> >> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing
>>> >> >>over to
>>> >> >> >>> rm2
>>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread]
>>> >>WARN
>>> >> >> >>> >> >> >> >>  security.UserGroupInformation -
>>> >> >>PriviledgedActionException
>>> >> >> >>> >> >> >>as:abc@XYZ
>>> >> >> >>> >> >> >> >> (auth:KERBEROS)
>>> >> >> >>> >> >>
>>> >>cause:org.apache.hadoop.security.AccessControlException:
>>> >> >> >>> >> >> >> >> Client cannot authenticate via:[TOKEN]
>>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread]
>>> >>WARN
>>> >> >> >>> >> >>ipc.Client -
>>> >> >> >>> >> >> >> >> Exception encountered while connecting to the
>>> server
>>> >>:
>>> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException:
>>> >> >>Client
>>> >> >> >>> >>cannot
>>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]
>>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread]
>>> >>WARN
>>> >> >> >>> >> >> >> >>  security.UserGroupInformation -
>>> >> >>PriviledgedActionException
>>> >> >> >>> >> >> >>as:abc@XYZ
>>> >> >> >>> >> >> >> >> (auth:KERBEROS) cause:java.io.IOException:
>>> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException:
>>> >> >>Client
>>> >> >> >>> >>cannot
>>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]
>>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread]
>>> >>INFO
>>> >> >> >>> >> >> >> >>  retry.RetryInvocationHandler - Exception while
>>> >>invoking
>>> >> >> >>> >>allocate
>>> >> >> >>> >> >>of
>>> >> >> >>> >> >> >> >>class
>>> >> >> >>> >> >> >> >> ApplicationMasterProtocolPBClientImpl over rm2
>>> after
>>> >>287
>>> >> >> >>>fail
>>> >> >> >>> >>over
>>> >> >> >>> >> >> >> >> attempts. Trying to fail over immediately.
>>> >> >> >>> >> >> >> >> java.io.IOException: Failed on local exception:
>>> >> >> >>> >> >>java.io.IOException:
>>> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException:
>>> >> >>Client
>>> >> >> >>> >>cannot
>>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]; Host Details : local host
>>> >>is:
>>> >> >> >>> >>"<SliderAM
>>> >> >> >>> >> >> >> >> HOST>/<slider AM Host IP>"; destination host is:
>>> >>"<RM2
>>> >> >> >>> >> >>HOST>":23130;
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >>
>>> >> >>
>>> >>>>>org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>org.apache.hadoop.ipc.Client.call(Client.java:1476)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>org.apache.hadoop.ipc.Client.call(Client.java:1403)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(Proto
>>> >>>>>>>>>>>>>>>bu
>>> >> >>>>>>>>>>>>>fR
>>> >> >> >>>>>>>>>>>pcE
>>> >> >> >>> >>>>>>>>ng
>>> >> >> >>> >> >>>>>>in
>>> >> >> >>> >> >> >>>>e.
>>> >> >> >>> >> >> >> >>java:230)
>>> >> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown
>>> >> >>Source)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterP
>>> >>>>>>>>>>>>>>>ro
>>> >> >>>>>>>>>>>>>to
>>> >> >> >>>>>>>>>>>col
>>> >> >> >>> >>>>>>>>PB
>>> >> >> >>> >> >>>>>>Cl
>>> >> >> >>> >> >> >>>>ie
>>> >> >> >>> >> >> >>
>>> >> >> >>>>>ntImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> sun.reflect.GeneratedMethodAccessor10.invoke(Unknown
>>> >> >> >>> >> >> >>Source)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>>> >>>>>>>>>>>>>>>th
>>> >> >>>>>>>>>>>>>od
>>> >> >> >>>>>>>>>>>Acc
>>> >> >> >>> >>>>>>>>es
>>> >> >> >>> >> >>>>>>so
>>> >> >> >>> >> >> >>>>rI
>>> >> >> >>> >> >> >> >>mpl.java:43)
>>> >> >> >>> >> >> >> >>         at
>>> >> >>java.lang.reflect.Method.invoke(Method.java:497)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMeth
>>> >>>>>>>>>>>>>>>od
>>> >> >>>>>>>>>>>>>(R
>>> >> >> >>>>>>>>>>>etr
>>> >> >> >>> >>>>>>>>yI
>>> >> >> >>> >> >>>>>>nv
>>> >> >> >>> >> >> >>>>oc
>>> >> >> >>> >> >> >> >>ationHandler.java:252)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(Ret
>>> >>>>>>>>>>>>>>>ry
>>> >> >>>>>>>>>>>>>In
>>> >> >> >>>>>>>>>>>voc
>>> >> >> >>> >>>>>>>>at
>>> >> >> >>> >> >>>>>>io
>>> >> >> >>> >> >> >>>>nH
>>> >> >> >>> >> >> >> >>andler.java:104)
>>> >> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown
>>> >> >>Source)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.alloca
>>> >>>>>>>>>>>>>>>te
>>> >> >>>>>>>>>>>>>(A
>>> >> >> >>>>>>>>>>>MRM
>>> >> >> >>> >>>>>>>>Cl
>>> >> >> >>> >> >>>>>>ie
>>> >> >> >>> >> >> >>>>nt
>>> >> >> >>> >> >> >> >>Impl.java:278)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsync
>>> >>>>>>>>>>>>>>>Im
>>> >> >>>>>>>>>>>>>pl
>>> >> >> >>>>>>>>>>>$He
>>> >> >> >>> >>>>>>>>ar
>>> >> >> >>> >> >>>>>>tb
>>> >> >> >>> >> >> >>>>ea
>>> >> >> >>> >> >> >> >>tThread.run(AMRMClientAsyncImpl.java:224)
>>> >> >> >>> >> >> >> >> Caused by: java.io.IOException:
>>> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException:
>>> >> >>Client
>>> >> >> >>> >>cannot
>>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >>
>>> >> >> >>>>>org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>java.security.AccessController.doPrivileged(Native
>>> >> >> >>> >> >>Method)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>javax.security.auth.Subject.doAs(Subject.java:422)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
>>> >>>>>>>>>>>>>>>up
>>> >> >>>>>>>>>>>>>In
>>> >> >> >>>>>>>>>>>for
>>> >> >> >>> >>>>>>>>ma
>>> >> >> >>> >> >>>>>>ti
>>> >> >> >>> >> >> >>>>on
>>> >> >> >>> >> >> >> >>.java:1671)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.handleSaslConnection
>>> >>>>>>>>>>>>>>>Fa
>>> >> >>>>>>>>>>>>>il
>>> >> >> >>>>>>>>>>>ure
>>> >> >> >>> >>>>>>>>(C
>>> >> >> >>> >> >>>>>>li
>>> >> >> >>> >> >> >>>>en
>>> >> >> >>> >> >> >> >>t.java:645)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.
>>> >>>>>>>>>>>>>ja
>>> >> >>>>>>>>>>>va
>>> >> >> >>>>>>>>>:73
>>> >> >> >>> >>>>>>3)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:37
>>> >>>>>>>>>0)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >>
>>> >>>>org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>org.apache.hadoop.ipc.Client.call(Client.java:1442)
>>> >> >> >>> >> >> >> >>         ... 12 more
>>> >> >> >>> >> >> >> >> Caused by:
>>> >> >> >>>org.apache.hadoop.security.AccessControlException:
>>> >> >> >>> >> >>Client
>>> >> >> >>> >> >> >> >> cannot authenticate via:[TOKEN]
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.security.SaslRpcClient.selectSaslClient(Sa
>>> >>>>>>>>>>>>>>>sl
>>> >> >>>>>>>>>>>>>Rp
>>> >> >> >>>>>>>>>>>cCl
>>> >> >> >>> >>>>>>>>ie
>>> >> >> >>> >> >>>>>>nt
>>> >> >> >>> >> >> >>>>.j
>>> >> >> >>> >> >> >> >>ava:172)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpc
>>> >>>>>>>>>>>>>>>Cl
>>> >> >>>>>>>>>>>>>ie
>>> >> >> >>>>>>>>>>>nt.
>>> >> >> >>> >>>>>>>>ja
>>> >> >> >>> >> >>>>>>va
>>> >> >> >>> >> >> >>>>:3
>>> >> >> >>> >> >> >> >>96)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(
>>> >>>>>>>>>>>>>>>Cl
>>> >> >>>>>>>>>>>>>ie
>>> >> >> >>>>>>>>>>>nt.
>>> >> >> >>> >>>>>>>>ja
>>> >> >> >>> >> >>>>>>va
>>> >> >> >>> >> >> >>>>:5
>>> >> >> >>> >> >> >> >>55)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:37
>>> >>>>>>>>>0)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >>
>>> >> >> >>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >>
>>> >> >> >>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>java.security.AccessController.doPrivileged(Native
>>> >> >> >>> >> >>Method)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>>javax.security.auth.Subject.doAs(Subject.java:422)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
>>> >>>>>>>>>>>>>>>up
>>> >> >>>>>>>>>>>>>In
>>> >> >> >>>>>>>>>>>for
>>> >> >> >>> >>>>>>>>ma
>>> >> >> >>> >> >>>>>>ti
>>> >> >> >>> >> >> >>>>on
>>> >> >> >>> >> >> >> >>.java:1671)
>>> >> >> >>> >> >> >> >>         at
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>>
>>> >>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.
>>> >>>>>>>>>>>>>ja
>>> >> >>>>>>>>>>>va
>>> >> >> >>>>>>>>>:72
>>> >> >> >>> >>>>>>0)
>>> >> >> >>> >> >> >> >>         ... 15 more
>>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread]
>>> >>INFO
>>> >> >> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing
>>> >> >>over to
>>> >> >> >>> rm1
>>> >> >> >>> >> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >> >>
>>> >> >> >>> >>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>>
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>>
>>>
>>
>

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Hi Gour,

I added the following properties to /etc/hadoop/conf/yarn-site.xml, emptied
/data/slider/conf/slider-client.xml, and restarted both RMs:

   - hadoop.registry.zk.quorum
   - hadoop.registry.zk.root
   - slider.yarn.queue
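
For reference, these might be declared in yarn-site.xml along these lines (a
sketch only; the ZK hosts, registry root, and queue name below are
placeholders, not values from this thread):

```xml
<property>
  <name>hadoop.registry.zk.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>hadoop.registry.zk.root</name>
  <value>/registry</value>
</property>
<property>
  <name>slider.yarn.queue</name>
  <value>default</value>
</property>
```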

Now there are no issues in creating or destroying the cluster. This helps, as
it keeps all configs in one location - thanks for the update.

I am still hitting the original issue - starting the application with RM1
active and then failing over from RM1 to RM2 leads to the slider-AM getting
"Client cannot authenticate via:[TOKEN]" errors.

I will upload the config files soon ...

Thanks,

On Thu, Jul 28, 2016 at 5:28 PM, Manoj Samel <ma...@gmail.com>
wrote:

> Thanks. I will test with the updated config and then upload the latest
> ones ...
>
> Thanks,
>
> Manoj
>
> On Thu, Jul 28, 2016 at 5:21 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> slider.zookeeper.quorum is deprecated and should not be used.
>> hadoop.registry.zk.quorum is used instead and is typically defined in
>> yarn-site.xml. So is hadoop.registry.zk.root.
>>
>> It is not encouraged to specify slider.yarn.queue at the cluster config
>> level. Ideally it is best to specify the queue during application
>> submission, so you can use the --queue option with the slider create cmd.
>> You can also set it on the command line using -D slider.yarn.queue=<>
>> during the create call. If indeed all slider apps should go to one and
>> only one queue, then this prop can be specified in any one of the existing
>> site xml files under /etc/hadoop/conf.
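
For example, the two command-line forms described above would look something
like this (a sketch; the app name "myapp" and queue "batch" are placeholders,
and the remaining create arguments are elided):

```
slider create myapp --queue batch ...
slider create myapp -D slider.yarn.queue=batch ...
```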
>>
>> -Gour
>>
>> On 7/28/16, 4:43 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>
>> >The following slider-specific properties are at present added in
>> >/data/slider/conf/slider-client.xml. If you think they should be picked
>> >up from HADOOP_CONF_DIR (/etc/hadoop/conf), to which file in
>> >HADOOP_CONF_DIR should these be added ?
>> >
>> >   - slider.zookeeper.quorum
>> >   - hadoop.registry.zk.quorum
>> >   - hadoop.registry.zk.root
>> >   - slider.yarn.queue
>> >
>> >
>> >On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha <gs...@hortonworks.com>
>> wrote:
>> >
>> >> That is strange, since slider-client.xml is indeed not required to
>> >> contain anything (except <configuration></configuration>) if
>> >> HADOOP_CONF_DIR has everything that Slider needs. This probably gives
>> >> an indication that there might be some issue with the cluster
>> >> configuration based on files solely under HADOOP_CONF_DIR to begin
>> >> with.
>> >>
>> >> I suggest you upload all the config files to the jira to help debug
>> >> this further.
>> >>
>> >> -Gour
>> >>
>> >> On 7/28/16, 4:27 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>> >>
>> >> >Thanks Gour for prompt reply
>> >> >
>> >> >BTW - Creating an empty slider-client.xml (with just
>> >> ><configuration></configuration>) does not work. The AM starts but
>> >> >fails to create any components and shows errors like
>> >> >
>> >> >2016-07-28 23:18:46,018
>> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>> >> > zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
>> >> >closing socket connection and attempting reconnect
>> >> >java.net.ConnectException: Connection refused
>> >> >        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >> >        at
>> >> >sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>> >> >        at
>> >> >org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>> >> >        at
>> >> >org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>> >> >
>> >> >Also, command "slider destroy <app>" fails with zookeeper errors ...
>> >> >
>> >> >I had to keep a minimal slider-client.xml. It does not have any RM
>> >> >info etc. but does contain slider ZK related properties like
>> >> >"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
>> >> >"hadoop.registry.zk.root". I haven't yet distilled the absolute
>> >> >minimal set of properties required, but this should suffice for now.
>> >> >All RM / HDFS properties will be read from HADOOP_CONF_DIR files.
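
A minimal slider-client.xml along those lines might look like this (a sketch;
the ZK quorum hosts and registry root are placeholders):

```xml
<configuration>
  <!-- slider.zookeeper.quorum is noted as deprecated elsewhere in this
       thread; hadoop.registry.zk.* are the properties that matter -->
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.root</name>
    <value>/registry</value>
  </property>
</configuration>
```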
>> >> >
>> >> >Let me know if this could cause any issues.
>> >> >
>> >> >On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com>
>> >>wrote:
>> >> >
>> >> >> No need to copy any files. Pointing HADOOP_CONF_DIR to
>> >> >> /etc/hadoop/conf is good.
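
Concretely, the relevant slider-env.sh line would be something like this (a
sketch; the path is the one discussed in this thread):

```shell
# slider-env.sh: point Slider at the cluster's existing Hadoop config
# instead of copying files into SLIDER_HOME/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
```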
>> >> >>
>> >> >> -Gour
>> >> >>
>> >> >> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com>
>> wrote:
>> >> >>
>> >> >> >Follow up question regarding Gour's comment in earlier thread -
>> >> >> >
>> >> >> >Slider is installed on one of the hadoop nodes. The SLIDER_HOME/conf
>> >> >> >directory (say /data/slider/conf) is different from HADOOP_CONF_DIR
>> >> >> >(/etc/hadoop/conf). Is it required/recommended that files in
>> >> >> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and the slider-env.sh
>> >> >> >script set HADOOP_CONF_DIR to /data/slider/conf ?
>> >> >> >
>> >> >> >Or can slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf,
>> >> >> >without copying the files ?
>> >> >> >
>> >> >> >Using slider 0.80 for now, but would like to know the recommendation
>> >> >> >for this and future versions as well.
>> >> >> >
>> >> >> >Thanks in advance,
>> >> >> >
>> >> >> >Manoj
>> >> >> >
>> >> >> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel <manojsameltech@gmail.com>
>> >> >> >wrote:
>> >> >> >
>> >> >> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs
>> >> >> >> and my analysis of the logs.
>> >> >> >>
>> >> >> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha
>> >><gs...@hortonworks.com>
>> >> >> >>wrote:
>> >> >> >>
>> >> >> >>> Please file a JIRA and upload the logs to it.
>> >> >> >>>
>> >> >> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com>
>> >> >>wrote:
>> >> >> >>>
>> >> >> >>> >Hi Gour,
>> >> >> >>> >
>> >> >> >>> >Can you please reach me using your own email-id? I will then
>> >> >> >>> >send logs to you, along with my analysis - I don't want to send
>> >> >> >>> >logs on the public list
>> >> >> >>> >
>> >> >> >>> >Thanks,
>> >> >> >>> >
>> >> >> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha
>> >><gs...@hortonworks.com>
>> >> >> >>> wrote:
>> >> >> >>> >
>> >> >> >>> >> Ok, so this node is not a gateway. It is part of the cluster,
>> >> >> >>> >> which means you don't need slider-client.xml at all. Just have
>> >> >> >>> >> HADOOP_CONF_DIR pointing to /etc/hadoop/conf in slider-env.sh
>> >> >> >>> >> and that should be it.
>> >> >> >>> >>
>> >> >> >>> >> So the above simplifies your config setup. It will not solve
>> >> >> >>> >> either of the 2 problems you are facing.
>> >> >> >>> >>
>> >> >> >>> >> Now coming to the 2 issues you are facing, you have to provide
>> >> >> >>> >> additional logs for us to understand better. Let's start with -
>> >> >> >>> >> 1. RM logs (specifically between the time when rm1->rm2
>> >> >> >>> >> failover is simulated)
>> >> >> >>> >> 2. Slider App logs
>> >> >> >>> >>
>> >> >> >>> >> -Gour
>> >> >> >>> >>
>> >> >> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <manojsameltech@gmail.com>
>> >> >> >>> >> wrote:
>> >> >> >>> >>
>> >> >> >>> >> >   1. Not clear about your question on "gateway" node. The
>> >>node
>> >> >> >>> running
>> >> >> >>> >> >   slider is part of the hadoop cluster and there are other
>> >> >> >>>services
>> >> >> >>> >>like
>> >> >> >>> >> >   Oozie that run on this node that utilizes hdfs and yarn.
>> >>So
>> >> >>if
>> >> >> >>>your
>> >> >> >>> >> >   question is whether the node is otherwise working for
>> HDFS
>> >> >>and
>> >> >> >>>Yarn
>> >> >> >>> >> >   configuration, it is working
>> >> >> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
>> >> >> >>>/etc/hadoop/conf)
>> >> >> >>> to
>> >> >> >>> >> >the
>> >> >> >>> >> >   directory containing slider-client.xml (say
>> >> >>/data/latest/conf)
>> >> >> >>> >> >   3. In earlier email, I had done a mistake where
>> >>slider-env.sh
>> >> >> >>>file
>> >> >> >>> >> >HADOOP_CONF_DIR
>> >> >> >>> >> >   was pointing to original directory /etc/hadoop/conf. I
>> >>edited
>> >> >> >>>it to
>> >> >> >>> >> >   point to same directory containing slider-client.xml &
>> >> >> >>> slider-env.sh
>> >> >> >>> >> >i.e.
>> >> >> >>> >> >   /data/latest/conf
>> >> >> >>> >> >   4. I emptied slider-client.xml. It just had the
>> >> >> >>> >> ><configuration></configuration>.
>> >> >> >>> >> >   The creation of spas worked but the Slider AM still shows
>> >>the
>> >> >> >>>same
>> >> >> >>> >> >issue.
>> >> >> >>> >> >   i.e. when RM1 goes from active to standby, slider AM goes
>> >> >>from
>> >> >> >>> >>RUNNING
>> >> >> >>> >> >to
>> >> >> >>> >> >   ACCPTED state with same error about TOKEN. Also NOTE that
>> >> >>when
>> >> >> >>> >> >   slider-client.xml is empty, the "slider destroy xxx"
>> >>command
>> >> >> >>>still
>> >> >> >>> >> >fails
>> >> >> >>> >> >   with Zookeeper connection errors.
>> >> >> >>> >> >   5. I then added same parameters (as my last email -
>> except
>> >> >> >>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
>> >> >> >>> >>slider-env.sh
>> >> >> >>> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
>> >> >> >>> >>slider-client.xml
>> >> >> >>> >> >   does not have HADOOP_CONF_DIR. The same issue exists (but
>> >> >> >>>"slider
>> >> >> >>> >> >   destroy" does not fails)
>> >> >> >>> >> >   6. Could you explain what do you expect to pick up from
>> >> >>Hadoop
>> >> >> >>> >> >   configurations that will help you in RM Token ? If slider
>> >>has
>> >> >> >>>token
>> >> >> >>> >> >from
>> >> >> >>> >> >   RM1, and it switches to RM2, not clear what slider does
>> to
>> >> >>get
>> >> >> >>> >> >delegation
>> >> >> >>> >> >   token for RM2 communication ?
>> >> >> >>> >> >   7. It is worth repeating again that issue happens only
>> >>when
>> >> >>RM1
>> >> >> >>>was
>> >> >> >>> >> >   active when slider app was created and then RM1 becomes
>> >> >> >>>standby. If
>> >> >> >>> >> >RM2 was
>> >> >> >>> >> >   active when slider app was created, then slider AM keeps
>> >> >>running
>> >> >> >>> for
>> >> >> >>> >> >any
>> >> >> >>> >> >   number of switches between RM2 and RM1 back and forth ...
>> >> >> >>> >> >
>> >> >> >>> >> >
>> >> >> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha
>> >> >><gs...@hortonworks.com>
>> >> >> >>> >>wrote:
>> >> >> >>> >> >
>> >> >> >>> >> >> The node you are running slider from, is that a gateway node?
>> >> >> >>> >> >> Sorry for not being explicit. I meant copy everything under
>> >> >> >>> >> >> /etc/hadoop/conf from your cluster into some temp directory
>> >> >> >>> >> >> (say /tmp/hadoop_conf) in your gateway node or local or
>> >> >> >>> >> >> whichever node you are running slider from. Then set
>> >> >> >>> >> >> HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out
>> >> >> >>> >> >> from slider-client.xml.
>> >> >> >>> >> >>
>> >> >> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>> >> >> >>> >> >>
>> >> >> >>> >> >> >Hi Gour,
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >Thanks for your prompt reply.
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >FYI, the issue happens when I create the slider app when rm1
>> >> >> >>> >> >> >is active and rm1 then fails over to rm2. As soon as rm2
>> >> >> >>> >> >> >becomes active, the slider AM goes from RUNNING to ACCEPTED
>> >> >> >>> >> >> >state with the above error.
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >For your suggestion, I did the following:
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site
>> >> >> >>> >> >> >from HADOOP_CONF_DIR to the slider conf directory.
>> >> >> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
>> >> >> >>> >> >> >3) I removed all properties from slider-client.xml EXCEPT the
>> >> >> >>> >> >> >following
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >   - HADOOP_CONF_DIR
>> >> >> >>> >> >> >   - slider.yarn.queue
>> >> >> >>> >> >> >   - slider.zookeeper.quorum
>> >> >> >>> >> >> >   - hadoop.registry.zk.quorum
>> >> >> >>> >> >> >   - hadoop.registry.zk.root
>> >> >> >>> >> >> >   - hadoop.security.authorization
>> >> >> >>> >> >> >   - hadoop.security.authentication
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >Then I made rm1 active, installed and created the slider
>> >> >> >>> >> >> >app, and restarted rm1 (to make rm2 active). The slider-am
>> >> >> >>> >> >> >again went from RUNNING to ACCEPTED state.
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >Let me know if you want me to try further changes.
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >If I make the slider-client.xml completely empty per your
>> >> >> >>> >> >> >suggestion, only the slider AM comes up but it fails to
>> >> >> >>> >> >> >start components. The AM log shows errors trying to connect
>> >> >> >>> >> >> >to zookeeper like below.
>> >> >> >>> >> >> >2016-07-25 23:07:41,532
>> >> >> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)]
>> >>WARN
>> >> >> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null,
>> >> >>unexpected
>> >> >> >>> >>error,
>> >> >> >>> >> >> >closing socket connection and attempting reconnect
>> >> >> >>> >> >> >java.net.ConnectException: Connection refused
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >Hence I kept minimal info in slider-client.xml
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >FYI This is slider version 0.80
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >Thanks,
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >Manoj
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
>> >> >> >>><gs...@hortonworks.com>
>> >> >> >>> >> >>wrote:
>> >> >> >>> >> >> >
>> >> >> >>> >> >> >> If possible, can you copy the entire content of the
>> >> >> >>> >> >> >> directory /etc/hadoop/conf and then set HADOOP_CONF_DIR in
>> >> >> >>> >> >> >> slider-env.sh to it. Keep slider-client.xml empty.
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
>> >> >> >>> >> >> >> failovers, do you see the same behaviors?
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> -Gour
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >> >Another observation (whatever it is worth)
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >If the slider app is created and started when rm2 was
>> >> >> >>> >> >> >> >active, then it seems to survive switches between rm2 and
>> >> >> >>> >> >> >> >rm1 (and back). I.e.
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >* rm2 is active
>> >> >> >>> >> >> >> >* create and start slider application
>> >> >> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
>> >> >> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >So, it seems if it starts with rm1 active, then the AM
>> >> >> >>> >> >> >> >goes to "ACCEPTED" state when RM fails over to rm2. If it
>> >> >> >>> >> >> >> >starts with rm2 active, then it runs fine with any
>> >> >> >>> >> >> >> >switches between rm1 and rm2.
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >Any feedback ?
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >Thanks,
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >Manoj
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>> >> >> >>> >> >> >> ><ma...@gmail.com> wrote:
>> >> >> >>> >> >> >> >
>> >> >> >>> >> >> >> >> Setup
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
>> >> >> >>> >> >> >> >> - Slider 0.80
>> >> >> >>> >> >> >> >> - In my slider-client.xml, I have added all RM HA properties,
>> >> >> >>> >> >> >> >> including the ones mentioned in
>> >> >> >>> >> >> >> >> http://markmail.org/message/wnhpp2zn6ixo65e3.
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> Following is the issue
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> * rm1 is active, rm2 is standby
>> >> >> >>> >> >> >> >> * deploy and start slider application, it runs fine
>> >> >> >>> >> >> >> >> * restart rm1, rm2 is now active.
>> >> >> >>> >> >> >> >> * The slider-am now goes from running into "ACCEPTED" mode. It
>> >> >> >>> >> >> >> >> stays there till rm1 is made active again.
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> In the slider-am log, it tries to connect to RM2 and connection
>> >> >> >>> >> >> >> >> fails due to org.apache.hadoop.security.AccessControlException:
>> >> >> >>> >> >> >> >> Client cannot authenticate via:[TOKEN]. See detailed log below
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> It seems it has some token (delegation token?) for RM1 but tries
>> >> >> >>> >> >> >> >> to use same(?) for RM2 and fails. Am I missing some
>> >> >> >>> >> >> >> >> configuration ???
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> Thanks,
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
>> >> >> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>> >> >> >>> >> >> >> >>  security.UserGroupInformation - PriviledgedActionException
>> >> >> >>> >> >> >> >> as:abc@XYZ (auth:KERBEROS)
>> >> >> >>> >> >> >> >> cause:org.apache.hadoop.security.AccessControlException:
>> >> >> >>> >> >> >> >> Client cannot authenticate via:[TOKEN]
>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>> >> >> >>> >> >> >> >> ipc.Client - Exception encountered while connecting to the
>> >> >> >>> >> >> >> >> server : org.apache.hadoop.security.AccessControlException:
>> >> >> >>> >> >> >> >> Client cannot authenticate via:[TOKEN]
>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>> >> >> >>> >> >> >> >>  security.UserGroupInformation - PriviledgedActionException
>> >> >> >>> >> >> >> >> as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException:
>> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]
>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>> >> >> >>> >> >> >> >>  retry.RetryInvocationHandler - Exception while invoking allocate
>> >> >> >>> >> >> >> >> of class ApplicationMasterProtocolPBClientImpl over rm2 after 287
>> >> >> >>> >> >> >> >> fail over attempts. Trying to fail over immediately.
>> >> >> >>> >> >> >> >> java.io.IOException: Failed on local exception:
>> >> >> >>> >> >> >> >> java.io.IOException:
>> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM
>> >> >> >>> >> >> >> >> HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>> >> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>> >> >> >>> >> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>> >> >> >>> >> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >> >> >>> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>> >> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>> >> >> >>> >> >> >> >> Caused by: java.io.IOException:
>> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
>> >> >> >>> >> >> >> >> authenticate via:[TOKEN]
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>> >> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>> >> >> >>> >> >> >> >>         ... 12 more
>> >> >> >>> >> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException:
>> >> >> >>> >> >> >> >> Client cannot authenticate via:[TOKEN]
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>> >> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
>> >> >> >>> >> >> >> >>         ... 15 more
>> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread]
>> >>INFO
>> >> >> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing
>> >> >>over to
>> >> >> >>> rm1
>> >> >> >>> >> >> >> >>
>> >> >> >>> >> >> >>
>> >> >> >>> >> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >> >>
>> >> >> >>> >>
>> >> >> >>> >>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>
>

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Thanks. I will test with the updated config and then upload the latest ones
...

Thanks,

Manoj

On Thu, Jul 28, 2016 at 5:21 PM, Gour Saha <gs...@hortonworks.com> wrote:

> slider.zookeeper.quorum is deprecated and should not be used.
> hadoop.registry.zk.quorum is used instead and is typically defined in
> yarn-site.xml. So is hadoop.registry.zk.root.
>
> It is not encouraged to specify slider.yarn.queue at the cluster config
> level. Ideally it is best to specify the queue during the application
> submission. So you can use --queue option with slider create cmd. You can
> also set it on the command line using -D slider.yarn.queue=<> during the
> create call. If indeed all slider apps should go to one and only one
> queue, then this prop can be specified in any one of the existing site xml
> files under /etc/hadoop/conf.
>
> -Gour
>
> On 7/28/16, 4:43 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>
> >Following slider specific properties are at present added in
> >/data/slider/conf/slider-client.xml. If you think they should be picked up
> >from HADOOP_CONF_DIR (/etc/hadoop/conf) file, which file in
> >HADOOP_CONF_DIR
> >should these be added ?
> >
> >   - slider.zookeeper.quorum
> >   - hadoop.registry.zk.quorum
> >   - hadoop.registry.zk.root
> >   - slider.yarn.queue
> >
> >
> >On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> >> That is strange, since it is indeed not required to contain anything in
> >> slider-client.xml (except <configuration></configuration>) if
> >> HADOOP_CONF_DIR has everything that Slider needs. This probably gives an
> >> indication that there might be some issue with cluster configuration
> >>based
> >> on files solely under HADOOP_CONF_DIR to begin with.
> >>
> >> I suggest you upload all the config files to the JIRA to help debug
> >>this
> >> further.
> >>
> >> -Gour
> >>
> >> On 7/28/16, 4:27 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >>
> >> >Thanks Gour for prompt reply
> >> >
> >> >BTW - Creating an empty slider-client.xml (with just
> >> ><configuration></configuration>) does not work. The AM starts but
> >>fails
> >> >to
> >> >create any components and shows errors like
> >> >
> >> >2016-07-28 23:18:46,018
> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> >> > zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
> >> >closing socket connection and attempting reconnect
> >> >java.net.ConnectException: Connection refused
> >> >        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >> >        at
> >> >sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> >> >        at
> >>
> >>>org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO
> >>>.j
> >> >ava:361)
> >> >        at
> >> >org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> >> >
> >> >Also, command "slider destroy <app>" fails with zookeeper errors ...
> >> >
> >> >I had to keep a minimal slider-client.xml. It does not have any RM info
> >> >etc. but does contain slider ZK related properties like
> >> >"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
> >> >"hadoop.registry.zk.root". I haven't yet distilled the absolute minimal
> >> >set
> >> >of properties required, but this should suffice for now. All RM / HDFS
> >> >properties will be read from HADOOP_CONF_DIR files.
> >> >
> >> >Let me know if this could cause any issues.
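
A sketch of the minimal slider-client.xml described above. The ZooKeeper quorum hosts and ZK root below are placeholders, not values from this thread, and note that slider.zookeeper.quorum is called out as deprecated elsewhere in this thread in favor of hadoop.registry.zk.quorum:

```xml
<?xml version="1.0"?>
<!-- Minimal slider-client.xml sketch: only the registry/ZK properties;
     all RM and HDFS settings are read from the HADOOP_CONF_DIR files.
     Host names and the ZK root here are placeholders. -->
<configuration>
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.root</name>
    <value>/registry</value>
  </property>
</configuration>
```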
> >> >
> >> >On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com>
> >>wrote:
> >> >
> >> >> No need to copy any files. Pointing HADOOP_CONF_DIR to
> >>/etc/hadoop/conf
> >> >>is
> >> >> good.
> >> >>
> >> >> -Gour
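
For anyone following along, the slider-env.sh setting discussed above is a one-line export; /etc/hadoop/conf is the path used in this thread, so adjust it for your own cluster:

```shell
# slider-env.sh (sketch): point Slider at the cluster's existing Hadoop
# client configuration instead of copying files into SLIDER_HOME/conf.
# The path below is the one discussed in this thread.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Sanity check: this directory should contain core-site.xml,
# hdfs-site.xml and yarn-site.xml.
echo "HADOOP_CONF_DIR=${HADOOP_CONF_DIR}"
```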
> >> >>
> >> >> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >> >>
> >> >> >Follow up question regarding Gour's comment in earlier thread -
> >> >> >
> >> >> >Slider is installed on one of the hadoop nodes. SLIDER_HOME/conf
> >> >>directory
> >> >> >(say /data/slider/conf) is different than HADOOP_CONF_DIR
> >> >> >(/etc/hadoop/conf). Is it required/recommended that files in
> >> >> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and slider-env.sh
> >>script
> >> >> >sets
> >> >> >HADOOP_CONF_DIR to /data/slider/conf ?
> >> >> >
> >> >> >Or can the slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf ,
> >> >>without
> >> >> >copying the files ?
> >> >> >
> >> >> >Using slider .80 for now, but would like to know recommendation for
> >> >>this
> >> >> >and future versions as well.
> >> >> >
> >> >> >Thanks in advance,
> >> >> >
> >> >> >Manoj
> >> >> >
> >> >> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel
> >><manojsameltech@gmail.com
> >> >
> >> >> >wrote:
> >> >> >
> >> >> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs
> >> and
> >> >> my
> >> >> >> analysis of logs.
> >> >> >>
> >> >> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha
> >><gs...@hortonworks.com>
> >> >> >>wrote:
> >> >> >>
> >> >> >>> Please file a JIRA and upload the logs to it.
> >> >> >>>
> >> >> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com>
> >> >>wrote:
> >> >> >>>
> >> >> >>> >Hi Gour,
> >> >> >>> >
> >> >> >>> >Can you please reach me using your own email-id? I will then
> >>send
> >> >> >>>logs to
> >> >> >>> >you, along with my analysis - I don't want to send logs on
> >>public
> >> >>list
> >> >> >>> >
> >> >> >>> >Thanks,
> >> >> >>> >
> >> >> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha
> >><gs...@hortonworks.com>
> >> >> >>> wrote:
> >> >> >>> >
> >> >> >>> >> Ok, so this node is not a gateway. It is part of the cluster,
> >> >>which
> >> >> >>> >>means
> >> >> >>> >> you don't need slider-client.xml at all. Just have
> >> >>HADOOP_CONF_DIR
> >> >> >>> >> pointing to /etc/hadoop/conf in slider-env.sh and that should
> >>be
> >> >>it.
> >> >> >>> >>
> >> >> >>> >> So the above simplifies your config setup. It will not solve
> >> >>either
> >> >> >>>of
> >> >> >>> >>the
> >> >> >>> >> 2 problems you are facing.
> >> >> >>> >>
> >> >> >>> >> Now coming to the 2 issues you are facing, you have to provide
> >> >> >>> >>additional
> >> >> >>> >> logs for us to understand better. Let's start with  -
> >> >> >>> >> 1. RM logs (specifically between the time when rm1->rm2
> >>failover
> >> >>is
> >> >> >>> >> simulated)
> >> >> >>> >> 2. Slider App logs
> >> >> >>> >>
> >> >> >>> >> -Gour
> >> >> >>> >>
> >> >> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com>
> >> >> wrote:
> >> >> >>> >>
> >> >> >>> >> >   1. Not clear about your question on "gateway" node. The
> >>node
> >> >> >>> running
> >> >> >>> >> >   slider is part of the hadoop cluster and there are other
> >> >> >>>services
> >> >> >>> >>like
> >> >> >>> >> >   Oozie that run on this node that utilizes hdfs and yarn.
> >>So
> >> >>if
> >> >> >>>your
> >> >> >>> >> >   question is whether the node is otherwise working for HDFS
> >> >>and
> >> >> >>>Yarn
> >> >> >>> >> >   configuration, it is working
> >> >> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
> >> >> >>>/etc/hadoop/conf)
> >> >> >>> to
> >> >> >>> >> >the
> >> >> >>> >> >   directory containing slider-client.xml (say
> >> >>/data/latest/conf)
> >> >> >>> >> >   3. In an earlier email, I had made a mistake where
> >>slider-env.sh
> >> >> >>>file
> >> >> >>> >> >HADOOP_CONF_DIR
> >> >> >>> >> >   was pointing to original directory /etc/hadoop/conf. I
> >>edited
> >> >> >>>it to
> >> >> >>> >> >   point to same directory containing slider-client.xml &
> >> >> >>> slider-env.sh
> >> >> >>> >> >i.e.
> >> >> >>> >> >   /data/latest/conf
> >> >> >>> >> >   4. I emptied slider-client.xml. It just had the
> >> >> >>> >> ><configuration></configuration>.
> >> >> >>> >> >   The creation of spas worked but the Slider AM still shows
> >>the
> >> >> >>>same
> >> >> >>> >> >issue.
> >> >> >>> >> >   i.e. when RM1 goes from active to standby, slider AM goes
> >> >>from
> >> >> >>> >>RUNNING
> >> >> >>> >> >to
> >> >> >>> >> >   ACCEPTED state with same error about TOKEN. Also NOTE that
> >> >>when
> >> >> >>> >> >   slider-client.xml is empty, the "slider destroy xxx"
> >>command
> >> >> >>>still
> >> >> >>> >> >fails
> >> >> >>> >> >   with Zookeeper connection errors.
> >> >> >>> >> >   5. I then added same parameters (as my last email - except
> >> >> >>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
> >> >> >>> >>slider-env.sh
> >> >> >>> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
> >> >> >>> >>slider-client.xml
> >> >> >>> >> >   does not have HADOOP_CONF_DIR. The same issue exists (but
> >> >> >>>"slider
> >> >> >>> >> >   destroy" does not fail)
> >> >> >>> >> >   6. Could you explain what do you expect to pick up from
> >> >>Hadoop
> >> >> >>> >> >   configurations that will help you in RM Token ? If slider
> >>has
> >> >> >>>token
> >> >> >>> >> >from
> >> >> >>> >> >   RM1, and it switches to RM2, not clear what slider does to
> >> >>get
> >> >> >>> >> >delegation
> >> >> >>> >> >   token for RM2 communication ?
> >> >> >>> >> >   7. It is worth repeating again that issue happens only
> >>when
> >> >>RM1
> >> >> >>>was
> >> >> >>> >> >   active when slider app was created and then RM1 becomes
> >> >> >>>standby. If
> >> >> >>> >> >RM2 was
> >> >> >>> >> >   active when slider app was created, then slider AM keeps
> >> >>running
> >> >> >>> for
> >> >> >>> >> >any
> >> >> >>> >> >   number of switches between RM2 and RM1 back and forth ...
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha
> >> >><gs...@hortonworks.com>
> >> >> >>> >>wrote:
> >> >> >>> >> >
> >> >> >>> >> >> The node you are running slider from, is that a gateway
> >>node?
> >> >> >>>Sorry
> >> >> >>> >>for
> >> >> >>> >> >> not being explicit. I meant copy everything under
> >> >> >>>/etc/hadoop/conf
> >> >> >>> >>from
> >> >> >>> >> >> your cluster into some temp directory (say
> >>/tmp/hadoop_conf)
> >> >>in
> >> >> >>>your
> >> >> >>> >> >> gateway node or local or whichever node you are running
> >>slider
> >> >> >>>from.
> >> >> >>> >> >>Then
> >> >> >>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear
> >>everything
> >> >>out
> >> >> >>> from
> >> >> >>> >> >> slider-client.xml.
> >> >> >>> >> >>
> >> >> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel"
> >><ma...@gmail.com>
> >> >> >>> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >> >Hi Gour,
> >> >> >>> >> >> >
> >> >> >>> >> >> >Thanks for your prompt reply.
> >> >> >>> >> >> >
> >> >> >>> >> >> >FYI, issue happens when I create slider app when rm1 is
> >> >>active
> >> >> >>>and
> >> >> >>> >>when
> >> >> >>> >> >> >rm1
> >> >> >>> >> >> >fails over to rm2. As soon as rm2 becomes active; the
> >>slider
> >> >>AM
> >> >> >>> goes
> >> >> >>> >> >>from
> >> >> >>> >> >> >RUNNING to ACCEPTED state with above error.
> >> >> >>> >> >> >
> >> >> >>> >> >> >For your suggestion, I did following
> >> >> >>> >> >> >
> >> >> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site
> >> >>from
> >> >> >>> >> >> >HADOOP_CONF_DIR
> >> >> >>> >> >> >to slider conf directory.
> >> >> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
> >> >> >>> >> >> >3) I removed all properties from slider-client.xml EXCEPT
> >> >> >>>following
> >> >> >>> >> >> >
> >> >> >>> >> >> >   - HADOOP_CONF_DIR
> >> >> >>> >> >> >   - slider.yarn.queue
> >> >> >>> >> >> >   - slider.zookeeper.quorum
> >> >> >>> >> >> >   - hadoop.registry.zk.quorum
> >> >> >>> >> >> >   - hadoop.registry.zk.root
> >> >> >>> >> >> >   - hadoop.security.authorization
> >> >> >>> >> >> >   - hadoop.security.authentication
> >> >> >>> >> >> >
> >> >> >>> >> >> >Then I made rm1 active, installed and created slider app
> >>and
> >> >> >>> >>restarted
> >> >> >>> >> >>rm1
> >> >> >>> >> >> >(to make rm2) active. The slider-am again went from
> >>RUNNING
> >> >>to
> >> >> >>> >>ACCEPTED
> >> >> >>> >> >> >state.
> >> >> >>> >> >> >
> >> >> >>> >> >> >Let me know if you want me to try further changes.
> >> >> >>> >> >> >
> >> >> >>> >> >> >If I make the slider-client.xml completely empty per your
> >> >> >>> >>suggestion,
> >> >> >>> >> >>only
> >> >> >>> >> >> >slider AM comes up but it
> >> >> >>> >> >> >fails to start components. The AM log shows errors trying
> >>to
> >> >> >>> >>connect to
> >> >> >>> >> >> >zookeeper like below.
> >> >> >>> >> >> >2016-07-25 23:07:41,532
> >> >> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)]
> >>WARN
> >> >> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null,
> >> >>unexpected
> >> >> >>> >>error,
> >> >> >>> >> >> >closing socket connection and attempting reconnect
> >> >> >>> >> >> >java.net.ConnectException: Connection refused
> >> >> >>> >> >> >
> >> >> >>> >> >> >Hence I kept minimal info in slider-client.xml
> >> >> >>> >> >> >
> >> >> >>> >> >> >FYI This is slider version 0.80
> >> >> >>> >> >> >
> >> >> >>> >> >> >Thanks,
> >> >> >>> >> >> >
> >> >> >>> >> >> >Manoj
> >> >> >>> >> >> >
> >> >> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
> >> >> >>><gs...@hortonworks.com>
> >> >> >>> >> >>wrote:
> >> >> >>> >> >> >
> >> >> >>> >> >> >> If possible, can you copy the entire content of the
> >> >>directory
> >> >> >>> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in
> >> >> >>>slider-env.sh to
> >> >> >>> >>it.
> >> >> >>> >> >> >>Keep
> >> >> >>> >> >> >> slider-client.xml empty.
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
> >> >> >>>failovers,
> >> >> >>> >>do
> >> >> >>> >> >>you
> >> >> >>> >> >> >> see the same behaviors?
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> -Gour
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel"
> >> >><ma...@gmail.com>
> >> >> >>> >>wrote:
> >> >> >>> >> >> >>
> >> >> >>> >> >> >> >Another observation (whatever it is worth)
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >If slider app is created and started when rm2 was
> >>active,
> >> >> >>>then
> >> >> >>> it
> >> >> >>> >> >> >>seems to
> >> >> >>> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >* rm2 is active
> >> >> >>> >> >> >> >* create and start slider application
> >> >> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
> >> >> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >So, it seems if it starts with rm1 active, then the AM
> >> >>goes
> >> >> >>>to
> >> >> >>> >> >> >>"ACCEPTED"
> >> >> >>> >> >> >> >state when RM fails to rm2. If it starts with rm2
> >>active,
> >> >> >>>then
> >> >> >>> it
> >> >> >>> >> >>runs
> >> >> >>> >> >> >> >fine
> >> >> >>> >> >> >> >with any switches between rm1 and rm2.
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >Any feedback ?
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >Thanks,
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >Manoj
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
> >> >> >>> >> >> >><ma...@gmail.com>
> >> >> >>> >> >> >> >wrote:
> >> >> >>> >> >> >> >
> >> >> >>> >> >> >> >> Setup
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >> >> >>> >> >> >> >> - Slider 0.80
> >> >> >>> >> >> >> >> - In my slider-client.xml, I have added all RM HA properties,
> >> >> >>> >> >> >> >> including the ones mentioned in
> >> >> >>> >> >> >> >> http://markmail.org/message/wnhpp2zn6ixo65e3.
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >> Following is the issue
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >> * rm1 is active, rm2 is standby
> >> >> >>> >> >> >> >> * deploy and start slider application, it runs fine
> >> >> >>> >> >> >> >> * restart rm1, rm2 is now active.
> >> >> >>> >> >> >> >> * The slider-am now goes from running into "ACCEPTED" mode.
> >> >> >>> >> >> >> >> It stays there till rm1 is made active again.
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >> In the slider-am log, it tries to connect to RM2 and
> >> >> >>> >> >> >> >> connection fails due to
> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
> >> >> >>> >> >> >> >> cannot authenticate via:[TOKEN]. See detailed log below
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >> It seems it has some token (delegation token?) for RM1 but
> >> >> >>> >> >> >> >> tries to use same(?) for RM2 and fails. Am I missing some
> >> >> >>> >> >> >> >> configuration ???
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >> Thanks,
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >>
> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
> >> >> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >> >> >>> >> >> >> >>  security.UserGroupInformation - PriviledgedActionException
> >> >> >>> >> >> >> >> as:abc@XYZ (auth:KERBEROS)
> >> >> >>> >> >> >> >> cause:org.apache.hadoop.security.AccessControlException:
> >> >> >>> >> >> >> >> Client cannot authenticate via:[TOKEN]
> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >> >> >>> >> >> >> >>  ipc.Client - Exception encountered while connecting to the
> >> >> >>> >> >> >> >> server : org.apache.hadoop.security.AccessControlException:
> >> >> >>> >> >> >> >> Client cannot authenticate via:[TOKEN]
> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >> >> >>> >> >> >> >>  security.UserGroupInformation - PriviledgedActionException
> >> >> >>> >> >> >> >> as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException:
> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
> >> >> >>> >> >> >> >> cannot authenticate via:[TOKEN]
> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
> >> >> >>> >> >> >> >>  retry.RetryInvocationHandler - Exception while invoking
> >> >> >>> >> >> >> >> allocate of class ApplicationMasterProtocolPBClientImpl over
> >> >> >>> >> >> >> >> rm2 after 287 fail over attempts. Trying to fail over
> >> >> >>> >> >> >> >> immediately.
> >> >> >>> >> >> >> >> java.io.IOException: Failed on local exception:
> >> >> >>> >> >> >> >> java.io.IOException:
> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
> >> >> >>> >> >> >> >> cannot authenticate via:[TOKEN]; Host Details : local host
> >> >> >>> >> >> >> >> is: "<SliderAM HOST>/<slider AM Host IP>"; destination host
> >> >> >>> >> >> >> >> is: "<RM2 HOST>":23130;
> >> >> >>> >> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> >> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >> >>> >> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> >> >> >>> >> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >> >>> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> >> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> >> >> >>> >> >> >> >> Caused by: java.io.IOException:
> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
> >> >> >>> >> >> >> >> cannot authenticate via:[TOKEN]
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
> >> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> >> >> >>> >> >> >> >>         ... 12 more
> >> >> >>> >> >> >> >> Caused by:
> >> >> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
> >> >> >>> >> >> >> >> cannot authenticate via:[TOKEN]
> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
> >> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
> >> >> >>> >> >> >> >>         ... 15 more
> >> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
> >> >> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Gour Saha <gs...@hortonworks.com>.
slider.zookeeper.quorum is deprecated and should not be used.
hadoop.registry.zk.quorum is used instead and is typically defined in
yarn-site.xml. So is hadoop.registry.zk.root.

It is not encouraged to specify slider.yarn.queue at the cluster config
level. Ideally it is best to specify the queue during the application
submission. So you can use --queue option with slider create cmd. You can
also set it on the command line using -D slider.yarn.queue=<> during the
create call. If indeed all slider apps should go to one and only one
queue, then this prop can be specified in any one of the existing site xml
files under /etc/hadoop/conf.
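
To illustrate the two options above (the application name and the template/resources file names are placeholders, not from this thread):

```shell
# Option 1: name the queue explicitly at create time
slider create myapp --template appConfig.json --resources resources.json \
  --queue my-slider-queue

# Option 2: set the property for just this invocation
slider create myapp --template appConfig.json --resources resources.json \
  -D slider.yarn.queue=my-slider-queue
```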

-Gour

On 7/28/16, 4:43 PM, "Manoj Samel" <ma...@gmail.com> wrote:

>Following slider specific properties are at present added in
>/data/slider/conf/slider-client.xml. If you think they should be picked up
>from HADOOP_CONF_DIR (/etc/hadoop/conf) file, which file in
>HADOOP_CONF_DIR
>should these be added ?
>
>   - slider.zookeeper.quorum
>   - hadoop.registry.zk.quorum
>   - hadoop.registry.zk.root
>   - slider.yarn.queue
>
>
>On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> That is strange, since it is indeed not required to contain anything in
>> slider-client.xml (except <configuration></configuration>) if
>> HADOOP_CONF_DIR has everything that Slider needs. This probably gives an
>> indication that there might be some issue with cluster configuration
>>based
>> on files solely under HADOOP_CONF_DIR to begin with.
>>
>> I suggest you upload all the config files to the JIRA to help debug
>>this
>> further.
>>
>> -Gour
>>
>> On 7/28/16, 4:27 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>
>> >Thanks Gour for prompt reply
>> >
>> >BTW - Creating an empty slider-client.xml (with just
>> ><configuration></configuration>) does not work. The AM starts but
>>fails
>> >to
>> >create any components and shows errors like
>> >
>> >2016-07-28 23:18:46,018
>> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>> > zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
>> >closing socket connection and attempting reconnect
>> >java.net.ConnectException: Connection refused
>> >        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> >        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>> >        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>> >        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>> >
>> >Also, command "slider destroy <app>" fails with zookeeper errors ...
>> >
>> >I had to keep a minimal slider-client.xml. It does not have any RM info
>> >etc. but does contain slider ZK related properties like
>> >"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
>> >"hadoop.registry.zk.root". I haven't yet distilled the absolute minimal
>> >set
>> >of properties required, but this should suffice for now. All RM / HDFS
>> >properties will be read from HADOOP_CONF_DIR files.
>> >
>> >Let me know if this could cause any issues.
>> >
>> >On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com>
>>wrote:
>> >
>> >> No need to copy any files. Pointing HADOOP_CONF_DIR to
>>/etc/hadoop/conf
>> >>is
>> >> good.
>> >>
>> >> -Gour
>> >>
>> >> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>> >>
>> >> >Follow up question regarding Gour's comment in earlier thread -
>> >> >
>> >> >Slider is installed on one of the hadoop nodes. SLIDER_HOME/conf
>> >>directory
>> >> >(say /data/slider/conf) is different than HADOOP_CONF_DIR
>> >> >(/etc/hadoop/conf). Is it required/recommended that files in
>> >> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and slider-env.sh
>>script
>> >> >sets
>> >> >HADOOP_CONF_DIR to /data/slider/conf ?
>> >> >
>> >> >Or can the slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf ,
>> >>without
>> >> >copying the files ?
>> >> >
>> >> >Using slider .80 for now, but would like to know recommendation for
>> >>this
>> >> >and future versions as well.
>> >> >
>> >> >Thanks in advance,
>> >> >
>> >> >Manoj
>> >> >
>> >> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel
>><manojsameltech@gmail.com
>> >
>> >> >wrote:
>> >> >
>> >> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs
>> and
>> >> my
>> >> >> analysis of logs.
>> >> >>
>> >> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha
>><gs...@hortonworks.com>
>> >> >>wrote:
>> >> >>
>> >> >>> Please file a JIRA and upload the logs to it.
>> >> >>>
>> >> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com>
>> >>wrote:
>> >> >>>
>> >> >>> >Hi Gour,
>> >> >>> >
>> >> >>> >Can you please reach me using your own email-id? I will then
>>send
>> >> >>>logs to
>> >> >>> >you, along with my analysis - I don't want to send logs on
>>public
>> >>list
>> >> >>> >
>> >> >>> >Thanks,
>> >> >>> >
>> >> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha
>><gs...@hortonworks.com>
>> >> >>> wrote:
>> >> >>> >
>> >> >>> >> Ok, so this node is not a gateway. It is part of the cluster,
>> >>which
>> >> >>> >>means
>> >> >>> >> you don't need slider-client.xml at all. Just have
>> >>HADOOP_CONF_DIR
>> >> >>> >> pointing to /etc/hadoop/conf in slider-env.sh and that should
>>be
>> >>it.
>> >> >>> >>
>> >> >>> >> So the above simplifies your config setup. It will not solve
>> >>either
>> >> >>>of
>> >> >>> >>the
>> >> >>> >> 2 problems you are facing.
>> >> >>> >>
>> >> >>> >> Now coming to the 2 issues you are facing, you have to provide
>> >> >>> >>additional
>> >> >>> >> logs for us to understand better. Let's start with -
>> >> >>> >> 1. RM logs (specifically between the time when rm1->rm2
>>failover
>> >>is
>> >> >>> >> simulated)
>> >> >>> >> 2. Slider App logs
>> >> >>> >>
>> >> >>> >> -Gour
>> >> >>> >>
>> >> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com>
>> >> wrote:
>> >> >>> >>
>> >> >>> >> >   1. Not clear about your question on "gateway" node. The node
>> >> >>> >> >   running slider is part of the hadoop cluster and there are
>> >> >>> >> >   other services like Oozie that run on this node that utilize
>> >> >>> >> >   hdfs and yarn. So if your question is whether the node is
>> >> >>> >> >   otherwise working for HDFS and Yarn configuration, it is
>> >> >>> >> >   working
>> >> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
>> >> >>> >> >   /etc/hadoop/conf) to the directory containing
>> >> >>> >> >   slider-client.xml (say /data/latest/conf)
>> >> >>> >> >   3. In an earlier email, I had made a mistake where the
>> >> >>> >> >   slider-env.sh file's HADOOP_CONF_DIR was pointing to the
>> >> >>> >> >   original directory /etc/hadoop/conf. I edited it to point to
>> >> >>> >> >   the same directory containing slider-client.xml &
>> >> >>> >> >   slider-env.sh i.e. /data/latest/conf
>> >> >>> >> >   4. I emptied slider-client.xml. It just had the
>> >> >>> >> >   <configuration></configuration>. The creation of apps worked
>> >> >>> >> >   but the Slider AM still shows the same issue, i.e. when RM1
>> >> >>> >> >   goes from active to standby, slider AM goes from RUNNING to
>> >> >>> >> >   ACCEPTED state with the same error about TOKEN. Also NOTE that
>> >> >>> >> >   when slider-client.xml is empty, the "slider destroy xxx"
>> >> >>> >> >   command still fails with Zookeeper connection errors.
>> >> >>> >> >   5. I then added the same parameters (as my last email - except
>> >> >>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
>> >> >>> >> >   slider-env.sh has HADOOP_CONF_DIR pointing to
>> >> >>> >> >   /data/latest/conf and slider-client.xml does not have
>> >> >>> >> >   HADOOP_CONF_DIR. The same issue exists (but "slider destroy"
>> >> >>> >> >   does not fail)
>> >> >>> >> >   6. Could you explain what you expect to pick up from Hadoop
>> >> >>> >> >   configurations that will help with the RM Token? If slider has
>> >> >>> >> >   a token from RM1, and it switches to RM2, it is not clear what
>> >> >>> >> >   slider does to get a delegation token for RM2 communication?
>> >> >>> >> >   7. It is worth repeating again that the issue happens only
>> >> >>> >> >   when RM1 was active when the slider app was created and then
>> >> >>> >> >   RM1 becomes standby. If RM2 was active when the slider app was
>> >> >>> >> >   created, then the slider AM keeps running for any number of
>> >> >>> >> >   switches between RM2 and RM1 back and forth ...
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha
>> >><gs...@hortonworks.com>
>> >> >>> >>wrote:
>> >> >>> >> >
>> >> >>> >> >> The node you are running slider from, is that a gateway
>>node?
>> >> >>>Sorry
>> >> >>> >>for
>> >> >>> >> >> not being explicit. I meant copy everything under
>> >> >>>/etc/hadoop/conf
>> >> >>> >>from
>> >> >>> >> >> your cluster into some temp directory (say
>>/tmp/hadoop_conf)
>> >>in
>> >> >>>your
>> >> >>> >> >> gateway node or local or whichever node you are running
>>slider
>> >> >>>from.
>> >> >>> >> >>Then
>> >> >>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear
>>everything
>> >>out
>> >> >>> from
>> >> >>> >> >> slider-client.xml.
>> >> >>> >> >>
>> >> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel"
>><ma...@gmail.com>
>> >> >>> wrote:
>> >> >>> >> >>
>> >> >>> >> >> >Hi Gour,
>> >> >>> >> >> >
>> >> >>> >> >> >Thanks for your prompt reply.
>> >> >>> >> >> >
>> >> >>> >> >> >FYI, issue happens when I create slider app when rm1 is
>> >>active
>> >> >>>and
>> >> >>> >>when
>> >> >>> >> >> >rm1
>> >> >>> >> >> >fails over to rm2. As soon as rm2 becomes active; the
>>slider
>> >>AM
>> >> >>> goes
>> >> >>> >> >>from
>> >> >>> >> >> >RUNNING to ACCEPTED state with above error.
>> >> >>> >> >> >
>> >> >>> >> >> >For your suggestion, I did following
>> >> >>> >> >> >
>> >> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site
>> >>from
>> >> >>> >> >> >HADOOP_CONF_DIR
>> >> >>> >> >> >to slider conf directory.
>> >> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
>> >> >>> >> >> >3) I removed all properties from slider-client.xml EXCEPT
>> >> >>>following
>> >> >>> >> >> >
>> >> >>> >> >> >   - HADOOP_CONF_DIR
>> >> >>> >> >> >   - slider.yarn.queue
>> >> >>> >> >> >   - slider.zookeeper.quorum
>> >> >>> >> >> >   - hadoop.registry.zk.quorum
>> >> >>> >> >> >   - hadoop.registry.zk.root
>> >> >>> >> >> >   - hadoop.security.authorization
>> >> >>> >> >> >   - hadoop.security.authentication
>> >> >>> >> >> >
>> >> >>> >> >> >Then I made rm1 active, installed and created slider app
>>and
>> >> >>> >>restarted
>> >> >>> >> >>rm1
>> >> >>> >> >> >(to make rm2) active. The slider-am again went from
>>RUNNING
>> >>to
>> >> >>> >>ACCEPTED
>> >> >>> >> >> >state.
>> >> >>> >> >> >
>> >> >>> >> >> >Let me know if you want me to try further changes.
>> >> >>> >> >> >
>> >> >>> >> >> >If I make the slider-client.xml completely empty per your
>> >> >>> >>suggestion,
>> >> >>> >> >>only
>> >> >>> >> >> >slider AM comes up but it
>> >> >>> >> >> >fails to start components. The AM log shows errors trying
>>to
>> >> >>> >>connect to
>> >> >>> >> >> >zookeeper like below.
>> >> >>> >> >> >2016-07-25 23:07:41,532
>> >> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)]
>>WARN
>> >> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null,
>> >>unexpected
>> >> >>> >>error,
>> >> >>> >> >> >closing socket connection and attempting reconnect
>> >> >>> >> >> >java.net.ConnectException: Connection refused
>> >> >>> >> >> >
>> >> >>> >> >> >Hence I kept minimal info in slider-client.xml
>> >> >>> >> >> >
>> >> >>> >> >> >FYI This is slider version 0.80
>> >> >>> >> >> >
>> >> >>> >> >> >Thanks,
>> >> >>> >> >> >
>> >> >>> >> >> >Manoj
>> >> >>> >> >> >
>> >> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
>> >> >>><gs...@hortonworks.com>
>> >> >>> >> >>wrote:
>> >> >>> >> >> >
>> >> >>> >> >> >> If possible, can you copy the entire content of the
>> >>directory
>> >> >>> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in
>> >> >>>slider-env.sh to
>> >> >>> >>it.
>> >> >>> >> >> >>Keep
>> >> >>> >> >> >> slider-client.xml empty.
>> >> >>> >> >> >>
>> >> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
>> >> >>>failovers,
>> >> >>> >>do
>> >> >>> >> >>you
>> >> >>> >> >> >> see the same behaviors?
>> >> >>> >> >> >>
>> >> >>> >> >> >> -Gour
>> >> >>> >> >> >>
>> >> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel"
>> >><ma...@gmail.com>
>> >> >>> >>wrote:
>> >> >>> >> >> >>
>> >> >>> >> >> >> >Another observation (whatever it is worth)
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >If slider app is created and started when rm2 was
>>active,
>> >> >>>then
>> >> >>> it
>> >> >>> >> >> >>seems to
>> >> >>> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >* rm2 is active
>> >> >>> >> >> >> >* create and start slider application
>> >> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
>> >> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >So, it seems if it starts with rm1 active, then the AM
>> >>goes
>> >> >>>to
>> >> >>> >> >> >>"ACCEPTED"
>> >> >>> >> >> >> >state when RM fails to rm2. If it starts with rm2
>>active,
>> >> >>>then
>> >> >>> it
>> >> >>> >> >>runs
>> >> >>> >> >> >> >fine
>> >> >>> >> >> >> >with any switches between rm1 and rm2.
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >Any feedback ?
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >Thanks,
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >Manoj
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>> >> >>> >> >> >><ma...@gmail.com>
>> >> >>> >> >> >> >wrote:
>> >> >>> >> >> >> >
>> >> >>> >> >> >> >> Setup
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
>> >> >>> >> >> >> >> - Slider 0.80
>> >> >>> >> >> >> >> - In my slider-client.xml, I have added all RM HA
>> >> >>>properties,
>> >> >>> >> >> >>including
>> >> >>> >> >> >> >> the ones mentioned in
>> >> >>> >> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> Following is the issue
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> * rm1 is active, rm2 is standby
>> >> >>> >> >> >> >> * deploy and start slider application, it runs fine
>> >> >>> >> >> >> >> * restart rm1, rm2 is now active.
>> >> >>> >> >> >> >> * The slider-am now goes from running into "ACCEPTED"
>> >> >>>mode. It
>> >> >>> >> >>stays
>> >> >>> >> >> >> >>there
>> >> >>> >> >> >> >> till rm1 is made active again.
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> In the slider-am log, it tries to connect to RM2 and
>> >> >>> connection
>> >> >>> >> >>fails
>> >> >>> >> >> >> >>due
>> >> >>> >> >> >> >> to org.apache.hadoop.security.AccessControlException:
>> >> >>>Client
>> >> >>> >> >>cannot
>> >> >>> >> >> >> >> authenticate via:[TOKEN]. See detailed log below
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>  It seems it has some token (delegation token?) for
>>RM1
>> >>but
>> >> >>> >>tries
>> >> >>> >> >>to
>> >> >>> >> >> >>use
>> >> >>> >> >> >> >> same(?) for RM2 and fails. Am I missing some
>> >>configuration
>> >> >>>???
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> Thanks,
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >>
>> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
>> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  retry.RetryInvocationHandler - Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over attempts. Trying to fail over immediately.
>> >> >>> >> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
>> >> >>> >> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>> >> >>> >> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>> >> >>> >> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >> >>> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
>> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>> >> >>> >> >> >> >> Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>> >> >>> >> >> >> >>         ... 12 more
>> >> >>> >> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
>> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
>> >> >>> >> >> >> >>         ... 15 more
>> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1

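[Editor's note: the failure mode discussed in this thread, where the AM holds an AMRMToken honored only by the RM that issued it, can be sketched as below. This is a hypothetical illustration, not Slider or YARN code; all class and function names here are made up.]

```python
# Sketch of the symptom in this thread: an AM token accepted only by the
# RM that issued it makes heartbeats fail forever after a failover.

class ResourceManager:
    def __init__(self, name, active=False):
        self.name = name
        self.active = active
        self.issued = set()

    def issue_amrm_token(self):
        # Stands in for the token the RM hands out when the AM registers.
        token = "amrm-token-from-" + self.name
        self.issued.add(token)
        return token

    def allocate(self, token):
        if not self.active:
            raise ConnectionError(self.name + " is standby")
        if token not in self.issued:
            # Mirrors "Client cannot authenticate via:[TOKEN]" in the AM log.
            raise PermissionError(self.name + " rejects " + token)
        return "allocated"


def try_allocate(rms, token, max_attempts=6):
    # Cycles through RMs the way a failover proxy provider would retry.
    last = None
    for attempt in range(max_attempts):
        rm = rms[attempt % len(rms)]
        try:
            return rm.allocate(token)
        except (ConnectionError, PermissionError) as e:
            last = e
    raise RuntimeError("all failover attempts failed: %s" % last)


rm1 = ResourceManager("rm1", active=True)
rm2 = ResourceManager("rm2")
token = rm1.issue_amrm_token()      # AM registered while rm1 was active

print(try_allocate([rm1, rm2], token))   # -> allocated

# Failover: rm1 restarts and rm2 becomes active. If the new active RM
# does not honor tokens issued by the old one, every heartbeat fails and
# the AM stays stuck (the "ACCEPTED" symptom) until rm1 is active again.
rm1.active, rm2.active = False, True
try:
    try_allocate([rm1, rm2], token)
except RuntimeError as e:
    print("stuck:", e)
```

This mirrors observation 7 in the thread: the token works for any number of failovers only if every RM it can fail over to accepts it.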

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Following slider specific properties are at present added in
/data/slider/conf/slider-client.xml. If you think they should be picked up
from HADOOP_CONF_DIR (/etc/hadoop/conf) file, which file in HADOOP_CONF_DIR
should these be added ?

   - slider.zookeeper.quorum
   - hadoop.registry.zk.quorum
   - hadoop.registry.zk.root
   - slider.yarn.queue

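[Editor's note: the "Session 0x0 for server null" / localhost:2181 errors earlier in the thread suggest the registry quorum was defaulting to localhost whenever slider-client.xml was empty. A minimal slider-client.xml along the lines described above might look like this sketch; the quorum hosts, root path, and queue name are placeholders, not values from this thread.]

```xml
<!-- Sketch only: replace the placeholder hosts, path and queue with your cluster's values. -->
<configuration>
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.root</name>
    <value>/registry</value>
  </property>
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <property>
    <name>slider.yarn.queue</name>
    <value>default</value>
  </property>
</configuration>
```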

On Thu, Jul 28, 2016 at 4:37 PM, Gour Saha <gs...@hortonworks.com> wrote:

> That is strange, since it is indeed not required to contain anything in
> slider-client.xml (except <configuration></configuration>) if
> HADOOP_CONF_DIR has everything that Slider needs. This probably gives an
> indication that there might be some issue with cluster configuration based
> on files solely under HADOOP_CONF_DIR to begin with.
>
> I suggest you upload all the config files to the jira to help debug this
> further.
>
> -Gour
>
> On 7/28/16, 4:27 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>
> >Thanks Gour for prompt reply
> >
> >BTW - Creating an empty slider-client.xml (with just
> ><configuration></configuration>) does not work. The AM starts but fails
> >to
> >create any components and shows errors like
> >
> >2016-07-28 23:18:46,018
> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> > zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
> >closing socket connection and attempting reconnect
> >java.net.ConnectException: Connection refused
> >        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> >        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
> >        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
> >
> >Also, command "slider destroy <app>" fails with zookeeper errors ...
> >
> >I had to keep a minimal slider-client.xml. It does not have any RM info
> >etc. but does contain slider ZK related properties like
> >"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
> >"hadoop.registry.zk.root". I haven't yet distilled the absolute minimal
> >set
> >of properties required, but this should suffice for now. All RM / HDFS
> >properties will be read from HADOOP_CONF_DIR files.
> >
> >Let me know if this could cause any issues.
> >
> >On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> >> No need to copy any files. Pointing HADOOP_CONF_DIR to /etc/hadoop/conf
> >>is
> >> good.
> >>
> >> -Gour
> >>
> >> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >>
> >> >Follow up question regarding Gour's comment in earlier thread -
> >> >
> >> >Slider is installed on one of the hadoop nodes. SLIDER_HOME/conf
> >>directory
> >> >(say /data/slider/conf) is different than HADOOP_CONF_DIR
> >> >(/etc/hadoop/conf). Is it required/recommended that files in
> >> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and slider-env.sh script
> >> >sets
> >> >HADOOP_CONF_DIR to /data/slider/conf ?
> >> >
> >> >Or can the slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf ,
> >>without
> >> >copying the files ?
> >> >
> >> >Using slider .80 for now, but would like to know recommendation for
> >>this
> >> >and future versions as well.
> >> >
> >> >Thanks in advance,
> >> >
> >> >Manoj
> >> >
> >> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel <manojsameltech@gmail.com
> >
> >> >wrote:
> >> >
> >> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs
> and
> >> my
> >> >> analysis of logs.
> >> >>
> >> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha <gs...@hortonworks.com>
> >> >>wrote:
> >> >>
> >> >>> Please file a JIRA and upload the logs to it.
> >> >>>
> >> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com>
> >>wrote:
> >> >>>
> >> >>> >Hi Gour,
> >> >>> >
> >> >>> >Can you please reach me using your own email-id? I will then send
> >> >>>logs to
> >> >>> >you, along with my analysis - I don't want to send logs on public
> >>list
> >> >>> >
> >> >>> >Thanks,
> >> >>> >
> >> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com>
> >> >>> wrote:
> >> >>> >
> >> >>> >> Ok, so this node is not a gateway. It is part of the cluster,
> >>which
> >> >>> >>means
> >> >>> >> you don't need slider-client.xml at all. Just have
> >>HADOOP_CONF_DIR
> >> >>> >> pointing to /etc/hadoop/conf in slider-env.sh and that should be
> >>it.
> >> >>> >>
> >> >>> >> So the above simplifies your config setup. It will not solve
> >>either
> >> >>>of
> >> >>> >>the
> >> >>> >> 2 problems you are facing.
> >> >>> >>
> >> >>> >> Now coming to the 2 issues you are facing, you have to provide
> >> >>> >>additional
> >> >>> >> logs for us to understand better. Let's start with -
> >> >>> >> 1. RM logs (specifically between the time when rm1->rm2 failover
> >>is
> >> >>> >> simulated)
> >> >>> >> 2. Slider App logs
> >> >>> >>
> >> >>> >> -Gour
> >> >>> >>
> >> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com>
> >> wrote:
> >> >>> >>
> >> >>> >> >   1. Not clear about your question on "gateway" node. The node
> >> >>> running
> >> >>> >> >   slider is part of the hadoop cluster and there are other
> >> >>>services
> >> >>> >>like
> >> >>> >> >   Oozie that run on this node that utilizes hdfs and yarn. So
> >>if
> >> >>>your
> >> >>> >> >   question is whether the node is otherwise working for HDFS
> >>and
> >> >>>Yarn
> >> >>> >> >   configuration, it is working
> >> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
> >> >>>/etc/hadoop/conf)
> >> >>> to
> >> >>> >> >the
> >> >>> >> >   directory containing slider-client.xml (say
> >>/data/latest/conf)
> >> >>> >> >   3. In an earlier email, I had made a mistake where slider-env.sh
> >> >>>file
> >> >>> >> >HADOOP_CONF_DIR
> >> >>> >> >   was pointing to original directory /etc/hadoop/conf. I edited
> >> >>>it to
> >> >>> >> >   point to same directory containing slider-client.xml &
> >> >>> slider-env.sh
> >> >>> >> >i.e.
> >> >>> >> >   /data/latest/conf
> >> >>> >> >   4. I emptied slider-client.xml. It just had the
> >> >>> >> ><configuration></configuration>.
> >> >>> >> >   The creation of apps worked but the Slider AM still shows the
> >> >>>same
> >> >>> >> >issue.
> >> >>> >> >   i.e. when RM1 goes from active to standby, slider AM goes
> >>from
> >> >>> >>RUNNING
> >> >>> >> >to
> >> >>> >> >   ACCEPTED state with same error about TOKEN. Also NOTE that
> >>when
> >> >>> >> >   slider-client.xml is empty, the "slider destroy xxx" command
> >> >>>still
> >> >>> >> >fails
> >> >>> >> >   with Zookeeper connection errors.
> >> >>> >> >   5. I then added same parameters (as my last email - except
> >> >>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
> >> >>> >>slider-env.sh
> >> >>> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
> >> >>> >>slider-client.xml
> >> >>> >> >   does not have HADOOP_CONF_DIR. The same issue exists (but
> >> >>>"slider
> >> >>> >> >   destroy" does not fail)
> >> >>> >> >   6. Could you explain what do you expect to pick up from
> >>Hadoop
> >> >>> >> >   configurations that will help you in RM Token ? If slider has
> >> >>>token
> >> >>> >> >from
> >> >>> >> >   RM1, and it switches to RM2, not clear what slider does to
> >>get
> >> >>> >> >delegation
> >> >>> >> >   token for RM2 communication ?
> >> >>> >> >   7. It is worth repeating again that issue happens only when
> >>RM1
> >> >>>was
> >> >>> >> >   active when slider app was created and then RM1 becomes
> >> >>>standby. If
> >> >>> >> >RM2 was
> >> >>> >> >   active when slider app was created, then slider AM keeps
> >>running
> >> >>> for
> >> >>> >> >any
> >> >>> >> >   number of switches between RM2 and RM1 back and forth ...
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha
> >><gs...@hortonworks.com>
> >> >>> >>wrote:
> >> >>> >> >
> >> >>> >> >> The node you are running slider from, is that a gateway node?
> >> >>>Sorry
> >> >>> >>for
> >> >>> >> >> not being explicit. I meant copy everything under
> >> >>>/etc/hadoop/conf
> >> >>> >>from
> >> >>> >> >> your cluster into some temp directory (say /tmp/hadoop_conf)
> >>in
> >> >>>your
> >> >>> >> >> gateway node or local or whichever node you are running slider
> >> >>>from.
> >> >>> >> >>Then
> >> >>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything
> >>out
> >> >>> from
> >> >>> >> >> slider-client.xml.
> >> >>> >> >>
> >> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com>
> >> >>> wrote:
> >> >>> >> >>
> >> >>> >> >> >Hi Gour,
> >> >>> >> >> >
> >> >>> >> >> >Thanks for your prompt reply.
> >> >>> >> >> >
> >> >>> >> >> >FYI, issue happens when I create slider app when rm1 is
> >>active
> >> >>>and
> >> >>> >>when
> >> >>> >> >> >rm1
> >> >>> >> >> >fails over to rm2. As soon as rm2 becomes active; the slider
> >>AM
> >> >>> goes
> >> >>> >> >>from
> >> >>> >> >> >RUNNING to ACCEPTED state with above error.
> >> >>> >> >> >
> >> >>> >> >> >For your suggestion, I did following
> >> >>> >> >> >
> >> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site
> >>from
> >> >>> >> >> >HADOOP_CONF_DIR
> >> >>> >> >> >to slider conf directory.
> >> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
> >> >>> >> >> >3) I removed all properties from slider-client.xml EXCEPT
> >> >>>following
> >> >>> >> >> >
> >> >>> >> >> >   - HADOOP_CONF_DIR
> >> >>> >> >> >   - slider.yarn.queue
> >> >>> >> >> >   - slider.zookeeper.quorum
> >> >>> >> >> >   - hadoop.registry.zk.quorum
> >> >>> >> >> >   - hadoop.registry.zk.root
> >> >>> >> >> >   - hadoop.security.authorization
> >> >>> >> >> >   - hadoop.security.authentication
> >> >>> >> >> >
> >> >>> >> >> >Then I made rm1 active, installed and created slider app and
> >> >>> >>restarted
> >> >>> >> >>rm1
> >> >>> >> >> >(to make rm2) active. The slider-am again went from RUNNING
> >>to
> >> >>> >>ACCEPTED
> >> >>> >> >> >state.
> >> >>> >> >> >
> >> >>> >> >> >Let me know if you want me to try further changes.
> >> >>> >> >> >
> >> >>> >> >> >If I make the slider-client.xml completely empty per your
> >> >>> >>suggestion,
> >> >>> >> >>only
> >> >>> >> >> >slider AM comes up but it
> >> >>> >> >> >fails to start components. The AM log shows errors trying to
> >> >>> >>connect to
> >> >>> >> >> >zookeeper like below.
> >> >>> >> >> >2016-07-25 23:07:41,532
> >> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> >> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null,
> >>unexpected
> >> >>> >>error,
> >> >>> >> >> >closing socket connection and attempting reconnect
> >> >>> >> >> >java.net.ConnectException: Connection refused
> >> >>> >> >> >
> >> >>> >> >> >Hence I kept minimal info in slider-client.xml
> >> >>> >> >> >
> >> >>> >> >> >FYI This is slider version 0.80
> >> >>> >> >> >
> >> >>> >> >> >Thanks,
> >> >>> >> >> >
> >> >>> >> >> >Manoj
> >> >>> >> >> >
> >> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
> >> >>><gs...@hortonworks.com>
> >> >>> >> >>wrote:
> >> >>> >> >> >
> >> >>> >> >> >> If possible, can you copy the entire content of the
> >>directory
> >> >>> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in
> >> >>>slider-env.sh to
> >> >>> >>it.
> >> >>> >> >> >>Keep
> >> >>> >> >> >> slider-client.xml empty.
> >> >>> >> >> >>
> >> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
> >> >>>failovers,
> >> >>> >>do
> >> >>> >> >>you
> >> >>> >> >> >> see the same behaviors?
> >> >>> >> >> >>
> >> >>> >> >> >> -Gour
> >> >>> >> >> >>
> >> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel"
> >><ma...@gmail.com>
> >> >>> >>wrote:
> >> >>> >> >> >>
> >> >>> >> >> >> >Another observation (whatever it is worth)
> >> >>> >> >> >> >
> >> >>> >> >> >> >If slider app is created and started when rm2 was active,
> >> >>>then
> >> >>> it
> >> >>> >> >> >>seems to
> >> >>> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
> >> >>> >> >> >> >
> >> >>> >> >> >> >* rm2 is active
> >> >>> >> >> >> >* create and start slider application
> >> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
> >> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
> >> >>> >> >> >> >
> >> >>> >> >> >> >So, it seems if it starts with rm1 active, then the AM
> >>goes
> >> >>>to
> >> >>> >> >> >>"ACCEPTED"
> >> >>> >> >> >> >state when RM fails to rm2. If it starts with rm2 active,
> >> >>>then
> >> >>> it
> >> >>> >> >>runs
> >> >>> >> >> >> >fine
> >> >>> >> >> >> >with any switches between rm1 and rm2.
> >> >>> >> >> >> >
> >> >>> >> >> >> >Any feedback ?
> >> >>> >> >> >> >
> >> >>> >> >> >> >Thanks,
> >> >>> >> >> >> >
> >> >>> >> >> >> >Manoj
> >> >>> >> >> >> >
> >> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
> >> >>> >> >> >><ma...@gmail.com>
> >> >>> >> >> >> >wrote:
> >> >>> >> >> >> >
> >> >>> >> >> >> >> Setup
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >> >>> >> >> >> >> - Slider 0.80
> >> >>> >> >> >> >> - In my slider-client.xml, I have added all RM HA
> >> >>>properties,
> >> >>> >> >> >>including
> >> >>> >> >> >> >> the ones mentioned in
> >> >>> >> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> Following is the issue
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> * rm1 is active, rm2 is standby
> >> >>> >> >> >> >> * deploy and start slider application, it runs fine
> >> >>> >> >> >> >> * restart rm1, rm2 is now active.
> >> >>> >> >> >> >> * The slider-am now goes from running into "ACCEPTED"
> >> >>>mode. It
> >> >>> >> >>stays
> >> >>> >> >> >> >>there
> >> >>> >> >> >> >> till rm1 is made active again.
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> In the slider-am log, it tries to connect to RM2 and
> >> >>> connection
> >> >>> >> >>fails
> >> >>> >> >> >> >>due
> >> >>> >> >> >> >> to org.apache.hadoop.security.AccessControlException:
> >> >>>Client
> >> >>> >> >>cannot
> >> >>> >> >> >> >> authenticate via:[TOKEN]. See detailed log below
> >> >>> >> >> >> >>
> >> >>> >> >> >> >>  It seems it has some token (delegation token?) for RM1
> >>but
> >> >>> >>tries
> >> >>> >> >>to
> >> >>> >> >> >>use
> >> >>> >> >> >> >> same(?) for RM2 and fails. Am I missing some
> >>configuration
> >> >>>???
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> Thanks,
> >> >>> >> >> >> >>
> >> >>> >> >> >> >>
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  retry.RetryInvocationHandler - Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over attempts. Trying to fail over immediately.
> >> >>> >> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
> >> >>> >> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >>> >> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> >> >>> >> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >>> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> >> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> >> >>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> >> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> >> >>> >> >> >> >> Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> >> >>> >> >> >> >>         ... 12 more
> >> >>> >> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
> >> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
> >> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
> >> >>> >> >> >> >>         ... 15 more
> >> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
> >> >>> >> >> >> >>
> >> >>> >> >> >>
> >> >>> >> >> >>
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >>
> >> >>> >>
> >> >>>
> >> >>>
> >> >>
> >>
> >>
>
>

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Gour Saha <gs...@hortonworks.com>.
That is strange, since slider-client.xml is indeed not required to contain
anything (beyond an empty <configuration></configuration>) when
HADOOP_CONF_DIR has everything that Slider needs. This suggests there may be
some issue with the cluster configuration files under HADOOP_CONF_DIR to
begin with.

I suggest you upload all the config files to the JIRA to help debug this
further.

-Gour
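
A quick way to rule out missing HA settings before uploading everything is to
grep the yarn-site.xml under HADOOP_CONF_DIR for the standard YARN RM HA keys.
The sketch below builds a throwaway conf dir purely for illustration; on a
real node you would set conf_dir=/etc/hadoop/conf instead, and the key list
could be extended as needed.

```shell
# Sketch: check that yarn-site.xml under a conf dir carries the standard
# YARN RM HA keys. A throwaway conf dir is created here for illustration;
# on a real node, set conf_dir=/etc/hadoop/conf instead.
conf_dir=$(mktemp -d)
cat > "$conf_dir/yarn-site.xml" <<'EOF'
<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>yarn-cluster</value></property>
</configuration>
EOF
for key in yarn.resourcemanager.ha.enabled \
           yarn.resourcemanager.ha.rm-ids \
           yarn.resourcemanager.cluster-id; do
  if grep -q "$key" "$conf_dir/yarn-site.xml"; then
    echo "ok: $key"
  else
    echo "MISSING: $key"
  fi
done
```

A MISSING line would point at exactly the kind of configuration gap suspected
above.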

On 7/28/16, 4:27 PM, "Manoj Samel" <ma...@gmail.com> wrote:

>Thanks Gour for prompt reply
>
>BTW - Creating an empty slider-client.xml (with just
><configuration></configuration>) does not work. The AM starts but fails to
>create any components and shows errors like
>
>2016-07-28 23:18:46,018
>[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
>closing socket connection and attempting reconnect
>java.net.ConnectException: Connection refused
>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>
>Also, command "slider destroy <app>" fails with zookeeper errors ...
>
>I had to keep a minimal slider-client.xml. It does not have any RM info
>etc. but does contain slider ZK related properties like
>"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
>"hadoop.registry.zk.root". I haven't yet distilled the absolute minimal
>set
>of properties required, but this should suffice for now. All RM / HDFS
>properties will be read from HADOOP_CONF_DIR files.
>
>Let me know if this could cause any issues.
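
For reference, a minimal slider-client.xml of the shape described above might
look like the following. The quorum hosts and registry root are placeholders,
not values from this cluster; the thread above observed connection attempts to
localhost:2181 when these properties were absent.

```xml
<configuration>
  <!-- ZooKeeper quorum used by Slider itself (placeholder hosts) -->
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <!-- YARN registry settings used by the Slider AM (placeholder values) -->
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.root</name>
    <value>/registry</value>
  </property>
</configuration>
```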
>
>On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> No need to copy any files. Pointing HADOOP_CONF_DIR to /etc/hadoop/conf
>>is
>> good.
>>
>> -Gour
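
In that case, slider-env.sh only needs the one export. A minimal sketch is
below; the JAVA_HOME default shown is an assumption, not a value from this
cluster.

```shell
# slider-env.sh (sketch): point Slider at the cluster's Hadoop client
# configuration in place, with no copying of files.
export HADOOP_CONF_DIR=/etc/hadoop/conf
# JAVA_HOME is typically also set here; the fallback path is an assumption.
export JAVA_HOME=${JAVA_HOME:-/usr/java/default}
```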
>>
>> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>
>> >Follow up question regarding Gour's comment in earlier thread -
>> >
>> >Slider is installed on one of the hadoop nodes. SLIDER_HOME/conf
>>directory
>> >(say /data/slider/conf) is different than HADOOP_CONF_DIR
>> >(/etc/hadoop/conf). Is it required/recommended that files in
>> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and slider-env.sh script
>> >sets
>> >HADOOP_CONF_DIR to /data/slider/conf ?
>> >
>> >Or can the slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf ,
>>without
>> >copying the files ?
>> >
>> >Using slider .80 for now, but would like to know recommendation for
>>this
>> >and future versions as well.
>> >
>> >Thanks in advance,
>> >
>> >Manoj
>> >
>> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel <ma...@gmail.com>
>> >wrote:
>> >
>> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs and
>> my
>> >> analysis of logs.
>> >>
>> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha <gs...@hortonworks.com>
>> >>wrote:
>> >>
>> >>> Please file a JIRA and upload the logs to it.
>> >>>
>> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com>
>>wrote:
>> >>>
>> >>> >Hi Gour,
>> >>> >
>> >>> >Can you please reach me using your own email-id? I will then send
>> >>>logs to
>> >>> >you, along with my analysis - I don't want to send logs on public
>>list
>> >>> >
>> >>> >Thanks,
>> >>> >
>> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com>
>> >>> wrote:
>> >>> >
>> >>> >> Ok, so this node is not a gateway. It is part of the cluster,
>>which
>> >>> >>means
>> >>> >> you don¹t need slider-client.xml at all. Just have
>>HADOOP_CONF_DIR
>> >>> >> pointing to /etc/hadoop/conf in slider-env.sh and that should be
>>it.
>> >>> >>
>> >>> >> So the above simplifies your config setup. It will not solve
>>either
>> >>>of
>> >>> >>the
>> >>> >> 2 problems you are facing.
>> >>> >>
>> >>> >> Now coming to the 2 issues you are facing, you have to provide
>> >>> >>additional
>> >>> >> logs for us to understand better. Let¹s start with  -
>> >>> >> 1. RM logs (specifically between the time when rm1->rm2 failover
>>is
>> >>> >> simulated)
>> >>> >> 2. Slider App logs
>> >>> >>
>> >>> >> -Gour
>> >>> >>
>> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com>
>> wrote:
>> >>> >>
>> >>> >> >   1. Not clear about your question on "gateway" node. The node
>> >>> running
>> >>> >> >   slider is part of the hadoop cluster and there are other
>> >>>services
>> >>> >>like
>> >>> >> >   Oozie that run on this node that utilizes hdfs and yarn. So
>>if
>> >>>your
>> >>> >> >   question is whether the node is otherwise working for HDFS
>>and
>> >>>Yarn
>> >>> >> >   configuration, it is working
>> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
>> >>>/etc/hadoop/conf)
>> >>> to
>> >>> >> >the
>> >>> >> >   directory containing slider-client.xml (say
>>/data/latest/conf)
>> >>> >> >   3. In earlier email, I had done a mistake where slider-env.sh
>> >>>file
>> >>> >> >HADOOP_CONF_DIR
>> >>> >> >   was pointing to original directory /etc/hadoop/conf. I edited
>> >>>it to
>> >>> >> >   point to same directory containing slider-client.xml &
>> >>> slider-env.sh
>> >>> >> >i.e.
>> >>> >> >   /data/latest/conf
>> >>> >> >   4. I emptied slider-client.xml. It just had the
>> >>> >> ><configuration></configuration>.
>> >>> >> >   The creation of the app worked but the Slider AM still shows the
>> >>>same
>> >>> >> >issue.
>> >>> >> >   i.e. when RM1 goes from active to standby, slider AM goes
>>from
>> >>> >>RUNNING
>> >>> >> >to
>> >>> >> >   ACCEPTED state with same error about TOKEN. Also NOTE that
>>when
>> >>> >> >   slider-client.xml is empty, the "slider destroy xxx" command
>> >>>still
>> >>> >> >fails
>> >>> >> >   with Zookeeper connection errors.
>> >>> >> >   5. I then added same parameters (as my last email - except
>> >>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
>> >>> >>slider-env.sh
>> >>> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
>> >>> >>slider-client.xml
>> >>> >> >   does not have HADOOP_CONF_DIR. The same issue exists (but
>> >>>"slider
>> >>> >> >   destroy" does not fail)
>> >>> >> >   6. Could you explain what you expect to pick up from
>>Hadoop
>> >>> >> >   configurations that will help you in RM Token ? If slider has
>> >>>token
>> >>> >> >from
>> >>> >> >   RM1, and it switches to RM2, not clear what slider does to
>>get
>> >>> >> >delegation
>> >>> >> >   token for RM2 communication ?
>> >>> >> >   7. It is worth repeating again that issue happens only when
>>RM1
>> >>>was
>> >>> >> >   active when slider app was created and then RM1 becomes
>> >>>standby. If
>> >>> >> >RM2 was
>> >>> >> >   active when slider app was created, then slider AM keeps
>>running
>> >>> for
>> >>> >> >any
>> >>> >> >   number of switches between RM2 and RM1 back and forth ...
>> >>> >> >
>> >>> >> >
>> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha
>><gs...@hortonworks.com>
>> >>> >>wrote:
>> >>> >> >
>> >>> >> >> The node you are running slider from, is that a gateway node?
>> >>>Sorry
>> >>> >>for
>> >>> >> >> not being explicit. I meant copy everything under
>> >>>/etc/hadoop/conf
>> >>> >>from
>> >>> >> >> your cluster into some temp directory (say /tmp/hadoop_conf)
>>in
>> >>>your
>> >>> >> >> gateway node or local or whichever node you are running slider
>> >>>from.
>> >>> >> >>Then
>> >>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything
>>out
>> >>> from
>> >>> >> >> slider-client.xml.
>> >>> >> >>
>> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com>
>> >>> wrote:
>> >>> >> >>
>> >>> >> >> >Hi Gour,
>> >>> >> >> >
>> >>> >> >> >Thanks for your prompt reply.
>> >>> >> >> >
>> >>> >> >> >FYI, issue happens when I create slider app when rm1 is
>>active
>> >>>and
>> >>> >>when
>> >>> >> >> >rm1
>> >>> >> >> >fails over to rm2. As soon as rm2 becomes active; the slider
>>AM
>> >>> goes
>> >>> >> >>from
>> >>> >> >> >RUNNING to ACCEPTED state with above error.
>> >>> >> >> >
>> >>> >> >> >For your suggestion, I did following
>> >>> >> >> >
>> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site
>>from
>> >>> >> >> >HADOOP_CONF_DIR
>> >>> >> >> >to slider conf directory.
>> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
>> >>> >> >> >3) I removed all properties from slider-client.xml EXCEPT
>> >>>following
>> >>> >> >> >
>> >>> >> >> >   - HADOOP_CONF_DIR
>> >>> >> >> >   - slider.yarn.queue
>> >>> >> >> >   - slider.zookeeper.quorum
>> >>> >> >> >   - hadoop.registry.zk.quorum
>> >>> >> >> >   - hadoop.registry.zk.root
>> >>> >> >> >   - hadoop.security.authorization
>> >>> >> >> >   - hadoop.security.authentication
>> >>> >> >> >
>> >>> >> >> >Then I made rm1 active, installed and created slider app and
>> >>> >>restarted
>> >>> >> >>rm1
>> >>> >> >> >(to make rm2) active. The slider-am again went from RUNNING
>>to
>> >>> >>ACCEPTED
>> >>> >> >> >state.
>> >>> >> >> >
>> >>> >> >> >Let me know if you want me to try further changes.
>> >>> >> >> >
>> >>> >> >> >If I make the slider-client.xml completely empty per your
>> >>> >>suggestion,
>> >>> >> >>only
>> >>> >> >> >slider AM comes up but it
>> >>> >> >> >fails to start components. The AM log shows errors trying to
>> >>> >>connect to
>> >>> >> >> >zookeeper like below.
>> >>> >> >> >2016-07-25 23:07:41,532
>> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null,
>>unexpected
>> >>> >>error,
>> >>> >> >> >closing socket connection and attempting reconnect
>> >>> >> >> >java.net.ConnectException: Connection refused
>> >>> >> >> >
>> >>> >> >> >Hence I kept minimal info in slider-client.xml
>> >>> >> >> >
>> >>> >> >> >FYI This is slider version 0.80
>> >>> >> >> >
>> >>> >> >> >Thanks,
>> >>> >> >> >
>> >>> >> >> >Manoj
>> >>> >> >> >
>> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
>> >>><gs...@hortonworks.com>
>> >>> >> >>wrote:
>> >>> >> >> >
>> >>> >> >> >> If possible, can you copy the entire content of the
>>directory
>> >>> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in
>> >>>slider-env.sh to
>> >>> >>it.
>> >>> >> >> >>Keep
>> >>> >> >> >> slider-client.xml empty.
>> >>> >> >> >>
>> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
>> >>>failovers,
>> >>> >>do
>> >>> >> >>you
>> >>> >> >> >> see the same behaviors?
>> >>> >> >> >>
>> >>> >> >> >> -Gour
>> >>> >> >> >>
>> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel"
>><ma...@gmail.com>
>> >>> >>wrote:
>> >>> >> >> >>
>> >>> >> >> >> >Another observation (whatever it is worth)
>> >>> >> >> >> >
>> >>> >> >> >> >If slider app is created and started when rm2 was active,
>> >>>then
>> >>> it
>> >>> >> >> >>seems to
>> >>> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
>> >>> >> >> >> >
>> >>> >> >> >> >* rm2 is active
>> >>> >> >> >> >* create and start slider application
>> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
>> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
>> >>> >> >> >> >
>> >>> >> >> >> >So, it seems if it starts with rm1 active, then the AM
>>goes
>> >>>to
>> >>> >> >> >>"ACCEPTED"
>> >>> >> >> >> >state when RM fails to rm2. If it starts with rm2 active,
>> >>>then
>> >>> it
>> >>> >> >>runs
>> >>> >> >> >> >fine
>> >>> >> >> >> >with any switches between rm1 and rm2.
>> >>> >> >> >> >
>> >>> >> >> >> >Any feedback ?
>> >>> >> >> >> >
>> >>> >> >> >> >Thanks,
>> >>> >> >> >> >
>> >>> >> >> >> >Manoj
>> >>> >> >> >> >
>> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>> >>> >> >> >><ma...@gmail.com>
>> >>> >> >> >> >wrote:
>> >>> >> >> >> >
>> >>> >> >> >> >> Setup
>> >>> >> >> >> >>
>> >>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
>> >>> >> >> >> >> - Slider 0.80
>> >>> >> >> >> >> - In my slider-client.xml, I have added all RM HA
>> >>>properties,
>> >>> >> >> >>including
>> >>> >> >> >> >> the ones mentioned in
>> >>> >> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
>> >>> >> >> >> >>
>> >>> >> >> >> >> Following is the issue
>> >>> >> >> >> >>
>> >>> >> >> >> >> * rm1 is active, rm2 is standby
>> >>> >> >> >> >> * deploy and start slider application, it runs fine
>> >>> >> >> >> >> * restart rm1, rm2 is now active.
>> >>> >> >> >> >> * The slider-am now goes from running into "ACCEPTED"
>> >>>mode. It
>> >>> >> >>stays
>> >>> >> >> >> >>there
>> >>> >> >> >> >> till rm1 is made active again.
>> >>> >> >> >> >>
>> >>> >> >> >> >> In the slider-am log, it tries to connect to RM2 and
>> >>> connection
>> >>> >> >>fails
>> >>> >> >> >> >>due
>> >>> >> >> >> >> to org.apache.hadoop.security.AccessControlException:
>> >>>Client
>> >>> >> >>cannot
>> >>> >> >> >> >> authenticate via:[TOKEN]. See detailed log below
>> >>> >> >> >> >>
>> >>> >> >> >> >>  It seems it has some token (delegation token?) for RM1
>>but
>> >>> >>tries
>> >>> >> >>to
>> >>> >> >> >>use
>> >>> >> >> >> >> same(?) for RM2 and fails. Am I missing some
>>configuration
>> >>>???
>> >>> >> >> >> >>
>> >>> >> >> >> >> Thanks,
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
>> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  retry.RetryInvocationHandler - Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over attempts. Trying to fail over immediately.
>> >>> >> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
>> >>> >> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>> >>> >> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>> >>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>> >>> >> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>> >>> >> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>> >>> >> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >>> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
>> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>> >>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate
>>>>>>>>>>>>>(A
>> >>>>>>>>>>>MRM
>> >>> >>>>>>>>Cl
>> >>> >> >>>>>>ie
>> >>> >> >> >>>>nt
>> >>> >> >> >> >>Impl.java:278)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncIm
>>>>>>>>>>>>>pl
>> >>>>>>>>>>>$He
>> >>> >>>>>>>>ar
>> >>> >> >>>>>>tb
>> >>> >> >> >>>>ea
>> >>> >> >> >> >>tThread.run(AMRMClientAsyncImpl.java:224)
>> >>> >> >> >> >> Caused by: java.io.IOException:
>> >>> >> >> >> >> org.apache.hadoop.security.AccessControlException:
>>Client
>> >>> >>cannot
>> >>> >> >> >> >> authenticate via:[TOKEN]
>> >>> >> >> >> >>         at
>> >>> >> >> >>
>> >>>>>org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>> >>> >> >> >> >>         at
>> >>>java.security.AccessController.doPrivileged(Native
>> >>> >> >>Method)
>> >>> >> >> >> >>         at
>> >>>javax.security.auth.Subject.doAs(Subject.java:422)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroup
>>>>>>>>>>>>>In
>> >>>>>>>>>>>for
>> >>> >>>>>>>>ma
>> >>> >> >>>>>>ti
>> >>> >> >> >>>>on
>> >>> >> >> >> >>.java:1671)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFa
>>>>>>>>>>>>>il
>> >>>>>>>>>>>ure
>> >>> >>>>>>>>(C
>> >>> >> >>>>>>li
>> >>> >> >> >>>>en
>> >>> >> >> >> >>t.java:645)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.ja
>>>>>>>>>>>va
>> >>>>>>>>>:73
>> >>> >>>>>>3)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >>
>> 
>>>>>>>org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>> >>> >> >> >> >>         at
>> >>> >> >> >>org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>> >>> >> >> >> >>         at
>> >>>org.apache.hadoop.ipc.Client.call(Client.java:1442)
>> >>> >> >> >> >>         ... 12 more
>> >>> >> >> >> >> Caused by:
>> >>>org.apache.hadoop.security.AccessControlException:
>> >>> >> >>Client
>> >>> >> >> >> >> cannot authenticate via:[TOKEN]
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.security.SaslRpcClient.selectSaslClient(Sasl
>>>>>>>>>>>>>Rp
>> >>>>>>>>>>>cCl
>> >>> >>>>>>>>ie
>> >>> >> >>>>>>nt
>> >>> >> >> >>>>.j
>> >>> >> >> >> >>ava:172)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcCl
>>>>>>>>>>>>>ie
>> >>>>>>>>>>>nt.
>> >>> >>>>>>>>ja
>> >>> >> >>>>>>va
>> >>> >> >> >>>>:3
>> >>> >> >> >> >>96)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Cl
>>>>>>>>>>>>>ie
>> >>>>>>>>>>>nt.
>> >>> >>>>>>>>ja
>> >>> >> >>>>>>va
>> >>> >> >> >>>>:5
>> >>> >> >> >> >>55)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >>
>> 
>>>>>>>org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>> >>> >> >> >> >>         at
>> >>> >> >> >>
>> >>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>> >>> >> >> >> >>         at
>> >>> >> >> >>
>> >>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>> >>> >> >> >> >>         at
>> >>>java.security.AccessController.doPrivileged(Native
>> >>> >> >>Method)
>> >>> >> >> >> >>         at
>> >>>javax.security.auth.Subject.doAs(Subject.java:422)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroup
>>>>>>>>>>>>>In
>> >>>>>>>>>>>for
>> >>> >>>>>>>>ma
>> >>> >> >>>>>>ti
>> >>> >> >> >>>>on
>> >>> >> >> >> >>.java:1671)
>> >>> >> >> >> >>         at
>> >>> >> >> >> >>
>> >>> >> >>
>> >>> >>
>> >>>
>> >>>
>> 
>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.ja
>>>>>>>>>>>va
>> >>>>>>>>>:72
>> >>> >>>>>>0)
>> >>> >> >> >> >>         ... 15 more
>> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>> >>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing
>>over to
>> >>> rm1
>> >>> >> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >>
>> >>> >>
>> >>>
>> >>>
>> >>
>>
>>


Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Thanks Gour for prompt reply

BTW - Creating an empty slider-client.xml (with just
<configuration></configuration>) does not work. The AM starts but fails to
create any components and shows errors like

2016-07-28 23:18:46,018
[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
 zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Also, the command "slider destroy <app>" fails with zookeeper errors ...

I had to keep a minimal slider-client.xml. It does not have any RM info
etc. but does contain slider ZK related properties like
"slider.zookeeper.quorum", "hadoop.registry.zk.quorum",
"hadoop.registry.zk.root". I haven't yet distilled the absolute minimal set
of properties required, but this should suffice for now. All RM / HDFS
properties will be read from HADOOP_CONF_DIR files.
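For reference, the minimal slider-client.xml I kept looks roughly like the
sketch below. The hostnames, ports, and the registry root value here are
placeholders, not my actual cluster values:

```xml
<configuration>
  <!-- ZK quorum used by Slider itself (placeholder hosts) -->
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <!-- ZK quorum for the YARN/Hadoop registry -->
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <!-- Root znode under which the registry stores application records -->
  <property>
    <name>hadoop.registry.zk.root</name>
    <value>/registry</value>
  </property>
</configuration>
```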

Let me know if this could cause any issues.

On Thu, Jul 28, 2016 at 3:36 PM, Gour Saha <gs...@hortonworks.com> wrote:

> No need to copy any files. Pointing HADOOP_CONF_DIR to /etc/hadoop/conf is
> good.
>
> -Gour
>
> On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>
> >Follow up question regarding Gour's comment in earlier thread -
> >
> >Slider is installed on one of the hadoop nodes. SLIDER_HOME/conf directory
> >(say /data/slider/conf) is different than HADOOP_CONF_DIR
> >(/etc/hadoop/conf). Is it required/recommended that files in
> >HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and slider-env.sh script
> >sets
> >HADOOP_CONF_DIR to /data/slider/conf ?
> >
> >Or can the slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf , without
> >copying the files ?
> >
> >Using slider .80 for now, but would like to know recommendation for this
> >and future versions as well.
> >
> >Thanks in advance,
> >
> >Manoj
> >
> >On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel <ma...@gmail.com>
> >wrote:
> >
> >> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs and
> my
> >> analysis of logs.
> >>
> >> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha <gs...@hortonworks.com>
> >>wrote:
> >>
> >>> Please file a JIRA and upload the logs to it.
> >>>
> >>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com> wrote:
> >>>
> >>> >Hi Gour,
> >>> >
> >>> >Can you please reach me using your own email-id? I will then send
> >>>logs to
> >>> >you, along with my analysis - I don't want to send logs on public list
> >>> >
> >>> >Thanks,
> >>> >
> >>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com>
> >>> wrote:
> >>> >
> >>> >> Ok, so this node is not a gateway. It is part of the cluster, which
> >>> >>means
> >>> >> you don't need slider-client.xml at all. Just have HADOOP_CONF_DIR
> >>> >> pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
> >>> >>
> >>> >> So the above simplifies your config setup. It will not solve either
> >>>of
> >>> >>the
> >>> >> 2 problems you are facing.
> >>> >>
> >>> >> Now coming to the 2 issues you are facing, you have to provide
> >>> >>additional
> >>> >> logs for us to understand better. Let's start with -
> >>> >> 1. RM logs (specifically between the time when rm1->rm2 failover is
> >>> >> simulated)
> >>> >> 2. Slider App logs
> >>> >>
> >>> >> -Gour
> >>> >>
> >>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com>
> wrote:
> >>> >>
> >>> >> >   1. Not clear about your question on "gateway" node. The node
> >>> running
> >>> >> >   slider is part of the hadoop cluster and there are other
> >>>services
> >>> >>like
> >>> >> >   Oozie that run on this node that utilizes hdfs and yarn. So if
> >>>your
> >>> >> >   question is whether the node is otherwise working for HDFS and
> >>>Yarn
> >>> >> >   configuration, it is working
> >>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
> >>>/etc/hadoop/conf)
> >>> to
> >>> >> >the
> >>> >> >   directory containing slider-client.xml (say /data/latest/conf)
> >>> >> >   3. In earlier email, I had done a mistake where slider-env.sh
> >>>file
> >>> >> >HADOOP_CONF_DIR
> >>> >> >   was pointing to original directory /etc/hadoop/conf. I edited
> >>>it to
> >>> >> >   point to same directory containing slider-client.xml &
> >>> slider-env.sh
> >>> >> >i.e.
> >>> >> >   /data/latest/conf
> >>> >> >   4. I emptied slider-client.xml. It just had the
> >>> >> ><configuration></configuration>.
> >>> >> >   The creation of apps worked but the Slider AM still shows the
> >>>same
> >>> >> >issue.
> >>> >> >   i.e. when RM1 goes from active to standby, slider AM goes from
> >>> >>RUNNING
> >>> >> >to
> >>> >> >   ACCEPTED state with same error about TOKEN. Also NOTE that when
> >>> >> >   slider-client.xml is empty, the "slider destroy xxx" command
> >>>still
> >>> >> >fails
> >>> >> >   with Zookeeper connection errors.
> >>> >> >   5. I then added same parameters (as my last email - except
> >>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
> >>> >>slider-env.sh
> >>> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
> >>> >>slider-client.xml
> >>> >> >   does not have HADOOP_CONF_DIR. The same issue exists (but
> >>>"slider
> >>> >> >   destroy" does not fail)
> >>> >> >   6. Could you explain what do you expect to pick up from Hadoop
> >>> >> >   configurations that will help you in RM Token ? If slider has
> >>>token
> >>> >> >from
> >>> >> >   RM1, and it switches to RM2, not clear what slider does to get
> >>> >> >delegation
> >>> >> >   token for RM2 communication ?
> >>> >> >   7. It is worth repeating again that issue happens only when RM1
> >>>was
> >>> >> >   active when slider app was created and then RM1 becomes
> >>>standby. If
> >>> >> >RM2 was
> >>> >> >   active when slider app was created, then slider AM keeps running
> >>> for
> >>> >> >any
> >>> >> >   number of switches between RM2 and RM1 back and forth ...
> >>> >> >
> >>> >> >
> >>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha <gs...@hortonworks.com>
> >>> >>wrote:
> >>> >> >
> >>> >> >> The node you are running slider from, is that a gateway node?
> >>>Sorry
> >>> >>for
> >>> >> >> not being explicit. I meant copy everything under
> >>>/etc/hadoop/conf
> >>> >>from
> >>> >> >> your cluster into some temp directory (say /tmp/hadoop_conf) in
> >>>your
> >>> >> >> gateway node or local or whichever node you are running slider
> >>>from.
> >>> >> >>Then
> >>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out
> >>> from
> >>> >> >> slider-client.xml.
> >>> >> >>
> >>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com>
> >>> wrote:
> >>> >> >>
> >>> >> >> >Hi Gour,
> >>> >> >> >
> >>> >> >> >Thanks for your prompt reply.
> >>> >> >> >
> >>> >> >> >FYI, issue happens when I create slider app when rm1 is active
> >>>and
> >>> >>when
> >>> >> >> >rm1
> >>> >> >> >fails over to rm2. As soon as rm2 becomes active; the slider AM
> >>> goes
> >>> >> >>from
> >>> >> >> >RUNNING to ACCEPTED state with above error.
> >>> >> >> >
> >>> >> >> >For your suggestion, I did following
> >>> >> >> >
> >>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
> >>> >> >> >HADOOP_CONF_DIR
> >>> >> >> >to slider conf directory.
> >>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
> >>> >> >> >3) I removed all properties from slider-client.xml EXCEPT
> >>>following
> >>> >> >> >
> >>> >> >> >   - HADOOP_CONF_DIR
> >>> >> >> >   - slider.yarn.queue
> >>> >> >> >   - slider.zookeeper.quorum
> >>> >> >> >   - hadoop.registry.zk.quorum
> >>> >> >> >   - hadoop.registry.zk.root
> >>> >> >> >   - hadoop.security.authorization
> >>> >> >> >   - hadoop.security.authentication
> >>> >> >> >
> >>> >> >> >Then I made rm1 active, installed and created slider app and
> >>> >>restarted
> >>> >> >>rm1
> >>> >> >> >(to make rm2) active. The slider-am again went from RUNNING to
> >>> >>ACCEPTED
> >>> >> >> >state.
> >>> >> >> >
> >>> >> >> >Let me know if you want me to try further changes.
> >>> >> >> >
> >>> >> >> >If I make the slider-client.xml completely empty per your
> >>> >>suggestion,
> >>> >> >>only
> >>> >> >> >slider AM comes up but it
> >>> >> >> >fails to start components. The AM log shows errors trying to
> >>> >>connect to
> >>> >> >> >zookeeper like below.
> >>> >> >> >2016-07-25 23:07:41,532
> >>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> >>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null, unexpected
> >>> >>error,
> >>> >> >> >closing socket connection and attempting reconnect
> >>> >> >> >java.net.ConnectException: Connection refused
> >>> >> >> >
> >>> >> >> >Hence I kept minimal info in slider-client.xml
> >>> >> >> >
> >>> >> >> >FYI This is slider version 0.80
> >>> >> >> >
> >>> >> >> >Thanks,
> >>> >> >> >
> >>> >> >> >Manoj
> >>> >> >> >
> >>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
> >>><gs...@hortonworks.com>
> >>> >> >>wrote:
> >>> >> >> >
> >>> >> >> >> If possible, can you copy the entire content of the directory
> >>> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in
> >>>slider-env.sh to
> >>> >>it.
> >>> >> >> >>Keep
> >>> >> >> >> slider-client.xml empty.
> >>> >> >> >>
> >>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
> >>>failovers,
> >>> >>do
> >>> >> >>you
> >>> >> >> >> see the same behaviors?
> >>> >> >> >>
> >>> >> >> >> -Gour
> >>> >> >> >>
> >>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com>
> >>> >>wrote:
> >>> >> >> >>
> >>> >> >> >> >Another observation (whatever it is worth)
> >>> >> >> >> >
> >>> >> >> >> >If slider app is created and started when rm2 was active,
> >>>then
> >>> it
> >>> >> >> >>seems to
> >>> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
> >>> >> >> >> >
> >>> >> >> >> >* rm2 is active
> >>> >> >> >> >* create and start slider application
> >>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
> >>> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
> >>> >> >> >> >
> >>> >> >> >> >So, it seems if it starts with rm1 active, then the AM goes
> >>>to
> >>> >> >> >>"ACCEPTED"
> >>> >> >> >> >state when RM fails to rm2. If it starts with rm2 active,
> >>>then
> >>> it
> >>> >> >>runs
> >>> >> >> >> >fine
> >>> >> >> >> >with any switches between rm1 and rm2.
> >>> >> >> >> >
> >>> >> >> >> >Any feedback ?
> >>> >> >> >> >
> >>> >> >> >> >Thanks,
> >>> >> >> >> >
> >>> >> >> >> >Manoj
> >>> >> >> >> >
> >>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
> >>> >> >> >><ma...@gmail.com>
> >>> >> >> >> >wrote:
> >>> >> >> >> >
> >>> >> >> >> >> Setup
> >>> >> >> >> >>
> >>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >>> >> >> >> >> - Slider 0.80
> >>> >> >> >> >> - In my slider-client.xml, I have added all RM HA
> >>>properties,
> >>> >> >> >>including
> >>> >> >> >> >> the ones mentioned in
> >>> >> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
> >>> >> >> >> >>
> >>> >> >> >> >> Following is the issue
> >>> >> >> >> >>
> >>> >> >> >> >> * rm1 is active, rm2 is standby
> >>> >> >> >> >> * deploy and start slider application, it runs fine
> >>> >> >> >> >> * restart rm1, rm2 is now active.
> >>> >> >> >> >> * The slider-am now goes from running into "ACCEPTED"
> >>>mode. It
> >>> >> >>stays
> >>> >> >> >> >>there
> >>> >> >> >> >> till rm1 is made active again.
> >>> >> >> >> >>
> >>> >> >> >> >> In the slider-am log, it tries to connect to RM2 and
> >>> connection
> >>> >> >>fails
> >>> >> >> >> >>due
> >>> >> >> >> >> to org.apache.hadoop.security.AccessControlException:
> >>>Client
> >>> >> >>cannot
> >>> >> >> >> >> authenticate via:[TOKEN]. See detailed log below
> >>> >> >> >> >>
> >>> >> >> >> >>  It seems it has some token (delegation token?) for RM1 but
> >>> >>tries
> >>> >> >>to
> >>> >> >> >>use
> >>> >> >> >> >> same(?) for RM2 and fails. Am I missing some configuration
> >>>???
> >>> >> >> >> >>
> >>> >> >> >> >> Thanks,
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  retry.RetryInvocationHandler - Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over attempts. Trying to fail over immediately.
> >>> >> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
> >>> >> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> >>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
> >>> >> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >>> >> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> >>> >> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> >>> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> >>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> >>> >> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> >>> >> >> >> >> Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> >>> >> >> >> >>         ... 12 more
> >>> >> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
> >>> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
> >>> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >>> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >>> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >>> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
> >>> >> >> >> >>         ... 15 more
> >>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Gour Saha <gs...@hortonworks.com>.
No need to copy any files. Pointing HADOOP_CONF_DIR to /etc/hadoop/conf is
good.
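
In other words, slider-env.sh only needs something like the sketch below
(the JAVA_HOME path is an example, not a requirement):

```shell
# slider-env.sh (sketch) -- point Slider at the cluster's Hadoop config
# instead of copying core-site.xml, hdfs-site.xml, yarn-site.xml, etc.
# into SLIDER_HOME/conf.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Example JDK location; keep whatever your hosts actually use.
export JAVA_HOME=${JAVA_HOME:-/usr/java/default}
```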

-Gour

On 7/28/16, 3:24 PM, "Manoj Samel" <ma...@gmail.com> wrote:

>Follow up question regarding Gour's comment in earlier thread -
>
>Slider is installed on one of the hadoop nodes. SLIDER_HOME/conf directory
>(say /data/slider/conf) is different than HADOOP_CONF_DIR
>(/etc/hadoop/conf). Is it required/recommended that files in
>HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and slider-env.sh script
>sets
>HADOOP_CONF_DIR to /data/slider/conf ?
>
>Or can the slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf , without
>copying the files ?
>
>Using slider .80 for now, but would like to know recommendation for this
>and future versions as well.
>
>Thanks in advance,
>
>Manoj
>
>On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel <ma...@gmail.com>
>wrote:
>
>> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs and my
>> analysis of logs.
>>
>> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha <gs...@hortonworks.com>
>>wrote:
>>
>>> Please file a JIRA and upload the logs to it.
>>>
>>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com> wrote:
>>>
>>> >Hi Gour,
>>> >
>>> >Can you please reach me using your own email-id? I will then send
>>>logs to
>>> >you, along with my analysis - I don't want to send logs on public list
>>> >
>>> >Thanks,
>>> >
>>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com>
>>> wrote:
>>> >
>>> >> Ok, so this node is not a gateway. It is part of the cluster, which
>>> >>means
>>> >> you don't need slider-client.xml at all. Just have HADOOP_CONF_DIR
>>> >> pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
>>> >>
>>> >> So the above simplifies your config setup. It will not solve either
>>>of
>>> >>the
>>> >> 2 problems you are facing.
>>> >>
>>> >> Now coming to the 2 issues you are facing, you have to provide
>>> >>additional
>>> >> logs for us to understand better. Let's start with -
>>> >> 1. RM logs (specifically between the time when rm1->rm2 failover is
>>> >> simulated)
>>> >> 2. Slider App logs
>>> >>
>>> >> -Gour
>>> >>
>>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>> >>
>>> >> >   1. Not clear about your question on "gateway" node. The node
>>> running
>>> >> >   slider is part of the hadoop cluster and there are other
>>>services
>>> >>like
>>> >> >   Oozie that run on this node that utilizes hdfs and yarn. So if
>>>your
>>> >> >   question is whether the node is otherwise working for HDFS and
>>>Yarn
>>> >> >   configuration, it is working
>>> >> >   2. I copied all files from HADOOP_CONF_DIR (say
>>>/etc/hadoop/conf)
>>> to
>>> >> >the
>>> >> >   directory containing slider-client.xml (say /data/latest/conf)
>>> >> >   3. In earlier email, I had done a mistake where slider-env.sh
>>>file
>>> >> >HADOOP_CONF_DIR
>>> >> >   was pointing to original directory /etc/hadoop/conf. I edited
>>>it to
>>> >> >   point to same directory containing slider-client.xml &
>>> slider-env.sh
>>> >> >i.e.
>>> >> >   /data/latest/conf
>>> >> >   4. I emptied slider-client.xml. It just had the
>>> >> ><configuration></configuration>.
>>> >> >   The creation of apps worked but the Slider AM still shows the
>>>same
>>> >> >issue.
>>> >> >   i.e. when RM1 goes from active to standby, slider AM goes from
>>> >>RUNNING
>>> >> >to
>>> >> >   ACCEPTED state with same error about TOKEN. Also NOTE that when
>>> >> >   slider-client.xml is empty, the "slider destroy xxx" command
>>>still
>>> >> >fails
>>> >> >   with Zookeeper connection errors.
>>> >> >   5. I then added same parameters (as my last email - except
>>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
>>> >>slider-env.sh
>>> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
>>> >>slider-client.xml
>>> >> >   does not have HADOOP_CONF_DIR. The same issue exists (but
>>>"slider
>>> >> >   destroy" does not fail)
>>> >> >   6. Could you explain what do you expect to pick up from Hadoop
>>> >> >   configurations that will help you in RM Token ? If slider has
>>>token
>>> >> >from
>>> >> >   RM1, and it switches to RM2, not clear what slider does to get
>>> >> >delegation
>>> >> >   token for RM2 communication ?
>>> >> >   7. It is worth repeating again that issue happens only when RM1
>>>was
>>> >> >   active when slider app was created and then RM1 becomes
>>>standby. If
>>> >> >RM2 was
>>> >> >   active when slider app was created, then slider AM keeps running
>>> for
>>> >> >any
>>> >> >   number of switches between RM2 and RM1 back and forth ...
>>> >> >
>>> >> >
>>> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha <gs...@hortonworks.com>
>>> >>wrote:
>>> >> >
>>> >> >> The node you are running slider from, is that a gateway node?
>>>Sorry
>>> >>for
>>> >> >> not being explicit. I meant copy everything under
>>>/etc/hadoop/conf
>>> >>from
>>> >> >> your cluster into some temp directory (say /tmp/hadoop_conf) in
>>>your
>>> >> >> gateway node or local or whichever node you are running slider
>>>from.
>>> >> >>Then
>>> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out
>>> from
>>> >> >> slider-client.xml.
>>> >> >>
>>> >> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> >Hi Gour,
>>> >> >> >
>>> >> >> >Thanks for your prompt reply.
>>> >> >> >
>>> >> >> >FYI, issue happens when I create slider app when rm1 is active
>>>and
>>> >>when
>>> >> >> >rm1
>>> >> >> >fails over to rm2. As soon as rm2 becomes active; the slider AM
>>> goes
>>> >> >>from
>>> >> >> >RUNNING to ACCEPTED state with above error.
>>> >> >> >
>>> >> >> >For your suggestion, I did following
>>> >> >> >
>>> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
>>> >> >> >HADOOP_CONF_DIR
>>> >> >> >to slider conf directory.
>>> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
>>> >> >> >3) I removed all properties from slider-client.xml EXCEPT
>>>following
>>> >> >> >
>>> >> >> >   - HADOOP_CONF_DIR
>>> >> >> >   - slider.yarn.queue
>>> >> >> >   - slider.zookeeper.quorum
>>> >> >> >   - hadoop.registry.zk.quorum
>>> >> >> >   - hadoop.registry.zk.root
>>> >> >> >   - hadoop.security.authorization
>>> >> >> >   - hadoop.security.authentication
>>> >> >> >
>>> >> >> >Then I made rm1 active, installed and created slider app and
>>> >>restarted
>>> >> >>rm1
>>> >> >> >(to make rm2) active. The slider-am again went from RUNNING to
>>> >>ACCEPTED
>>> >> >> >state.
>>> >> >> >
>>> >> >> >Let me know if you want me to try further changes.
>>> >> >> >
>>> >> >> >If I make the slider-client.xml completely empty per your
>>> >>suggestion,
>>> >> >>only
>>> >> >> >slider AM comes up but it
>>> >> >> >fails to start components. The AM log shows errors trying to
>>> >>connect to
>>> >> >> >zookeeper like below.
>>> >> >> >2016-07-25 23:07:41,532
>>> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>>> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null, unexpected
>>> >>error,
>>> >> >> >closing socket connection and attempting reconnect
>>> >> >> >java.net.ConnectException: Connection refused
>>> >> >> >
>>> >> >> >Hence I kept minimal info in slider-client.xml
>>> >> >> >
>>> >> >> >FYI This is slider version 0.80
>>> >> >> >
>>> >> >> >Thanks,
>>> >> >> >
>>> >> >> >Manoj
>>> >> >> >
>>> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha
>>><gs...@hortonworks.com>
>>> >> >>wrote:
>>> >> >> >
>>> >> >> >> If possible, can you copy the entire content of the directory
>>> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in
>>>slider-env.sh to
>>> >>it.
>>> >> >> >>Keep
>>> >> >> >> slider-client.xml empty.
>>> >> >> >>
>>> >> >> >> Now when you do the same rm1->rm2 and then the reverse
>>>failovers,
>>> >>do
>>> >> >>you
>>> >> >> >> see the same behaviors?
>>> >> >> >>
>>> >> >> >> -Gour
>>> >> >> >>
>>> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com>
>>> >>wrote:
>>> >> >> >>
>>> >> >> >> >Another observation (whatever it is worth)
>>> >> >> >> >
>>> >> >> >> >If slider app is created and started when rm2 was active,
>>>then
>>> it
>>> >> >> >>seems to
>>> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
>>> >> >> >> >
>>> >> >> >> >* rm2 is active
>>> >> >> >> >* create and start slider application
>>> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
>>> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
>>> >> >> >> >
>>> >> >> >> >So, it seems if it starts with rm1 active, then the AM goes
>>>to
>>> >> >> >>"ACCEPTED"
>>> >> >> >> >state when RM fails to rm2. If it starts with rm2 active,
>>>then
>>> it
>>> >> >>runs
>>> >> >> >> >fine
>>> >> >> >> >with any switches between rm1 and rm2.
>>> >> >> >> >
>>> >> >> >> >Any feedback ?
>>> >> >> >> >
>>> >> >> >> >Thanks,
>>> >> >> >> >
>>> >> >> >> >Manoj
>>> >> >> >> >
>>> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>>> >> >> >><ma...@gmail.com>
>>> >> >> >> >wrote:
>>> >> >> >> >
>>> >> >> >> >> Setup
>>> >> >> >> >>
>>> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
>>> >> >> >> >> - Slider 0.80
>>> >> >> >> >> - In my slider-client.xml, I have added all RM HA
>>>properties,
>>> >> >> >>including
>>> >> >> >> >> the ones mentioned in
>>> >> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
>>> >> >> >> >>
>>> >> >> >> >> Following is the issue
>>> >> >> >> >>
>>> >> >> >> >> * rm1 is active, rm2 is standby
>>> >> >> >> >> * deploy and start slider application, it runs fine
>>> >> >> >> >> * restart rm1, rm2 is now active.
>>> >> >> >> >> * The slider-am now goes from running into "ACCEPTED"
>>>mode. It
>>> >> >>stays
>>> >> >> >> >>there
>>> >> >> >> >> till rm1 is made active again.
>>> >> >> >> >>
>>> >> >> >> >> In the slider-am log, it tries to connect to RM2 and
>>> connection
>>> >> >>fails
>>> >> >> >> >>due
>>> >> >> >> >> to org.apache.hadoop.security.AccessControlException:
>>>Client
>>> >> >>cannot
>>> >> >> >> >> authenticate via:[TOKEN]. See detailed log below
>>> >> >> >> >>
>>> >> >> >> >>  It seems it has some token (delegation token?) for RM1 but
>>> >>tries
>>> >> >>to
>>> >> >> >>use
>>> >> >> >> >> same(?) for RM2 and fails. Am I missing some configuration
>>>???
>>> >> >> >> >>
>>> >> >> >> >> Thanks,
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
>>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to
>>> rm2
>>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>>> >> >> >> >>  security.UserGroupInformation - PriviledgedActionException
>>> >> >> >>as:abc@XYZ
>>> >> >> >> >> (auth:KERBEROS)
>>> >> >> >>cause:org.apache.hadoop.security.AccessControlException:
>>> >> >> >> >> Client cannot authenticate via:[TOKEN]
>>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>>> >> >>ipc.Client -
>>> >> >> >> >> Exception encountered while connecting to the server :
>>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
>>> >>cannot
>>> >> >> >> >> authenticate via:[TOKEN]
>>> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>>> >> >> >> >>  security.UserGroupInformation - PriviledgedActionException
>>> >> >> >>as:abc@XYZ
>>> >> >> >> >> (auth:KERBEROS) cause:java.io.IOException:
>>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
>>> >>cannot
>>> >> >> >> >> authenticate via:[TOKEN]
>>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>>> >> >> >> >>  retry.RetryInvocationHandler - Exception while invoking
>>> >>allocate
>>> >> >>of
>>> >> >> >> >>class
>>> >> >> >> >> ApplicationMasterProtocolPBClientImpl over rm2 after 287
>>>fail
>>> >>over
>>> >> >> >> >> attempts. Trying to fail over immediately.
>>> >> >> >> >> java.io.IOException: Failed on local exception:
>>> >> >>java.io.IOException:
>>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
>>> >>cannot
>>> >> >> >> >> authenticate via:[TOKEN]; Host Details : local host is:
>>> >>"<SliderAM
>>> >> >> >> >> HOST>/<slider AM Host IP>"; destination host is: "<RM2
>>> >> >>HOST>":23130;
>>> >> >> >> >>         at
>>> >> >> >> 
>>>>>org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>>> >> >> >> >>         at
>>>org.apache.hadoop.ipc.Client.call(Client.java:1476)
>>> >> >> >> >>         at
>>>org.apache.hadoop.ipc.Client.call(Client.java:1403)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufR
>>>>>>>>>>>pcE
>>> >>>>>>>>ng
>>> >> >>>>>>in
>>> >> >> >>>>e.
>>> >> >> >> >>java:230)
>>> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProto
>>>>>>>>>>>col
>>> >>>>>>>>PB
>>> >> >>>>>>Cl
>>> >> >> >>>>ie
>>> >> >> >> 
>>>>>ntImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>>> >> >> >> >>         at
>>> sun.reflect.GeneratedMethodAccessor10.invoke(Unknown
>>> >> >> >>Source)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethod
>>>>>>>>>>>Acc
>>> >>>>>>>>es
>>> >> >>>>>>so
>>> >> >> >>>>rI
>>> >> >> >> >>mpl.java:43)
>>> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(R
>>>>>>>>>>>etr
>>> >>>>>>>>yI
>>> >> >>>>>>nv
>>> >> >> >>>>oc
>>> >> >> >> >>ationHandler.java:252)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryIn
>>>>>>>>>>>voc
>>> >>>>>>>>at
>>> >> >>>>>>io
>>> >> >> >>>>nH
>>> >> >> >> >>andler.java:104)
>>> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(A
>>>>>>>>>>>MRM
>>> >>>>>>>>Cl
>>> >> >>>>>>ie
>>> >> >> >>>>nt
>>> >> >> >> >>Impl.java:278)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl
>>>>>>>>>>>$He
>>> >>>>>>>>ar
>>> >> >>>>>>tb
>>> >> >> >>>>ea
>>> >> >> >> >>tThread.run(AMRMClientAsyncImpl.java:224)
>>> >> >> >> >> Caused by: java.io.IOException:
>>> >> >> >> >> org.apache.hadoop.security.AccessControlException: Client
>>> >>cannot
>>> >> >> >> >> authenticate via:[TOKEN]
>>> >> >> >> >>         at
>>> >> >> >> 
>>>>>org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>>> >> >> >> >>         at
>>>java.security.AccessController.doPrivileged(Native
>>> >> >>Method)
>>> >> >> >> >>         at
>>>javax.security.auth.Subject.doAs(Subject.java:422)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupIn
>>>>>>>>>>>for
>>> >>>>>>>>ma
>>> >> >>>>>>ti
>>> >> >> >>>>on
>>> >> >> >> >>.java:1671)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFail
>>>>>>>>>>>ure
>>> >>>>>>>>(C
>>> >> >>>>>>li
>>> >> >> >>>>en
>>> >> >> >> >>t.java:645)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java
>>>>>>>>>:73
>>> >>>>>>3)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> 
>>>>>org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>>> >> >> >> >>         at
>>> >> >> >>org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>>> >> >> >> >>         at
>>>org.apache.hadoop.ipc.Client.call(Client.java:1442)
>>> >> >> >> >>         ... 12 more
>>> >> >> >> >> Caused by:
>>>org.apache.hadoop.security.AccessControlException:
>>> >> >>Client
>>> >> >> >> >> cannot authenticate via:[TOKEN]
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRp
>>>>>>>>>>>cCl
>>> >>>>>>>>ie
>>> >> >>>>>>nt
>>> >> >> >>>>.j
>>> >> >> >> >>ava:172)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClie
>>>>>>>>>>>nt.
>>> >>>>>>>>ja
>>> >> >>>>>>va
>>> >> >> >>>>:3
>>> >> >> >> >>96)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Clie
>>>>>>>>>>>nt.
>>> >>>>>>>>ja
>>> >> >>>>>>va
>>> >> >> >>>>:5
>>> >> >> >> >>55)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> 
>>>>>org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>>> >> >> >> >>         at
>>> >> >> >> 
>>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>>> >> >> >> >>         at
>>> >> >> >> 
>>>>>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>>> >> >> >> >>         at
>>>java.security.AccessController.doPrivileged(Native
>>> >> >>Method)
>>> >> >> >> >>         at
>>>javax.security.auth.Subject.doAs(Subject.java:422)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupIn
>>>>>>>>>>>for
>>> >>>>>>>>ma
>>> >> >>>>>>ti
>>> >> >> >>>>on
>>> >> >> >> >>.java:1671)
>>> >> >> >> >>         at
>>> >> >> >> >>
>>> >> >>
>>> >>
>>>
>>> 
>>>>>>>>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java
>>>>>>>>>:72
>>> >>>>>>0)
>>> >> >> >> >>         ... 15 more
>>> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>>> >> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to
>>> rm1
>>> >> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >>
>>>
>>>
>>


Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Follow-up question regarding Gour's comment in the earlier thread -

Slider is installed on one of the Hadoop nodes. The SLIDER_HOME/conf
directory (say /data/slider/conf) is different from HADOOP_CONF_DIR
(/etc/hadoop/conf). Is it required/recommended that the files in
HADOOP_CONF_DIR be copied to SLIDER_HOME/conf and that the slider-env.sh
script set HADOOP_CONF_DIR to /data/slider/conf?

Or can slider-env.sh set HADOOP_CONF_DIR to /etc/hadoop/conf, without
copying the files?

Using Slider 0.80 for now, but I would like to know the recommendation for
this and future versions as well.

Thanks in advance,

Manoj
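For concreteness, the second alternative (no copying) comes down to what slider-env.sh exports. A minimal sketch, assuming the stock /etc/hadoop/conf layout (paths are examples, not taken from this thread):

```shell
# Hypothetical slider-env.sh -- paths are examples, adjust to your layout.

# Point Slider directly at the cluster's Hadoop client configuration
# instead of copying core-site.xml, hdfs-site.xml, yarn-site.xml around.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# JAVA_HOME as usual; keep whatever the environment already provides.
export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/java}
```

With this in place, slider-client.xml can stay close to empty on a cluster node, since the client inherits the RM HA and ZooKeeper settings from the files under HADOOP_CONF_DIR, which matches Gour's earlier advice.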

On Tue, Jul 26, 2016 at 3:27 PM, Manoj Samel <ma...@gmail.com>
wrote:

> Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs and my
> analysis of logs.
>
> On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> Please file a JIRA and upload the logs to it.
>>
>> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com> wrote:
>>
>> >Hi Gour,
>> >
>> >Can you please reach me using your own email-id? I will then send logs to
>> >you, along with my analysis - I don't want to send logs on public list
>> >
>> >Thanks,
>> >
>> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com>
>> wrote:
>> >
>> >> Ok, so this node is not a gateway. It is part of the cluster, which means
>> >> you don't need slider-client.xml at all. Just have HADOOP_CONF_DIR
>> >> pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
>> >>
>> >> So the above simplifies your config setup. It will not solve either of
>> >> the 2 problems you are facing.
>> >>
>> >> Now coming to the 2 issues you are facing, you have to provide
>> >> additional logs for us to understand better. Let's start with -
>> >> 1. RM logs (specifically between the time when the rm1->rm2 failover is
>> >> simulated)
>> >> 2. Slider App logs
>> >>
>> >> -Gour
>> >>
>> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>> >>
>> >> >   1. Not clear about your question on the "gateway" node. The node
>> >> >   running slider is part of the Hadoop cluster and there are other
>> >> >   services like Oozie that run on this node that utilize HDFS and
>> >> >   YARN. So if your question is whether the node is otherwise working
>> >> >   with the HDFS and YARN configuration, it is working.
>> >> >   2. I copied all files from HADOOP_CONF_DIR (say /etc/hadoop/conf)
>> >> >   to the directory containing slider-client.xml (say
>> >> >   /data/latest/conf).
>> >> >   3. In an earlier email, I had made a mistake where the
>> >> >   slider-env.sh HADOOP_CONF_DIR was pointing to the original
>> >> >   directory /etc/hadoop/conf. I edited it to point to the same
>> >> >   directory containing slider-client.xml & slider-env.sh, i.e.
>> >> >   /data/latest/conf.
>> >> >   4. I emptied slider-client.xml. It just had
>> >> >   <configuration></configuration>. The creation of apps worked but
>> >> >   the Slider AM still shows the same issue, i.e. when RM1 goes from
>> >> >   active to standby, the Slider AM goes from RUNNING to ACCEPTED
>> >> >   state with the same error about TOKEN. Also NOTE that when
>> >> >   slider-client.xml is empty, the "slider destroy xxx" command still
>> >> >   fails with Zookeeper connection errors.
>> >> >   5. I then added the same parameters (as in my last email - except
>> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
>> >> >   slider-env.sh has HADOOP_CONF_DIR pointing to /data/latest/conf
>> >> >   and slider-client.xml does not have HADOOP_CONF_DIR. The same
>> >> >   issue exists (but "slider destroy" does not fail).
>> >> >   6. Could you explain what you expect to be picked up from the
>> >> >   Hadoop configurations that will help with the RM token? If slider
>> >> >   has a token from RM1, and it switches to RM2, it is not clear what
>> >> >   slider does to get a delegation token for RM2 communication.
>> >> >   7. It is worth repeating again that the issue happens only when
>> >> >   RM1 was active when the slider app was created and RM1 then
>> >> >   becomes standby. If RM2 was active when the slider app was
>> >> >   created, then the Slider AM keeps running for any number of
>> >> >   switches between RM2 and RM1 back and forth ...
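As a quick client-side sanity check related to point 6 above, one can confirm that the yarn-site.xml the client actually reads carries the RM HA properties. A sketch (the helper name is illustrative, and it assumes the usual layout where each <name> and its <value> sit on adjacent lines):

```shell
# Hypothetical helper: show the RM HA properties a client config carries.
# Assumes the usual *-site.xml layout with <name> and <value> on
# adjacent lines.
check_rm_ha_conf() {
    # $1 is a Hadoop conf directory, e.g. /etc/hadoop/conf
    grep -A1 'yarn\.resourcemanager\.ha' "$1/yarn-site.xml"
}

# Example (path is illustrative):
# check_rm_ha_conf /etc/hadoop/conf
```

If properties such as yarn.resourcemanager.ha.rm-ids do not show up here, the client cannot know about rm2 at all, regardless of what the cluster-side configuration says.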

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Filed https://issues.apache.org/jira/browse/SLIDER-1158 with logs and my
analysis of logs.

On Tue, Jul 26, 2016 at 10:36 AM, Gour Saha <gs...@hortonworks.com> wrote:

> Please file a JIRA and upload the logs to it.
>
> On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com> wrote:
>
> >Hi Gour,
> >
> >Can you please reach me using your own email-id? I will then send logs to
> >you, along with my analysis - I don't want to send logs on public list
> >
> >Thanks,
> >
> >On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> >> Ok, so this node is not a gateway. It is part of the cluster, which
> >>means
> >> you don¹t need slider-client.xml at all. Just have HADOOP_CONF_DIR
> >> pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
> >>
> >> So the above simplifies your config setup. It will not solve either of
> >>the
> >> 2 problems you are facing.
> >>
> >> Now coming to the 2 issues you are facing, you have to provide
> >>additional
> >> logs for us to understand better. Let¹s start with  -
> >> 1. RM logs (specifically between the time when rm1->rm2 failover is
> >> simulated)
> >> 2. Slider App logs
> >>
> >> -Gour
> >>
> >> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >>
> >> >   1. Not clear about your question on "gateway" node. The node running
> >> >   slider is part of the hadoop cluster and there are other services
> >>like
> >> >   Oozie that run on this node that utilizes hdfs and yarn. So if your
> >> >   question is whether the node is otherwise working for HDFS and Yarn
> >> >   configuration, it is working
> >> >   2. I copied all files from HADOOP_CONF_DIR (say /etc/hadoop/conf) to
> >> >the
> >> >   directory containing slider-client.xml (say /data/latest/conf)
> >> >   3. In an earlier email, I had made a mistake where the slider-env.sh file
> >> >HADOOP_CONF_DIR
> >> >   was pointing to original directory /etc/hadoop/conf. I edited it to
> >> >   point to same directory containing slider-client.xml & slider-env.sh
> >> >i.e.
> >> >   /data/latest/conf
> >> >   4. I emptied slider-client.xml. It just had the
> >> ><configuration></configuration>.
> >> >   The creation of the app worked but the Slider AM still shows the same
> >> >issue.
> >> >   i.e. when RM1 goes from active to standby, slider AM goes from
> >>RUNNING
> >> >to
> >> >   ACCEPTED state with the same error about TOKEN. Also NOTE that when
> >> >   slider-client.xml is empty, the "slider destroy xxx" command still
> >> >fails
> >> >   with Zookeeper connection errors.
> >> >   5. I then added same parameters (as my last email - except
> >> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
> >>slider-env.sh
> >> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
> >>slider-client.xml
> >> >   does not have HADOOP_CONF_DIR. The same issue exists (but "slider
> >> >   destroy" does not fail)
> >> >   6. Could you explain what do you expect to pick up from Hadoop
> >> >   configurations that will help you in RM Token ? If slider has token
> >> >from
> >> >   RM1, and it switches to RM2, not clear what slider does to get
> >> >delegation
> >> >   token for RM2 communication ?
> >> >   7. It is worth repeating again that issue happens only when RM1 was
> >> >   active when slider app was created and then RM1 becomes standby. If
> >> >RM2 was
> >> >   active when slider app was created, then slider AM keeps running for
> >> >any
> >> >   number of switches between RM2 and RM1 back and forth ...
> >> >
> >> >
> >> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha <gs...@hortonworks.com>
> >>wrote:
> >> >
> >> >> The node you are running slider from, is that a gateway node? Sorry
> >>for
> >> >> not being explicit. I meant copy everything under /etc/hadoop/conf
> >>from
> >> >> your cluster into some temp directory (say /tmp/hadoop_conf) in your
> >> >> gateway node or local or whichever node you are running slider from.
> >> >>Then
> >> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out from
> >> >> slider-client.xml.
> >> >>
> >> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >> >>
> >> >> >Hi Gour,
> >> >> >
> >> >> >Thanks for your prompt reply.
> >> >> >
> >> >> >FYI, issue happens when I create slider app when rm1 is active and
> >>when
> >> >> >rm1
> >> >> >fails over to rm2. As soon as rm2 becomes active; the slider AM goes
> >> >>from
> >> >> >RUNNING to ACCEPTED state with above error.
> >> >> >
> >> >> >For your suggestion, I did following
> >> >> >
> >> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
> >> >> >HADOOP_CONF_DIR
> >> >> >to slider conf directory.
> >> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
> >> >> >3) I removed all properties from slider-client.xml EXCEPT following
> >> >> >
> >> >> >   - HADOOP_CONF_DIR
> >> >> >   - slider.yarn.queue
> >> >> >   - slider.zookeeper.quorum
> >> >> >   - hadoop.registry.zk.quorum
> >> >> >   - hadoop.registry.zk.root
> >> >> >   - hadoop.security.authorization
> >> >> >   - hadoop.security.authentication
> >> >> >
> >> >> >Then I made rm1 active, installed and created slider app and
> >>restarted
> >> >>rm1
> >> >> >(to make rm2) active. The slider-am again went from RUNNING to
> >>ACCEPTED
> >> >> >state.
> >> >> >
> >> >> >Let me know if you want me to try further changes.
> >> >> >
> >> >> >If I make the slider-client.xml completely empty per your
> >>suggestion,
> >> >>only
> >> >> >slider AM comes up but it
> >> >> >fails to start components. The AM log shows errors trying to
> >>connect to
> >> >> >zookeeper like below.
> >> >> >2016-07-25 23:07:41,532
> >> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> >> >> >zookeeper.ClientCnxn - Session 0x0 for server null, unexpected
> >>error,
> >> >> >closing socket connection and attempting reconnect
> >> >> >java.net.ConnectException: Connection refused
> >> >> >
> >> >> >Hence I kept minimal info in slider-client.xml
> >> >> >
> >> >> >FYI This is slider version 0.80
> >> >> >
> >> >> >Thanks,
> >> >> >
> >> >> >Manoj
> >> >> >
> >> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha <gs...@hortonworks.com>
> >> >>wrote:
> >> >> >
> >> >> >> If possible, can you copy the entire content of the directory
> >> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to
> >>it.
> >> >> >>Keep
> >> >> >> slider-client.xml empty.
> >> >> >>
> >> >> >> Now when you do the same rm1->rm2 and then the reverse failovers,
> >>do
> >> >>you
> >> >> >> see the same behaviors?
> >> >> >>
> >> >> >> -Gour
> >> >> >>
> >> >> >> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com>
> >>wrote:
> >> >> >>
> >> >> >> >Another observation (whatever it is worth)
> >> >> >> >
> >> >> >> >If slider app is created and started when rm2 was active, then it
> >> >> >>seems to
> >> >> >> >survive switches between rm2 and rm1 (and back). I.e
> >> >> >> >
> >> >> >> >* rm2 is active
> >> >> >> >* create and start slider application
> >> >> >> >* fail over to rm1. Now the Slider AM keeps running
> >> >> >> >* fail over to rm2 again. Slider AM still keeps running
> >> >> >> >
> >> >> >> >So, it seems if it starts with rm1 active, then the AM goes to
> >> >> >>"ACCEPTED"
> >> >> >> >state when RM fails to rm2. If it starts with rm2 active, then it
> >> >>runs
> >> >> >> >fine
> >> >> >> >with any switches between rm1 and rm2.
> >> >> >> >
> >> >> >> >Any feedback ?
> >> >> >> >
> >> >> >> >Thanks,
> >> >> >> >
> >> >> >> >Manoj
> >> >> >> >
> >> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
> >> >> >><ma...@gmail.com>
> >> >> >> >wrote:
> >> >> >> >
> >> >> >> >> Setup
> >> >> >> >>
> >> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >> >> >> >> - Slider 0.80
> >> >> >> >> - In my slider-client.xml, I have added all RM HA properties,
> >> >> >>including
> >> >> >> >> the ones mentioned in
> >> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
> >> >> >> >>
> >> >> >> >> Following is the issue
> >> >> >> >>
> >> >> >> >> * rm1 is active, rm2 is standby
> >> >> >> >> * deploy and start slider application, it runs fine
> >> >> >> >> * restart rm1, rm2 is now active.
> >> >> >> >> * The slider-am now goes from running into "ACCEPTED" mode. It
> >> >>stays
> >> >> >> >>there
> >> >> >> >> till rm1 is made active again.
> >> >> >> >>
> >> >> >> >> In the slider-am log, it tries to connect to RM2 and connection
> >> >>fails
> >> >> >> >>due
> >> >> >> >> to org.apache.hadoop.security.AccessControlException: Client
> >> >>cannot
> >> >> >> >> authenticate via:[TOKEN]. See detailed log below
> >> >> >> >>
> >> >> >> >>  It seems it has some token (delegation token?) for RM1 but
> >>tries
> >> >>to
> >> >> >>use
> >> >> >> >> same(?) for RM2 and fails. Am I missing some configuration ???
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  retry.RetryInvocationHandler - Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over attempts. Trying to fail over immediately.
> >> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
> >> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> >> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> >> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
> >> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> >> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> >> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> >> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
> >> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> >> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> >> >> >> >> Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> >> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
> >> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> >> >> >> >>         ... 12 more
> >> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
> >> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
> >> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
> >> >> >> >>         ... 15 more
> >> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >>
> >>
> >>
>
>

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Gour Saha <gs...@hortonworks.com>.
Please file a JIRA and upload the logs to it.
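
For context, the minimal slider-client.xml discussed in the quoted messages below would look roughly like this; only the property names come from the thread, and the ZooKeeper hosts and registry root are placeholders:

```xml
<configuration>
  <!-- Placeholder values; property names are those listed in the thread. -->
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.root</name>
    <value>/registry</value>
  </property>
</configuration>
```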

On 7/26/16, 10:21 AM, "Manoj Samel" <ma...@gmail.com> wrote:

>Hi Gour,
>
>Can you please reach me using your own email-id? I will then send logs to
>you, along with my analysis - I don't want to send logs on the public list
>
>Thanks,
>
>On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> Ok, so this node is not a gateway. It is part of the cluster, which
>>means
>> you don't need slider-client.xml at all. Just have HADOOP_CONF_DIR
>> pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
>>
>> So the above simplifies your config setup. It will not solve either of
>>the
>> 2 problems you are facing.
>>
>> Now coming to the 2 issues you are facing, you have to provide
>>additional
>> logs for us to understand better. Let's start with -
>> 1. RM logs (specifically between the time when rm1->rm2 failover is
>> simulated)
>> 2. Slider App logs
>>
>> -Gour
>>
>> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>
>> >   1. Not clear about your question on "gateway" node. The node running
>> >   slider is part of the hadoop cluster and there are other services
>>like
>> >   Oozie that run on this node that utilizes hdfs and yarn. So if your
>> >   question is whether the node is otherwise working for HDFS and Yarn
>> >   configuration, it is working
>> >   2. I copied all files from HADOOP_CONF_DIR (say /etc/hadoop/conf) to
>> >the
>> >   directory containing slider-client.xml (say /data/latest/conf)
>> >   3. In an earlier email, I had made a mistake where the slider-env.sh file
>> >HADOOP_CONF_DIR
>> >   was pointing to original directory /etc/hadoop/conf. I edited it to
>> >   point to same directory containing slider-client.xml & slider-env.sh
>> >i.e.
>> >   /data/latest/conf
>> >   4. I emptied slider-client.xml. It just had the
>> ><configuration></configuration>.
>> >   The creation of the app worked but the Slider AM still shows the same
>> >issue.
>> >   i.e. when RM1 goes from active to standby, slider AM goes from
>>RUNNING
>> >to
>> >   ACCEPTED state with the same error about TOKEN. Also NOTE that when
>> >   slider-client.xml is empty, the "slider destroy xxx" command still
>> >fails
>> >   with Zookeeper connection errors.
>> >   5. I then added same parameters (as my last email - except
>> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time
>>slider-env.sh
>> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and
>>slider-client.xml
>> >   does not have HADOOP_CONF_DIR. The same issue exists (but "slider
>> >   destroy" does not fail)
>> >   6. Could you explain what do you expect to pick up from Hadoop
>> >   configurations that will help you in RM Token ? If slider has token
>> >from
>> >   RM1, and it switches to RM2, not clear what slider does to get
>> >delegation
>> >   token for RM2 communication ?
>> >   7. It is worth repeating again that issue happens only when RM1 was
>> >   active when slider app was created and then RM1 becomes standby. If
>> >RM2 was
>> >   active when slider app was created, then slider AM keeps running for
>> >any
>> >   number of switches between RM2 and RM1 back and forth ...
>> >
>> >
>> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha <gs...@hortonworks.com>
>>wrote:
>> >
>> >> The node you are running slider from, is that a gateway node? Sorry
>>for
>> >> not being explicit. I meant copy everything under /etc/hadoop/conf
>>from
>> >> your cluster into some temp directory (say /tmp/hadoop_conf) in your
>> >> gateway node or local or whichever node you are running slider from.
>> >>Then
>> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out from
>> >> slider-client.xml.
>> >>
>> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>> >>
>> >> >Hi Gour,
>> >> >
>> >> >Thanks for your prompt reply.
>> >> >
>> >> >FYI, issue happens when I create slider app when rm1 is active and
>>when
>> >> >rm1
>> >> >fails over to rm2. As soon as rm2 becomes active; the slider AM goes
>> >>from
>> >> >RUNNING to ACCEPTED state with above error.
>> >> >
>> >> >For your suggestion, I did following
>> >> >
>> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
>> >> >HADOOP_CONF_DIR
>> >> >to slider conf directory.
>> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
>> >> >3) I removed all properties from slider-client.xml EXCEPT following
>> >> >
>> >> >   - HADOOP_CONF_DIR
>> >> >   - slider.yarn.queue
>> >> >   - slider.zookeeper.quorum
>> >> >   - hadoop.registry.zk.quorum
>> >> >   - hadoop.registry.zk.root
>> >> >   - hadoop.security.authorization
>> >> >   - hadoop.security.authentication
>> >> >
>> >> >Then I made rm1 active, installed and created slider app and
>>restarted
>> >>rm1
>> >> >(to make rm2) active. The slider-am again went from RUNNING to
>>ACCEPTED
>> >> >state.
>> >> >
>> >> >Let me know if you want me to try further changes.
>> >> >
>> >> >If I make the slider-client.xml completely empty per your
>>suggestion,
>> >>only
>> >> >slider AM comes up but it
>> >> >fails to start components. The AM log shows errors trying to
>>connect to
>> >> >zookeeper like below.
>> >> >2016-07-25 23:07:41,532
>> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>> >> >zookeeper.ClientCnxn - Session 0x0 for server null, unexpected
>>error,
>> >> >closing socket connection and attempting reconnect
>> >> >java.net.ConnectException: Connection refused
>> >> >
>> >> >Hence I kept minimal info in slider-client.xml
>> >> >
>> >> >FYI This is slider version 0.80
>> >> >
>> >> >Thanks,
>> >> >
>> >> >Manoj
>> >> >
>> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha <gs...@hortonworks.com>
>> >>wrote:
>> >> >
>> >> >> If possible, can you copy the entire content of the directory
>> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to
>>it.
>> >> >>Keep
>> >> >> slider-client.xml empty.
>> >> >>
>> >> >> Now when you do the same rm1->rm2 and then the reverse failovers,
>>do
>> >>you
>> >> >> see the same behaviors?
>> >> >>
>> >> >> -Gour
>> >> >>
>> >> >> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com>
>>wrote:
>> >> >>
>> >> >> >Another observation (whatever it is worth)
>> >> >> >
>> >> >> >If slider app is created and started when rm2 was active, then it
>> >> >>seems to
>> >> >> >survive switches between rm2 and rm1 (and back). I.e
>> >> >> >
>> >> >> >* rm2 is active
>> >> >> >* create and start slider application
>> >> >> >* fail over to rm1. Now the Slider AM keeps running
>> >> >> >* fail over to rm2 again. Slider AM still keeps running
>> >> >> >
>> >> >> >So, it seems if it starts with rm1 active, then the AM goes to
>> >> >>"ACCEPTED"
>> >> >> >state when RM fails to rm2. If it starts with rm2 active, then it
>> >>runs
>> >> >> >fine
>> >> >> >with any switches between rm1 and rm2.
>> >> >> >
>> >> >> >Any feedback ?
>> >> >> >
>> >> >> >Thanks,
>> >> >> >
>> >> >> >Manoj
>> >> >> >
>> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>> >> >><ma...@gmail.com>
>> >> >> >wrote:
>> >> >> >
>> >> >> >> Setup
>> >> >> >>
>> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
>> >> >> >> - Slider 0.80
>> >> >> >> - In my slider-client.xml, I have added all RM HA properties,
>> >> >>including
>> >> >> >> the ones mentioned in
>> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
>> >> >> >>
>> >> >> >> Following is the issue
>> >> >> >>
>> >> >> >> * rm1 is active, rm2 is standby
>> >> >> >> * deploy and start slider application, it runs fine
>> >> >> >> * restart rm1, rm2 is now active.
>> >> >> >> * The slider-am now goes from running into "ACCEPTED" mode. It
>> >>stays
>> >> >> >>there
>> >> >> >> till rm1 is made active again.
>> >> >> >>
>> >> >> >> In the slider-am log, it tries to connect to RM2 and connection
>> >>fails
>> >> >> >>due
>> >> >> >> to org.apache.hadoop.security.AccessControlException: Client
>> >>cannot
>> >> >> >> authenticate via:[TOKEN]. See detailed log below
>> >> >> >>
>> >> >> >>  It seems it has some token (delegation token?) for RM1 but
>>tries
>> >>to
>> >> >>use
>> >> >> >> same(?) for RM2 and fails. Am I missing some configuration ???
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
>> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  retry.RetryInvocationHandler - Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over attempts. Trying to fail over immediately.
>> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
>> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
>> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
>> >> >> >> Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>> >> >> >>         ... 12 more
>> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
>> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
>> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
>> >> >> >>         ... 15 more
>> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
>> >> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
>>


Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Hi Gour,

Can you please reach me using your own email-id? I will then send logs to
you, along with my analysis - I don't want to send logs on the public list

Thanks,
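
The token-mismatch theory in this thread (the AM holds an AMRMToken whose service field is bound to rm1's address, so no usable token is found when the client connects to rm2, hence "Client cannot authenticate via:[TOKEN]") can be sketched as a toy model. None of this is actual Hadoop code; `Token`, `select_token`, and `resolve` are invented names for illustration, and the hostnames and port are placeholders:

```python
# Toy model of delegation-token selection by service address. Hadoop's real
# selector matches a token's "service" field against the address of the server
# being contacted; these classes only illustrate that idea.

class Token:
    def __init__(self, kind, service):
        self.kind = kind          # e.g. "AMRM_TOKEN"
        self.service = service    # address (or logical id) the token is bound to

def select_token(tokens, target_service):
    """Return the first token whose service matches the target, else None."""
    for t in tokens:
        if t.service == target_service:
            return t
    return None

def resolve(addr_to_logical, physical):
    """Map a physical RM address to its logical id, if one is configured."""
    return addr_to_logical.get(physical, physical)

# Token minted while rm1 was active, bound to rm1's physical address:
creds = [Token("AMRM_TOKEN", "rm1.example.com:23130")]

# After failover the client contacts rm2; no token matches, so SASL has
# nothing to authenticate with:
assert select_token(creds, "rm2.example.com:23130") is None

# If the token is instead bound to a logical HA id that both RM addresses
# resolve to, selection succeeds regardless of which RM is active:
ha_creds = [Token("AMRM_TOKEN", "yarn-cluster")]
mapping = {"rm1.example.com:23130": "yarn-cluster",
           "rm2.example.com:23130": "yarn-cluster"}
assert select_token(ha_creds, resolve(mapping, "rm2.example.com:23130")) is not None
```

This also matches the observed asymmetry only loosely; the real behavior (working when the app starts under rm2) would depend on how the token service is populated at AM launch.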

On Mon, Jul 25, 2016 at 5:39 PM, Gour Saha <gs...@hortonworks.com> wrote:

> Ok, so this node is not a gateway. It is part of the cluster, which means
> you don't need slider-client.xml at all. Just have HADOOP_CONF_DIR
> pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
>
> So the above simplifies your config setup. It will not solve either of the
> 2 problems you are facing.
>
> Now coming to the 2 issues you are facing, you have to provide additional
> logs for us to understand better. Let's start with -
> 1. RM logs (specifically between the time when rm1->rm2 failover is
> simulated)
> 2. Slider App logs
>
> -Gour
>
> On 7/25/16, 5:16 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>
> >   1. Not clear about your question on "gateway" node. The node running
> >   slider is part of the hadoop cluster and there are other services like
> >   Oozie that run on this node that utilizes hdfs and yarn. So if your
> >   question is whether the node is otherwise working for HDFS and Yarn
> >   configuration, it is working
> >   2. I copied all files from HADOOP_CONF_DIR (say /etc/hadoop/conf) to
> >the
> >   directory containing slider-client.xml (say /data/latest/conf)
> >   3. In an earlier email, I had made a mistake where the slider-env.sh file
> >HADOOP_CONF_DIR
> >   was pointing to original directory /etc/hadoop/conf. I edited it to
> >   point to same directory containing slider-client.xml & slider-env.sh
> >i.e.
> >   /data/latest/conf
> >   4. I emptied slider-client.xml. It just had the
> ><configuration></configuration>.
> >   The creation of the app worked but the Slider AM still shows the same
> >issue.
> >   i.e. when RM1 goes from active to standby, slider AM goes from RUNNING
> >to
> >   ACCEPTED state with the same error about TOKEN. Also NOTE that when
> >   slider-client.xml is empty, the "slider destroy xxx" command still
> >fails
> >   with Zookeeper connection errors.
> >   5. I then added same parameters (as my last email - except
> >   HADOOP_CONF_DIR) to slider-client.xml and ran. This time slider-env.sh
> >   has HADOOP_CONF_DIR pointing to /data/latest/conf and slider-client.xml
> >   does not have HADOOP_CONF_DIR. The same issue exists (but "slider
> >   destroy" does not fail)
> >   6. Could you explain what do you expect to pick up from Hadoop
> >   configurations that will help you in RM Token ? If slider has token
> >from
> >   RM1, and it switches to RM2, not clear what slider does to get
> >delegation
> >   token for RM2 communication ?
> >   7. It is worth repeating again that issue happens only when RM1 was
> >   active when slider app was created and then RM1 becomes standby. If
> >RM2 was
> >   active when slider app was created, then slider AM keeps running for
> >any
> >   number of switches between RM2 and RM1 back and forth ...
> >
> >
> >On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> >> The node you are running slider from, is that a gateway node? Sorry for
> >> not being explicit. I meant copy everything under /etc/hadoop/conf from
> >> your cluster into some temp directory (say /tmp/hadoop_conf) in your
> >> gateway node or local or whichever node you are running slider from.
> >>Then
> >> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out from
> >> slider-client.xml.
> >>
> >> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >>
> >> >Hi Gour,
> >> >
> >> >Thanks for your prompt reply.
> >> >
> >> >FYI, issue happens when I create slider app when rm1 is active and when
> >> >rm1
> >> >fails over to rm2. As soon as rm2 becomes active; the slider AM goes
> >>from
> >> >RUNNING to ACCEPTED state with above error.
> >> >
> >> >For your suggestion, I did following
> >> >
> >> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
> >> >HADOOP_CONF_DIR
> >> >to slider conf directory.
> >> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
> >> >3) I removed all properties from slider-client.xml EXCEPT following
> >> >
> >> >   - HADOOP_CONF_DIR
> >> >   - slider.yarn.queue
> >> >   - slider.zookeeper.quorum
> >> >   - hadoop.registry.zk.quorum
> >> >   - hadoop.registry.zk.root
> >> >   - hadoop.security.authorization
> >> >   - hadoop.security.authentication
> >> >
> >> >Then I made rm1 active, installed and created slider app and restarted
> >>rm1
> >> >(to make rm2) active. The slider-am again went from RUNNING to ACCEPTED
> >> >state.
> >> >
> >> >Let me know if you want me to try further changes.
> >> >
> >> >If I make the slider-client.xml completely empty per your suggestion,
> >>only
> >> >slider AM comes up but it
> >> >fails to start components. The AM log shows errors trying to connect to
> >> >zookeeper like below.
> >> >2016-07-25 23:07:41,532
> >> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> >> >zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
> >> >closing socket connection and attempting reconnect
> >> >java.net.ConnectException: Connection refused
> >> >
> >> >Hence I kept minimal info in slider-client.xml
> >> >
> >> >FYI This is slider version 0.80
> >> >
> >> >Thanks,
> >> >
> >> >Manoj
> >> >
> >> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha <gs...@hortonworks.com>
> >>wrote:
> >> >
> >> >> If possible, can you copy the entire content of the directory
> >> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to it.
> >> >>Keep
> >> >> slider-client.xml empty.
> >> >>
> >> >> Now when you do the same rm1->rm2 and then the reverse failovers, do
> >>you
> >> >> see the same behaviors?
> >> >>
> >> >> -Gour
> >> >>
> >> >> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >> >>
> >> >> >Another observation (whatever it is worth)
> >> >> >
> >> >> >If slider app is created and started when rm2 was active, then it
> >> >>seems to
> >> >> >survive switches between rm2 and rm1 (and back). I.e
> >> >> >
> >> >> >* rm2 is active
> >> >> >* create and start slider application
> >> >> >* fail over to rm1. Now the Slider AM keeps running
> >> >> >* fail over to rm2 again. Slider AM still keeps running
> >> >> >
> >> >> >So, it seems if it starts with rm1 active, then the AM goes to
> >> >>"ACCEPTED"
> >> >> >state when RM fails to rm2. If it starts with rm2 active, then it
> >>runs
> >> >> >fine
> >> >> >with any switches between rm1 and rm2.
> >> >> >
> >> >> >Any feedback ?
> >> >> >
> >> >> >Thanks,
> >> >> >
> >> >> >Manoj
> >> >> >
> >> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
> >> >><ma...@gmail.com>
> >> >> >wrote:
> >> >> >
> >> >> >> Setup
> >> >> >>
> >> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >> >> >> - Slider 0.80
> >> >> >> - In my slider-client.xml, I have added all RM HA properties,
> >> >>including
> >> >> >> the ones mentioned in
> >>http://markmail.org/message/wnhpp2zn6ixo65e3.
> >> >> >>
> >> >> >> Following is the issue
> >> >> >>
> >> >> >> * rm1 is active, rm2 is standby
> >> >> >> * deploy and start slider application, it runs fine
> >> >> >> * restart rm1, rm2 is now active.
> >> >> >> * The slider-am now goes from running into "ACCEPTED" mode. It
> >>stays
> >> >> >>there
> >> >> >> till rm1 is made active again.
> >> >> >>
> >> >> >> In the slider-am log, it tries to connect to RM2 and connection
> >>fails
> >> >> >>due
> >> >> >> to org.apache.hadoop.security.AccessControlException: Client
> >>cannot
> >> >> >> authenticate via:[TOKEN]. See detailed log below
> >> >> >>
> >> >> >>  It seems it has some token (delegation token?) for RM1 but tries
> >>to
> >> >>use
> >> >> >> same(?) for RM2 and fails. Am I missing some configuration ???
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >> >> >>  security.UserGroupInformation - PriviledgedActionException
> >> >>as:abc@XYZ
> >> >> >> (auth:KERBEROS)
> >> >>cause:org.apache.hadoop.security.AccessControlException:
> >> >> >> Client cannot authenticate via:[TOKEN]
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >>ipc.Client -
> >> >> >> Exception encountered while connecting to the server :
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]
> >> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
> >> >> >>  security.UserGroupInformation - PriviledgedActionException
> >> >>as:abc@XYZ
> >> >> >> (auth:KERBEROS) cause:java.io.IOException:
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]
> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
> >> >> >>  retry.RetryInvocationHandler - Exception while invoking allocate of class
> >> >> >> ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over
> >> >> >> attempts. Trying to fail over immediately.
> >> >> >> java.io.IOException: Failed on local exception: java.io.IOException:
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM
> >> >> >> HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
> >> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> >> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> >> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
> >> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> >> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> >> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> >> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
> >> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> >> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> >> >> >> Caused by: java.io.IOException:
> >> >> >> org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> >> authenticate via:[TOKEN]
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> >> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
> >> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> >> >> >>         ... 12 more
> >> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client
> >> >> >> cannot authenticate via:[TOKEN]
> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
> >> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
> >> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
> >> >> >>         ... 15 more
> >> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
> >> >> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
> >> >> >>
> >> >>
> >> >>
> >>
> >>
>
>

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Gour Saha <gs...@hortonworks.com>.
Ok, so this node is not a gateway. It is part of the cluster, which means
you don't need slider-client.xml at all. Just have HADOOP_CONF_DIR
pointing to /etc/hadoop/conf in slider-env.sh and that should be it.
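
As a sketch (paths are illustrative; adjust for your install), slider-env.sh would then only need something like:

```shell
# slider-env.sh -- minimal sketch for a node that is part of the cluster.
# Pointing Slider at the cluster's own Hadoop client configuration lets it
# pick up the RM HA, Kerberos, and ZooKeeper settings from yarn-site.xml,
# core-site.xml, etc. The JAVA_HOME fallback path is a placeholder.
export HADOOP_CONF_DIR=/etc/hadoop/conf
# JAVA_HOME must also be set; keep any value already in the environment.
export JAVA_HOME=${JAVA_HOME:-/usr/java/default}
```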

So the above simplifies your config setup. It will not solve either of the
2 problems you are facing.

Now coming to the 2 issues you are facing, you have to provide additional
logs for us to understand better. Let's start with -
1. RM logs (specifically between the time when rm1->rm2 failover is
simulated)
2. Slider App logs

-Gour



Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
   1. Not clear about your question on "gateway" node. The node running
   slider is part of the hadoop cluster and there are other services like
   Oozie that run on this node that utilizes hdfs and yarn. So if your
   question is whether the node is otherwise working for HDFS and Yarn
   configuration, it is working
   2. I copied all files from HADOOP_CONF_DIR (say /etc/hadoop/conf) to the
   directory containing slider-client.xml (say /data/latest/conf)
   3. In earlier email, I had done a mistake where slider-env.sh file
HADOOP_CONF_DIR
   was pointing to original directory /etc/hadoop/conf. I edited it to
   point to same directory containing slider-client.xml & slider-env.sh i.e.
   /data/latest/conf
   4. I emptied slider-client.xml. It just had the
<configuration></configuration>.
   The creation of apps worked but the Slider AM still shows the same issue.
   i.e. when RM1 goes from active to standby, slider AM goes from RUNNING to
   ACCEPTED state with same error about TOKEN. Also NOTE that when
   slider-client.xml is empty, the "slider destroy xxx" command still fails
   with Zookeeper connection errors.
   5. I then added same parameters (as my last email - except
   HADOOP_CONF_DIR) to slider-client.xml and ran. This time slider-env.sh
   has HADOOP_CONF_DIR pointing to /data/latest/conf and slider-client.xml
   does not have HADOOP_CONF_DIR. The same issue exists (but "slider
   destroy" does not fail)
   6. Could you explain what do you expect to pick up from Hadoop
   configurations that will help you in RM Token ? If slider has token from
   RM1, and it switches to RM2, not clear what slider does to get delegation
   token for RM2 communication ?
   7. It is worth repeating again that issue happens only when RM1 was
   active when slider app was created and then RM1 becomes standby. If RM2 was
   active when slider app was created, then slider AM keeps running for any
   number of switches between RM2 and RM1 back and forth ...
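
For what it's worth, client-side RM failover in YARN HA is driven by the yarn-site.xml visible to the process; as I understand it, the ConfiguredRMFailoverProxyProvider derives both RM addresses from the HA property set, so the AM needs the full set available. A sketch of the relevant properties (host names and the cluster id below are placeholders, not values from this thread):

```xml
<!-- yarn-site.xml (sketch; host names and ids are placeholders) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>rm1.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>rm2.example.com</value>
  </property>
</configuration>
```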


On Mon, Jul 25, 2016 at 4:21 PM, Gour Saha <gs...@hortonworks.com> wrote:

> The node you are running slider from, is that a gateway node? Sorry for
> not being explicit. I meant copy everything under /etc/hadoop/conf from
> your cluster into some temp directory (say /tmp/hadoop_conf) in your
> gateway node or local or whichever node you are running slider from. Then
> set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything out from
> slider-client.xml.
>
> On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>
> >Hi Gour,
> >
> >Thanks for your prompt reply.
> >
> >FYI, issue happens when I create slider app when rm1 is active and when
> >rm1
> >fails over to rm2. As soon as rm2 becomes active; the slider AM goes from
> >RUNNING to ACCEPTED state with above error.
> >
> >For your suggestion, I did following
> >
> >1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
> >HADOOP_CONF_DIR
> >to slider conf directory.
> >2) Our slider-env.sh already had HADOOP_CONF_DIR set
> >3) I removed all properties from slider-client.xml EXCEPT following
> >
> >   - HADOOP_CONF_DIR
> >   - slider.yarn.queue
> >   - slider.zookeeper.quorum
> >   - hadoop.registry.zk.quorum
> >   - hadoop.registry.zk.root
> >   - hadoop.security.authorization
> >   - hadoop.security.authentication
> >
> >Then I made rm1 active, installed and created slider app and restarted rm1
> >(to make rm2) active. The slider-am again went from RUNNING to ACCEPTED
> >state.
> >
> >Let me know if you want me to try further changes.
> >
> >If I make the slider-client.xml completely empty per your suggestion, only
> >slider AM comes up but it
> >fails to start components. The AM log shows errors trying to connect to
> >zookeeper like below.
> >2016-07-25 23:07:41,532
> >[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
> >zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
> >closing socket connection and attempting reconnect
> >java.net.ConnectException: Connection refused
> >
> >Hence I kept minimal info in slider-client.xml
> >
> >FYI This is slider version 0.80
> >
> >Thanks,
> >
> >Manoj
> >
> >On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha <gs...@hortonworks.com> wrote:
> >
> >> If possible, can you copy the entire content of the directory
> >> /etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to it.
> >>Keep
> >> slider-client.xml empty.
> >>
> >> Now when you do the same rm1->rm2 and then the reverse failovers, do you
> >> see the same behaviors?
> >>
> >> -Gour
> >>
> >> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com> wrote:
> >>
> >> >Another observation (whatever it is worth)
> >> >
> >> >If slider app is created and started when rm2 was active, then it
> >>seems to
> >> >survive switches between rm2 and rm1 (and back). I.e
> >> >
> >> >* rm2 is active
> >> >* create and start slider application
> >> >* fail over to rm1. Now the Slider AM keeps running
> >> >* fail over to rm2 again. Slider AM still keeps running
> >> >
> >> >So, it seems if it starts with rm1 active, then the AM goes to
> >>"ACCEPTED"
> >> >state when RM fails to rm2. If it starts with rm2 active, then it runs
> >> >fine
> >> >with any switches between rm1 and rm2.
> >> >
> >> >Any feedback ?
> >> >
> >> >Thanks,
> >> >
> >> >Manoj
> >> >
> >> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
> >><ma...@gmail.com>
> >> >wrote:
> >> >
> >> >> Setup
> >> >>
> >> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >> >> - Slider 0.80
> >> >> - In my slider-client.xml, I have added all RM HA properties,
> >>including
> >> >> the ones mentioned in http://markmail.org/message/wnhpp2zn6ixo65e3.
> >> >>
> >> >> Following is the issue
> >> >>
> >> >> * rm1 is active, rm2 is standby
> >> >> * deploy and start slider application, it runs fine
> >> >> * restart rm1, rm2 is now active.
> >> >> * The slider-am now goes from running into "ACCEPTED" mode. It stays
> >> >>there
> >> >> till rm1 is made active again.
> >> >>
> >> >> In the slider-am log, it tries to connect to RM2 and connection fails
> >> >>due
> >> >> to org.apache.hadoop.security.AccessControlException: Client cannot
> >> >> authenticate via:[TOKEN]. See detailed log below
> >> >>
> >> >>  It seems it has some token (delegation token?) for RM1 but tries to
> >>use
> >> >> same(?) for RM2 and fails. Am I missing some configuration ???
> >> >>
> >> >> Thanks,
> >> >>
> >> >>
> >> >>
> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ (auth:KERBEROS) cause:java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  retry.RetryInvocationHandler - Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over attempts. Trying to fail over immediately.
> >> >> java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
> >> >>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
> >> >>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
> >> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
> >> >>         at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> >> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >>         at java.lang.reflect.Method.invoke(Method.java:497)
> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
> >> >>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> >> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
> >> >>         at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
> >> >>         at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> >> >> Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >>         at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
> >> >>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
> >> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> >> >>         ... 12 more
> >> >> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN]
> >> >>         at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
> >> >>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
> >> >>         at org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
> >> >>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
> >> >>         at java.security.AccessController.doPrivileged(Native Method)
> >> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
> >> >>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
> >> >>         ... 15 more
> >> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
> >>
> >>
>
>

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Gour Saha <gs...@hortonworks.com>.
Is the node you are running slider from a gateway node? Sorry for not
being explicit. I meant copy everything under /etc/hadoop/conf from
your cluster into some temp directory (say /tmp/hadoop_conf) on your
gateway node, local machine, or whichever node you are running slider
from. Then set HADOOP_CONF_DIR to /tmp/hadoop_conf and clear everything
out of slider-client.xml.
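In shell terms, that boils down to something like the following sketch (the paths and variable names are illustrative, not specific to your cluster):

```shell
# Stage the full cluster config (including yarn-site.xml with both RM
# addresses) into a temp directory, then point Slider at it.
# SRC and DST are placeholder names; substitute your cluster's paths.
SRC="${SRC:-/etc/hadoop/conf}"
DST="${DST:-/tmp/hadoop_conf}"
mkdir -p "$DST"
if [ -d "$SRC" ]; then
  cp -r "$SRC"/. "$DST"/
fi
# Normally set in slider-env.sh:
export HADOOP_CONF_DIR="$DST"
```

With HADOOP_CONF_DIR pointing at the full cluster config, slider-client.xml can stay empty.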

On 7/25/16, 4:12 PM, "Manoj Samel" <ma...@gmail.com> wrote:

>Hi Gour,
>
>Thanks for your prompt reply.
>
>FYI, issue happens when I create slider app when rm1 is active and when
>rm1
>fails over to rm2. As soon as rm2 becomes active; the slider AM goes from
>RUNNING to ACCEPTED state with above error.
>
>For your suggestion, I did following
>
>1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
>HADOOP_CONF_DIR
>to slider conf directory.
>2) Our slider-env.sh already had HADOOP_CONF_DIR set
>3) I removed all properties from slider-client.xml EXCEPT following
>
>   - HADOOP_CONF_DIR
>   - slider.yarn.queue
>   - slider.zookeeper.quorum
>   - hadoop.registry.zk.quorum
>   - hadoop.registry.zk.root
>   - hadoop.security.authorization
>   - hadoop.security.authentication
>
>Then I made rm1 active, installed and created slider app and restarted rm1
>(to make rm2) active. The slider-am again went from RUNNING to ACCEPTED
>state.
>
>Let me know if you want me to try further changes.
>
>If I make the slider-client.xml completely empty per your suggestion, only
>slider AM comes up but it
>fails to start components. The AM log shows errors trying to connect to
>zookeeper like below.
>2016-07-25 23:07:41,532
>[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
>zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
>closing socket connection and attempting reconnect
>java.net.ConnectException: Connection refused
>
>Hence I kept minimal info in slider-client.xml
>
>FYI This is slider version 0.80
>
>Thanks,
>
>Manoj
>
>On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
>> If possible, can you copy the entire content of the directory
>> /etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to it.
>>Keep
>> slider-client.xml empty.
>>
>> Now when you do the same rm1->rm2 and then the reverse failovers, do you
>> see the same behaviors?
>>
>> -Gour
>>
>> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>>
>> >Another observation (whatever it is worth)
>> >
>> >If slider app is created and started when rm2 was active, then it
>>seems to
>> >survive switches between rm2 and rm1 (and back). I.e
>> >
>> >* rm2 is active
>> >* create and start slider application
>> >* fail over to rm1. Now the Slider AM keeps running
>> >* fail over to rm2 again. Slider AM still keeps running
>> >
>> >So, it seems if it starts with rm1 active, then the AM goes to
>>"ACCEPTED"
>> >state when RM fails to rm2. If it starts with rm2 active, then it runs
>> >fine
>> >with any switches between rm1 and rm2.
>> >
>> >Any feedback ?
>> >
>> >Thanks,
>> >
>> >Manoj
>> >
>> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel
>><ma...@gmail.com>
>> >wrote:
>> >
>> >> Setup
>> >>
>> >> - Hadoop 2.6 with RM HA, Kerberos enabled
>> >> - Slider 0.80
>> >> - In my slider-client.xml, I have added all RM HA properties,
>>including
>> >> the ones mentioned in http://markmail.org/message/wnhpp2zn6ixo65e3.
>> >>
>> >> Following is the issue
>> >>
>> >> * rm1 is active, rm2 is standby
>> >> * deploy and start slider application, it runs fine
>> >> * restart rm1, rm2 is now active.
>> >> * The slider-am now goes from running into "ACCEPTED" mode. It stays
>> >>there
>> >> till rm1 is made active again.
>> >>
>> >> In the slider-am log, it tries to connect to RM2 and connection fails
>> >>due
>> >> to org.apache.hadoop.security.AccessControlException: Client cannot
>> >> authenticate via:[TOKEN]. See detailed log below
>> >>
>> >>  It seems it has some token (delegation token?) for RM1 but tries to
>>use
>> >> same(?) for RM2 and fails. Am I missing some configuration ???
>> >>
>> >> Thanks,
>> >>
>> >>
>> >>
>> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
>> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
>> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>> >>  security.UserGroupInformation - PriviledgedActionException
>>as:abc@XYZ
>> >> (auth:KERBEROS)
>>cause:org.apache.hadoop.security.AccessControlException:
>> >> Client cannot authenticate via:[TOKEN]
>> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client -
>> >> Exception encountered while connecting to the server :
>> >> org.apache.hadoop.security.AccessControlException: Client cannot
>> >> authenticate via:[TOKEN]
>> >> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>> >>  security.UserGroupInformation - PriviledgedActionException
>>as:abc@XYZ
>> >> (auth:KERBEROS) cause:java.io.IOException:
>> >> org.apache.hadoop.security.AccessControlException: Client cannot
>> >> authenticate via:[TOKEN]
>> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>> >>  retry.RetryInvocationHandler - Exception while invoking allocate of
>> >>class
>> >> ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over
>> >> attempts. Trying to fail over immediately.
>> >> java.io.IOException: Failed on local exception: java.io.IOException:
>> >> org.apache.hadoop.security.AccessControlException: Client cannot
>> >> authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM
>> >> HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
>> >>         at
>> >>org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngin
>>>>e.
>> >>java:230)
>> >>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBCl
>>>>ie
>> >>ntImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>> >>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown
>>Source)
>> >>         at
>> >>
>> 
>>>>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccesso
>>>>rI
>> >>mpl.java:43)
>> >>         at java.lang.reflect.Method.invoke(Method.java:497)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInv
>>>>oc
>> >>ationHandler.java:252)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocatio
>>>>nH
>> >>andler.java:104)
>> >>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClie
>>>>nt
>> >>Impl.java:278)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$Heartb
>>>>ea
>> >>tThread.run(AMRMClientAsyncImpl.java:224)
>> >> Caused by: java.io.IOException:
>> >> org.apache.hadoop.security.AccessControlException: Client cannot
>> >> authenticate via:[TOKEN]
>> >>         at
>> >>org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformati
>>>>on
>> >>.java:1671)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Cli
>>>>en
>> >>t.java:645)
>> >>         at
>> >> 
>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
>> >>         at
>> >> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>> >>         at 
>>org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>> >>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>> >>         ... 12 more
>> >> Caused by: org.apache.hadoop.security.AccessControlException: Client
>> >> cannot authenticate via:[TOKEN]
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient
>>>>.j
>> >>ava:172)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java
>>>>:3
>> >>96)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java
>>>>:5
>> >>55)
>> >>         at
>> >> org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>> >>         at
>> >>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>> >>         at
>> >>org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>> >>         at java.security.AccessController.doPrivileged(Native Method)
>> >>         at javax.security.auth.Subject.doAs(Subject.java:422)
>> >>         at
>> >>
>> 
>>>>org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformati
>>>>on
>> >>.java:1671)
>> >>         at
>> >> 
>>org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
>> >>         ... 15 more
>> >> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>> >>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
>> >>
>>
>>


Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Hi Gour,

Thanks for your prompt reply.

FYI, the issue happens when I create the slider app while rm1 is active and
rm1 then fails over to rm2. As soon as rm2 becomes active, the slider AM
goes from RUNNING to ACCEPTED state with the above error.

For your suggestion, I did the following:

1) Copied core-site, hdfs-site, yarn-site, and mapred-site from
HADOOP_CONF_DIR
to slider conf directory.
2) Our slider-env.sh already had HADOOP_CONF_DIR set
3) I removed all properties from slider-client.xml EXCEPT following

   - HADOOP_CONF_DIR
   - slider.yarn.queue
   - slider.zookeeper.quorum
   - hadoop.registry.zk.quorum
   - hadoop.registry.zk.root
   - hadoop.security.authorization
   - hadoop.security.authentication
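
For reference, the trimmed-down slider-client.xml looked roughly like the sketch below (all values are placeholders, not my real ones; HADOOP_CONF_DIR itself is an environment setting and lives in slider-env.sh rather than here):

```xml
<!-- Minimal slider-client.xml sketch; every value is a placeholder. -->
<configuration>
  <property>
    <name>slider.yarn.queue</name>
    <value>default</value>
  </property>
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>zk1:2181,zk2:2181,zk3:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>zk1:2181,zk2:2181,zk3:2181</value>
  </property>
  <property>
    <name>hadoop.registry.zk.root</name>
    <value>/registry</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
```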

Then I made rm1 active, installed and created the slider app, and restarted
rm1 (to make rm2 active). The slider-am again went from RUNNING to ACCEPTED
state.

Let me know if you want me to try further changes.

If I make slider-client.xml completely empty per your suggestion, the
slider AM comes up, but it fails to start the components. The AM log shows
errors like the one below while trying to connect to zookeeper:
2016-07-25 23:07:41,532
[AmExecutor-006-SendThread(localhost.localdomain:2181)] WARN
zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error,
closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

Hence I kept the minimal info in slider-client.xml.

FYI, this is slider version 0.80.

Thanks,

Manoj

On Mon, Jul 25, 2016 at 2:54 PM, Gour Saha <gs...@hortonworks.com> wrote:

> If possible, can you copy the entire content of the directory
> /etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to it. Keep
> slider-client.xml empty.
>
> Now when you do the same rm1->rm2 and then the reverse failovers, do you
> see the same behaviors?
>
> -Gour
>
> On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com> wrote:
>
> >Another observation (whatever it is worth)
> >
> >If slider app is created and started when rm2 was active, then it seems to
> >survive switches between rm2 and rm1 (and back). I.e
> >
> >* rm2 is active
> >* create and start slider application
> >* fail over to rm1. Now the Slider AM keeps running
> >* fail over to rm2 again. Slider AM still keeps running
> >
> >So, it seems if it starts with rm1 active, then the AM goes to "ACCEPTED"
> >state when RM fails to rm2. If it starts with rm2 active, then it runs
> >fine
> >with any switches between rm1 and rm2.
> >
> >Any feedback ?
> >
> >Thanks,
> >
> >Manoj
> >
> >On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel <ma...@gmail.com>
> >wrote:
> >
> >> Setup
> >>
> >> - Hadoop 2.6 with RM HA, Kerberos enabled
> >> - Slider 0.80
> >> - In my slider-client.xml, I have added all RM HA properties, including
> >> the ones mentioned in http://markmail.org/message/wnhpp2zn6ixo65e3.
> >>
> >> Following is the issue
> >>
> >> * rm1 is active, rm2 is standby
> >> * deploy and start slider application, it runs fine
> >> * restart rm1, rm2 is now active.
> >> * The slider-am now goes from running into "ACCEPTED" mode. It stays
> >>there
> >> till rm1 is made active again.
> >>
> >> In the slider-am log, it tries to connect to RM2 and connection fails
> >>due
> >> to org.apache.hadoop.security.AccessControlException: Client cannot
> >> authenticate via:[TOKEN]. See detailed log below
> >>
> >>  It seems it has some token (delegation token?) for RM1 but tries to use
> >> same(?) for RM2 and fails. Am I missing some configuration ???
> >>
> >> Thanks,

Re: Slider AM fails to run when RM in HA setup fails over

Posted by Gour Saha <gs...@hortonworks.com>.
If possible, can you copy the entire contents of the directory
/etc/hadoop/conf and then set HADOOP_CONF_DIR in slider-env.sh to point to
it? Keep slider-client.xml empty.

Now when you do the same rm1->rm2 and then the reverse failovers, do you
see the same behaviors?

-Gour

On 7/25/16, 2:28 PM, "Manoj Samel" <ma...@gmail.com> wrote:

>Another observation (whatever it is worth)
>
>If slider app is created and started when rm2 was active, then it seems to
>survive switches between rm2 and rm1 (and back). I.e
>
>* rm2 is active
>* create and start slider application
>* fail over to rm1. Now the Slider AM keeps running
>* fail over to rm2 again. Slider AM still keeps running
>
>So, it seems if it starts with rm1 active, then the AM goes to "ACCEPTED"
>state when RM fails to rm2. If it starts with rm2 active, then it runs
>fine
>with any switches between rm1 and rm2.
>
>Any feedback ?
>
>Thanks,
>
>Manoj
>
>On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel <ma...@gmail.com>
>wrote:
>
>> Setup
>>
>> - Hadoop 2.6 with RM HA, Kerberos enabled
>> - Slider 0.80
>> - In my slider-client.xml, I have added all RM HA properties, including
>> the ones mentioned in http://markmail.org/message/wnhpp2zn6ixo65e3.
>>
>> Following is the issue
>>
>> * rm1 is active, rm2 is standby
>> * deploy and start slider application, it runs fine
>> * restart rm1, rm2 is now active.
>> * The slider-am now goes from running into "ACCEPTED" mode. It stays
>>there
>> till rm1 is made active again.
>>
>> In the slider-am log, it tries to connect to RM2 and connection fails
>>due
>> to org.apache.hadoop.security.AccessControlException: Client cannot
>> authenticate via:[TOKEN]. See detailed log below
>>
>>  It seems it has some token (delegation token?) for RM1 but tries to use
>> same(?) for RM2 and fails. Am I missing some configuration ???
>>
>> Thanks,


Re: Slider AM fails to run when RM in HA setup fails over

Posted by Manoj Samel <ma...@gmail.com>.
Another observation (for whatever it is worth):

If the slider app is created and started while rm2 is active, it seems to
survive switches between rm2 and rm1 (and back), i.e.:

* rm2 is active
* create and start slider application
* fail over to rm1. Now the Slider AM keeps running
* fail over to rm2 again. Slider AM still keeps running
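
The active RM at each step above can be confirmed with the stock yarn CLI, along these lines (rm1/rm2 are the ids from yarn.resourcemanager.ha.rm-ids; the fallback message is just for illustration when the CLI is unavailable):

```shell
# Query each RM's HA state with the stock "yarn rmadmin" CLI. The ids
# rm1/rm2 must match yarn.resourcemanager.ha.rm-ids in yarn-site.xml.
check_rm_state() {
  for id in rm1 rm2; do
    # Prints the state (active/standby) on a real cluster; falls back
    # to "<id>: unknown" if the yarn CLI is unavailable or errors out.
    yarn rmadmin -getServiceState "$id" 2>/dev/null || echo "$id: unknown"
  done
}
check_rm_state
```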

So it seems that if the app starts with rm1 active, the AM goes to "ACCEPTED"
state when the RM fails over to rm2; if it starts with rm2 active, it runs
fine across any switches between rm1 and rm2.

Any feedback ?

Thanks,

Manoj

On Mon, Jul 25, 2016 at 12:25 PM, Manoj Samel <ma...@gmail.com>
wrote:

> Setup
>
> - Hadoop 2.6 with RM HA, Kerberos enabled
> - Slider 0.80
> - In my slider-client.xml, I have added all RM HA properties, including
> the ones mentioned in http://markmail.org/message/wnhpp2zn6ixo65e3.
>
> Following is the issue
>
> * rm1 is active, rm2 is standby
> * deploy and start slider application, it runs fine
> * restart rm1, rm2 is now active.
> * The slider-am now goes from running into "ACCEPTED" mode. It stays there
> till rm1 is made active again.
>
> In the slider-am log, it tries to connect to RM2 and connection fails due
> to org.apache.hadoop.security.AccessControlException: Client cannot
> authenticate via:[TOKEN]. See detailed log below
>
>  It seems it has some token (delegation token?) for RM1 but tries to use
> same(?) for RM2 and fails. Am I missing some configuration ???
>
> Thanks,
>
>
>
> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] INFO
>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ
> (auth:KERBEROS) cause:org.apache.hadoop.security.AccessControlException:
> Client cannot authenticate via:[TOKEN]
> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN  ipc.Client -
> Exception encountered while connecting to the server :
> org.apache.hadoop.security.AccessControlException: Client cannot
> authenticate via:[TOKEN]
> 2016-07-25 19:06:48,088 [AMRM Heartbeater thread] WARN
>  security.UserGroupInformation - PriviledgedActionException as:abc@XYZ
> (auth:KERBEROS) cause:java.io.IOException:
> org.apache.hadoop.security.AccessControlException: Client cannot
> authenticate via:[TOKEN]
> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>  retry.RetryInvocationHandler - Exception while invoking allocate of class
> ApplicationMasterProtocolPBClientImpl over rm2 after 287 fail over
> attempts. Trying to fail over immediately.
> java.io.IOException: Failed on local exception: java.io.IOException:
> org.apache.hadoop.security.AccessControlException: Client cannot
> authenticate via:[TOKEN]; Host Details : local host is: "<SliderAM
> HOST>/<slider AM Host IP>"; destination host is: "<RM2 HOST>":23130;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at com.sun.proxy.$Proxy23.allocate(Unknown Source)
>         at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>         at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>         at com.sun.proxy.$Proxy24.allocate(Unknown Source)
>         at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
>         at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
> Caused by: java.io.IOException:
> org.apache.hadoop.security.AccessControlException: Client cannot
> authenticate via:[TOKEN]
>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>         at
> org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:645)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:733)
>         at
> org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1525)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1442)
>         ... 12 more
> Caused by: org.apache.hadoop.security.AccessControlException: Client
> cannot authenticate via:[TOKEN]
>         at
> org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:172)
>         at
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:396)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:555)
>         at
> org.apache.hadoop.ipc.Client$Connection.access$1800(Client.java:370)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:725)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:721)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>         at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
>         ... 15 more
> 2016-07-25 19:06:48,089 [AMRM Heartbeater thread] INFO
>  client.ConfiguredRMFailoverProxyProvider - Failing over to rm1
>