You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ambari.apache.org by "Dmitry Lysnichenko (JIRA)" <ji...@apache.org> on 2019/07/25 15:03:00 UTC

[jira] [Updated] (AMBARI-25343) Storm Service Check fails during Rolling Upgrade due to Nimbus leader selection

     [ https://issues.apache.org/jira/browse/AMBARI-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Lysnichenko updated AMBARI-25343:
----------------------------------------
    Description: 
Storm Service Check fails during Rolling Upgrade

{code}
1099 [main] WARN  o.a.s.h.DefaultShader - Relocating backtype/storm/task/ShellBolt to org/apache/storm/task/ShellBolt in storm/starter/WordCountTopology$SplitSentence.class. please modify your code to use the new namespace
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Running: /usr/lib/jvm/java-openjdk/bin/java -Ddaemon.name= -Dstorm.options= -Dstorm.home=/usr/hdp/3.1.4.0-228/storm -Dstorm.log.dir=/grid/0/log/storm -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /usr/hdp/3.1.4.0-228/storm/*:/usr/hdp/3.1.4.0-228/storm/lib/*:/usr/hdp/3.1.4.0-228/storm/extlib/*:/usr/hdp/current/storm-supervisor/external/storm-autocreds/*:/tmp/ed6b6fbca91811e9a9970242ac1b7581.jar:/usr/hdp/current/storm-supervisor/conf:/usr/hdp/3.1.4.0-228/storm/bin -Dstorm.jar=/tmp/ed6b6fbca91811e9a9970242ac1b7581.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} storm.starter.WordCountTopology WordCountid1bac8175_date001819
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
745  [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.2.1.3.1.4.0-228 old null
768  [main] INFO  o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -5297560719882079603:-7456204693431648577
934  [main] INFO  o.a.s.m.n.Login - successfully logged in.
1006 [main] INFO  o.a.s.u.NimbusClient - Found leader nimbus : ctr-e139-1542663976389-174824-01-000005.hwx.site:6627
1012 [main] INFO  o.a.s.m.n.Login - successfully logged in.
1102 [main] INFO  o.a.s.s.a.AuthUtils - Got AutoCreds []
1107 [main] INFO  o.a.s.m.n.Login - successfully logged in.
1128 [main] INFO  o.a.s.u.NimbusClient - Found leader nimbus : ctr-e139-1542663976389-174824-01-000005.hwx.site:6627
1134 [main] INFO  o.a.s.m.n.Login - successfully logged in.
1188 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - jars...
1189 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - artifacts...
1189 [main] INFO  o.a.s.StormSubmitter - Dependency Blob keys - jars : [] / artifacts : []
1275 [main] INFO  o.a.s.StormSubmitter - Uploading topology jar /tmp/ed6b6fbca91811e9a9970242ac1b7581.jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-0728f59f-a74c-4edf-99a7-c77c3be6c7a3.jar
1314 [main] INFO  o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-0728f59f-a74c-4edf-99a7-c77c3be6c7a3.jar
1314 [main] INFO  o.a.s.StormSubmitter - Submitting topology WordCountid1bac8175_date001819 in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5297560719882079603:-7456204693431648577","topology.workers":3,"topology.debug":true}
1314 [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.2.1.3.1.4.0-228 old 1.2.1.3.1.4.0-228
Exception in thread "main" java.lang.RuntimeException: org.apache.storm.thrift.TApplicationException: Internal error processing submitTopology
	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
	at storm.starter.WordCountTopology.main(WordCountTopology.java:77)
Caused by: org.apache.storm.thrift.TApplicationException: Internal error processing submitTopology
	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
	at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:279)
	at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:263)
	at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
	... 3 more

Command failed after 1 tries
{code}

STR: Deploy cluster with Ambari2.7.1.0 and HDP3.0.1.0
Upgrade ambari to 2.7.4.0-96 (5273932c0b8269aa46027356cfb61ef8a5347625)
Start RU to HDP-3.1.4.0-228

at nimbus logs (both nodes): there was some time (around 1 min) where both nodes were not a leader, and topology submission happened at that time.

Nimbus in node5 logged the first log message (Ranger plugin) at 2019-07-18 04:59:24.694, and nimbus in node6 logged same at 2019-07-18 04:59:10.717.

Both were started as “Not a leader” as it has to execute its own startup process. Since they’re not a leader, they added themselves to leader lock on zookeeper and waited for gaining leadership. (Node5: 2019-07-18 04:59:28.769, Node6: 2019-07-18 04:59:13.676)

And topology submission happened at the time where both hadn’t gain leadership. (Node5: 2019-07-18 05:00:25.020, Node6: 2019-07-18 05:00:02.609) Both would fail to process topology submission as they’re not a leader.

Node6 finally gain leadership at 2019-07-18 05:00:27.026. After this time Storm can properly receive topology submission.

Btw, the client side thrift error message seems to be misleading, as it doesn’t show actual cause in Nimbus log: 

{code}java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='host5', port=6627, isLeader=true}{code}

(though isLeader=true is incorrect too… anyway “not a leader” is the thing)

 We can’t ensure one nimbus must be gained leadership at any time. In H/A environment, client side also has to tolerate failure on “no leader”. We will add logic to retry

  was:
Storm Service Check fails during Rolling Upgrade

{code}
1099 [main] WARN  o.a.s.h.DefaultShader - Relocating backtype/storm/task/ShellBolt to org/apache/storm/task/ShellBolt in storm/starter/WordCountTopology$SplitSentence.class. please modify your code to use the new namespace
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Running: /usr/lib/jvm/java-openjdk/bin/java -Ddaemon.name= -Dstorm.options= -Dstorm.home=/usr/hdp/3.1.4.0-228/storm -Dstorm.log.dir=/grid/0/log/storm -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /usr/hdp/3.1.4.0-228/storm/*:/usr/hdp/3.1.4.0-228/storm/lib/*:/usr/hdp/3.1.4.0-228/storm/extlib/*:/usr/hdp/current/storm-supervisor/external/storm-autocreds/*:/tmp/ed6b6fbca91811e9a9970242ac1b7581.jar:/usr/hdp/current/storm-supervisor/conf:/usr/hdp/3.1.4.0-228/storm/bin -Dstorm.jar=/tmp/ed6b6fbca91811e9a9970242ac1b7581.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} storm.starter.WordCountTopology WordCountid1bac8175_date001819
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
745  [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.2.1.3.1.4.0-228 old null
768  [main] INFO  o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -5297560719882079603:-7456204693431648577
934  [main] INFO  o.a.s.m.n.Login - successfully logged in.
1006 [main] INFO  o.a.s.u.NimbusClient - Found leader nimbus : ctr-e139-1542663976389-174824-01-000005.hwx.site:6627
1012 [main] INFO  o.a.s.m.n.Login - successfully logged in.
1102 [main] INFO  o.a.s.s.a.AuthUtils - Got AutoCreds []
1107 [main] INFO  o.a.s.m.n.Login - successfully logged in.
1128 [main] INFO  o.a.s.u.NimbusClient - Found leader nimbus : ctr-e139-1542663976389-174824-01-000005.hwx.site:6627
1134 [main] INFO  o.a.s.m.n.Login - successfully logged in.
1188 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - jars...
1189 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - artifacts...
1189 [main] INFO  o.a.s.StormSubmitter - Dependency Blob keys - jars : [] / artifacts : []
1275 [main] INFO  o.a.s.StormSubmitter - Uploading topology jar /tmp/ed6b6fbca91811e9a9970242ac1b7581.jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-0728f59f-a74c-4edf-99a7-c77c3be6c7a3.jar
1314 [main] INFO  o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-0728f59f-a74c-4edf-99a7-c77c3be6c7a3.jar
1314 [main] INFO  o.a.s.StormSubmitter - Submitting topology WordCountid1bac8175_date001819 in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5297560719882079603:-7456204693431648577","topology.workers":3,"topology.debug":true}
1314 [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.2.1.3.1.4.0-228 old 1.2.1.3.1.4.0-228
Exception in thread "main" java.lang.RuntimeException: org.apache.storm.thrift.TApplicationException: Internal error processing submitTopology
	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
	at storm.starter.WordCountTopology.main(WordCountTopology.java:77)
Caused by: org.apache.storm.thrift.TApplicationException: Internal error processing submitTopology
	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
	at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:279)
	at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:263)
	at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
	... 3 more

Command failed after 1 tries
{code}

STR: Deploy cluster with Ambari2.7.1.0 and HDP3.0.1.0
Upgrade ambari to 2.7.4.0-96 (5273932c0b8269aa46027356cfb61ef8a5347625)
Start RU to HDP-3.1.4.0-228

at nimbus logs (both nodes): there was some time (around 1 min) where both nodes were not a leader, and topology submission happened at that time.

Nimbus in node5 logged the first log message (Ranger plugin) at 2019-07-18 04:59:24.694, and nimbus in node6 logged same at 2019-07-18 04:59:10.717.

Both were started as “Not a leader” as it has to execute its own startup process. Since they’re not a leader, they added themselves to leader lock on zookeeper and waited for gaining leadership. (Node5: 2019-07-18 04:59:28.769, Node6: 2019-07-18 04:59:13.676)

And topology submission happened at the time where both hadn’t gain leadership. (Node5: 2019-07-18 05:00:25.020, Node6: 2019-07-18 05:00:02.609) Both would fail to process topology submission as they’re not a leader.

Node6 finally gain leadership at 2019-07-18 05:00:27.026. After this time Storm can properly receive topology submission.

Btw, the client side thrift error message seems to be misleading, as it doesn’t show actual cause in Nimbus log: 

{code}java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='ctr-e139-1542663976389-174824-01-000006.hwx.site', port=6627, isLeader=true}{code}

(though isLeader=true is incorrect too… anyway “not a leader” is the thing)

 We can’t ensure one nimbus must be gained leadership at any time. In H/A environment, client side also has to tolerate failure on “no leader”. We will add logic to retry


> Storm Service Check fails during Rolling Upgrade due to Nimbus leader selection
> -------------------------------------------------------------------------------
>
>                 Key: AMBARI-25343
>                 URL: https://issues.apache.org/jira/browse/AMBARI-25343
>             Project: Ambari
>          Issue Type: Bug
>    Affects Versions: 2.7.4
>            Reporter: Dmitry Lysnichenko
>            Assignee: Dmitry Lysnichenko
>            Priority: Major
>             Fix For: 2.7.4
>
>
> Storm Service Check fails during Rolling Upgrade
> {code}
> 1099 [main] WARN  o.a.s.h.DefaultShader - Relocating backtype/storm/task/ShellBolt to org/apache/storm/task/ShellBolt in storm/starter/WordCountTopology$SplitSentence.class. please modify your code to use the new namespace
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Running: /usr/lib/jvm/java-openjdk/bin/java -Ddaemon.name= -Dstorm.options= -Dstorm.home=/usr/hdp/3.1.4.0-228/storm -Dstorm.log.dir=/grid/0/log/storm -Djava.library.path=/usr/local/lib:/opt/local/lib:/usr/lib -Dstorm.conf.file= -cp /usr/hdp/3.1.4.0-228/storm/*:/usr/hdp/3.1.4.0-228/storm/lib/*:/usr/hdp/3.1.4.0-228/storm/extlib/*:/usr/hdp/current/storm-supervisor/external/storm-autocreds/*:/tmp/ed6b6fbca91811e9a9970242ac1b7581.jar:/usr/hdp/current/storm-supervisor/conf:/usr/hdp/3.1.4.0-228/storm/bin -Dstorm.jar=/tmp/ed6b6fbca91811e9a9970242ac1b7581.jar -Dstorm.dependency.jars= -Dstorm.dependency.artifacts={} storm.starter.WordCountTopology WordCountid1bac8175_date001819
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/lib/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-228/storm/contrib/storm-autocreds/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> 745  [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.2.1.3.1.4.0-228 old null
> 768  [main] INFO  o.a.s.StormSubmitter - Generated ZooKeeper secret payload for MD5-digest: -5297560719882079603:-7456204693431648577
> 934  [main] INFO  o.a.s.m.n.Login - successfully logged in.
> 1006 [main] INFO  o.a.s.u.NimbusClient - Found leader nimbus : ctr-e139-1542663976389-174824-01-000005.hwx.site:6627
> 1012 [main] INFO  o.a.s.m.n.Login - successfully logged in.
> 1102 [main] INFO  o.a.s.s.a.AuthUtils - Got AutoCreds []
> 1107 [main] INFO  o.a.s.m.n.Login - successfully logged in.
> 1128 [main] INFO  o.a.s.u.NimbusClient - Found leader nimbus : ctr-e139-1542663976389-174824-01-000005.hwx.site:6627
> 1134 [main] INFO  o.a.s.m.n.Login - successfully logged in.
> 1188 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - jars...
> 1189 [main] INFO  o.a.s.StormSubmitter - Uploading dependencies - artifacts...
> 1189 [main] INFO  o.a.s.StormSubmitter - Dependency Blob keys - jars : [] / artifacts : []
> 1275 [main] INFO  o.a.s.StormSubmitter - Uploading topology jar /tmp/ed6b6fbca91811e9a9970242ac1b7581.jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-0728f59f-a74c-4edf-99a7-c77c3be6c7a3.jar
> 1314 [main] INFO  o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /hadoop/storm/nimbus/inbox/stormjar-0728f59f-a74c-4edf-99a7-c77c3be6c7a3.jar
> 1314 [main] INFO  o.a.s.StormSubmitter - Submitting topology WordCountid1bac8175_date001819 in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5297560719882079603:-7456204693431648577","topology.workers":3,"topology.debug":true}
> 1314 [main] WARN  o.a.s.u.Utils - STORM-VERSION new 1.2.1.3.1.4.0-228 old 1.2.1.3.1.4.0-228
> Exception in thread "main" java.lang.RuntimeException: org.apache.storm.thrift.TApplicationException: Internal error processing submitTopology
> 	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
> 	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
> 	at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
> 	at storm.starter.WordCountTopology.main(WordCountTopology.java:77)
> Caused by: org.apache.storm.thrift.TApplicationException: Internal error processing submitTopology
> 	at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
> 	at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:279)
> 	at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:263)
> 	at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
> 	at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
> 	... 3 more
> Command failed after 1 tries
> {code}
> STR: Deploy cluster with Ambari2.7.1.0 and HDP3.0.1.0
> Upgrade ambari to 2.7.4.0-96 (5273932c0b8269aa46027356cfb61ef8a5347625)
> Start RU to HDP-3.1.4.0-228
> at nimbus logs (both nodes): there was some time (around 1 min) where both nodes were not a leader, and topology submission happened at that time.
> Nimbus in node5 logged the first log message (Ranger plugin) at 2019-07-18 04:59:24.694, and nimbus in node6 logged same at 2019-07-18 04:59:10.717.
> Both were started as “Not a leader” as it has to execute its own startup process. Since they’re not a leader, they added themselves to leader lock on zookeeper and waited for gaining leadership. (Node5: 2019-07-18 04:59:28.769, Node6: 2019-07-18 04:59:13.676)
> And topology submission happened at the time where both hadn’t gain leadership. (Node5: 2019-07-18 05:00:25.020, Node6: 2019-07-18 05:00:02.609) Both would fail to process topology submission as they’re not a leader.
> Node6 finally gain leadership at 2019-07-18 05:00:27.026. After this time Storm can properly receive topology submission.
> Btw, the client side thrift error message seems to be misleading, as it doesn’t show actual cause in Nimbus log: 
> {code}java.lang.RuntimeException: not a leader, current leader is NimbusInfo{host='host5', port=6627, isLeader=true}{code}
> (though isLeader=true is incorrect too… anyway “not a leader” is the thing)
>  We can’t ensure one nimbus must be gained leadership at any time. In H/A environment, client side also has to tolerate failure on “no leader”. We will add logic to retry



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)