Posted to issues@nifi.apache.org by GitBox <gi...@apache.org> on 2022/12/15 17:04:49 UTC

[GitHub] [nifi] markap14 commented on pull request #6779: NIFI-10975 Add Kubernetes Leader Election and State Provider

markap14 commented on PR #6779:
URL: https://github.com/apache/nifi/pull/6779#issuecomment-1353410957

   So I created a two-node NiFi cluster using GKE to test this. On startup, things work well: both nodes join the cluster, and I can see that state is getting stored and recovered properly using ListGCSBucket. If I then disconnect the node that is the Primary Node/Cluster Coordinator, I see that the other node is elected. But if I then reconnect the disconnected node, it gets into a bad state.
   Running `bin/nifi.sh diagnostics diag1.txt` on both nodes shows that each node believes it is both the Cluster Coordinator AND the Primary Node.
   Looking at the logs of the previously disconnected node, I see:
   ```
   2022-12-15 16:50:42,065 ERROR [KubernetesLeaderElectionManager] i.f.k.c.e.leaderelection.LeaderElector Exception occurred while releasing lock 'LeaseLock: nifi - cluster-coordinator (10.31.1.4:4423)'
   io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
           at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:102)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$0(LeaderElector.java:92)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.run(LeaderElector.java:70)
           at org.apache.nifi.kubernetes.leader.election.command.LeaderElectionCommand.run(LeaderElectionCommand.java:78)
           at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
            at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at java.base/java.lang.Thread.run(Unknown Source)
    Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.31.128.1/apis/coordination.k8s.io/v1/namespaces/nifi/leases/cluster-coordinator. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "cluster-coordinator": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=cluster-coordinator, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "cluster-coordinator": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
            at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:517)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:551)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:347)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:680)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:167)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:172)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:113)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:41)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:1043)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:88)
           at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:100)
            ... 19 common frames omitted
    Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.31.128.1/apis/coordination.k8s.io/v1/namespaces/nifi/leases/cluster-coordinator. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "cluster-coordinator": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=cluster-coordinator, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "cluster-coordinator": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
            at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:709)
            at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:689)
            at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:640)
            at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:576)
            at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
            at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
            at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
            at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$retryWithExponentialBackoff$2(OperationSupport.java:618)
            at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
            at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
            at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
            at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
            at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$4.onResponse(OkHttpClientImpl.java:277)
            at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
            ... 3 common frames omitted
   2022-12-15 16:50:42,066 ERROR [KubernetesLeaderElectionManager] i.f.k.c.e.leaderelection.LeaderElector Exception occurred while releasing lock 'LeaseLock: nifi - primary-node (10.31.1.4:4423)'
   io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.stopLeading(LeaderElector.java:120)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$1(LeaderElector.java:94)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$null$0(LeaderElector.java:92)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.cancel(Unknown Source)
           at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.run(LeaderElector.java:70)
           at org.apache.nifi.kubernetes.leader.election.command.LeaderElectionCommand.run(LeaderElectionCommand.java:78)
           at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
           at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
           at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
           at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
           at java.base/java.lang.Thread.run(Unknown Source)
   Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.31.128.1/apis/coordination.k8s.io/v1/namespaces/nifi/leases/primary-node. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=primary-node, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
           at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:517)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:551)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:347)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:680)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:167)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:172)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:113)
           at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:41)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:1043)
           at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.replace(BaseOperation.java:88)
           at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:100)
           ... 19 common frames omitted
   Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.31.128.1/apis/coordination.k8s.io/v1/namespaces/nifi/leases/primary-node. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=primary-node, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "primary-node": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:709)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:689)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:640)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:576)
           at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
           at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$retryWithExponentialBackoff$2(OperationSupport.java:618)
           at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
           at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
           at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$4.onResponse(OkHttpClientImpl.java:277)
           at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:519)
           ... 3 common frames omitted
   ```
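
   For what it's worth, the 409 Conflict here is the API server's optimistic-locking check: the PUT that releases the lock is sending a stale `resourceVersion` for the Lease, presumably because the lease was renewed or re-acquired after this node last read it. A conflict-tolerant release would re-read the Lease immediately before clearing the holder, and retry on Conflict. Just to illustrate the idea, a minimal sketch assuming the fabric8 client; the `LeaseReleaser` helper and the retry count are mine, not something in this PR:
   ```java
   import io.fabric8.kubernetes.api.model.coordination.v1.Lease;
   import io.fabric8.kubernetes.client.KubernetesClient;
   import io.fabric8.kubernetes.client.KubernetesClientException;
   import io.fabric8.kubernetes.client.dsl.Resource;

   // Hypothetical helper, not part of this PR: release a coordination.k8s.io Lease
   // without tripping the API server's optimistic-locking check on resourceVersion.
   class LeaseReleaser {
       static void release(final KubernetesClient client, final String namespace,
               final String name, final String holderId) {
           final Resource<Lease> leaseResource = client.leases().inNamespace(namespace).withName(name);
           for (int attempt = 0; attempt < 3; attempt++) {
               final Lease lease = leaseResource.get(); // re-read to pick up the latest resourceVersion
               if (lease == null || !holderId.equals(lease.getSpec().getHolderIdentity())) {
                   return; // nothing to release, or another node already took over
               }
               lease.getSpec().setHolderIdentity(null); // relinquish ownership
               try {
                   leaseResource.replace(lease); // PUT carries the fresh resourceVersion from the re-read
                   return;
               } catch (final KubernetesClientException e) {
                   if (e.getCode() != 409) {
                       throw e; // only a Conflict (409) is worth retrying
                   }
               }
           }
       }
   }
   ```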
   
   So it looks like the node is not properly relinquishing ownership of the lease. I presume this is what causes both nodes to believe that they are the coordinator/primary node.
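
   To help narrow it down, here is roughly how I understand the election wiring that produces the cancel path in the trace (`LeaderElector.java:92-94`). This is only a sketch under my assumptions, not the PR's actual `KubernetesLeaderElectionManager` code; the durations and the identity string are made up:
   ```java
   import java.time.Duration;
   import io.fabric8.kubernetes.client.KubernetesClient;
   import io.fabric8.kubernetes.client.KubernetesClientBuilder;
   import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
   import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfig;
   import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
   import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector;
   import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;

   // Sketch only: how a fabric8 LeaseLock election with release-on-cancel is typically wired.
   class ElectionSketch {
       public static void main(final String[] args) {
           try (KubernetesClient client = new KubernetesClientBuilder().build()) {
               final LeaderElectionConfig config = new LeaderElectionConfigBuilder()
                       .withLock(new LeaseLock("nifi", "cluster-coordinator", "10.31.1.4:4423"))
                       .withLeaseDuration(Duration.ofSeconds(15))
                       .withRenewDeadline(Duration.ofSeconds(10))
                       .withRetryPeriod(Duration.ofSeconds(2))
                       .withReleaseOnCancel(true) // the cancel path in the trace: stopLeading() -> LeaseLock.update()
                       .withLeaderCallbacks(new LeaderCallbacks(
                               () -> System.out.println("started leading"),
                               () -> System.out.println("stopped leading"),
                               newLeader -> System.out.println("new leader: " + newLeader)))
                       .build();
               final LeaderElector elector = client.leaderElector().withConfig(config).build();
               elector.run(); // blocks; on cancellation the elector attempts to release the lock
           }
       }
   }
   ```
   If the release-on-cancel update fails like it does above, the old holder identity stays on the Lease until the lease duration expires, and a node that never observes its own "stopped leading" transition could keep acting as leader, which would match the diagnostics output I'm seeing.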

