Posted to issues@ignite.apache.org by "Mikhail Petrov (Jira)" <ji...@apache.org> on 2021/03/10 18:53:00 UTC

[jira] [Created] (IGNITE-14301) Authentication processor can hang all user management operations after server node reconnect

Mikhail Petrov created IGNITE-14301:
---------------------------------------

Summary: Authentication processor can hang all user management operations after server node reconnect
Key: IGNITE-14301
URL: https://issues.apache.org/jira/browse/IGNITE-14301
Project: Ignite
Issue Type: Bug
Reporter: Mikhail Petrov

First of all, look at the test AuthenticationProcessorNodeRestartTest#testConcurrentAddUpdateRemoveNodeRestartServer - [TC history|https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=-8873434544416175780&tab=testDetails]

The first problem with this test is that user management operations (add/update/remove) generate too many discovery messages. As a result, the discovery custom message history is too small to reliably filter out duplicate custom messages that can be resent across the ring while a server node reconnects. The test then fails because user management operations are applied more than once (see GridDiscoveryManager#discoCacheHist, the IGNITE_DISCOVERY_HISTORY_SIZE system property, and ServerImpl.RingMessageWorker#sendMessageAcrossRing).
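
To illustrate why a bounded history lets a duplicate through, here is a minimal sketch in plain Java. It is a toy model, not Ignite code: the class and method names (BoundedHistorySketch, MessageHistory, markSeen, deliver) are hypothetical, and the real discovery history may be organized differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a bounded "already seen" message history. Hypothetical, not Ignite code. */
public class BoundedHistorySketch {
    /** Remembers at most maxSize message IDs and evicts the oldest one first. */
    static final class MessageHistory {
        private final int maxSize;
        private final Deque<Long> order = new ArrayDeque<>();
        private final Set<Long> seen = new HashSet<>();

        MessageHistory(int maxSize) {
            this.maxSize = maxSize;
        }

        /** Returns true if the ID is new (and records it), false if it is a known duplicate. */
        boolean markSeen(long id) {
            if (seen.contains(id))
                return false;

            order.addLast(id);
            seen.add(id);

            // Once the bound is exceeded, the oldest ID is forgotten.
            if (order.size() > maxSize)
                seen.remove(order.removeFirst());

            return true;
        }
    }

    /** Delivers the IDs in order and returns those that were actually applied. */
    static List<Long> deliver(int historySize, long... ids) {
        MessageHistory hist = new MessageHistory(historySize);
        List<Long> applied = new ArrayList<>();

        for (long id : ids) {
            if (hist.markSeen(id))
                applied.add(id);
        }

        return applied;
    }

    public static void main(String[] args) {
        // Message 1 is resent after three newer messages have passed through.
        System.out.println(deliver(2, 1, 2, 3, 4, 1)); // history too small: 1 is applied twice
        System.out.println(deliver(8, 1, 2, 3, 4, 1)); // history large enough: duplicate is skipped
    }
}
```

With a history of 2 entries, message 1 has already been evicted when it is resent, so it is applied a second time. With a history of 8 entries the duplicate is recognized and dropped. In the model this corresponds to raising the history size; in the test the analogous knob would presumably be the IGNITE_DISCOVERY_HISTORY_SIZE system property mentioned above.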

If the discovery history size is increased significantly, the test stops failing and starts hanging instead. The steps that lead to the hang:
1. A client node sends a UserProposedMessage across the ring while one server node is offline due to reconnect.
2. The alive server nodes update their local user lists and finish the operation.
3. The reconnected node joins the ring and receives the updated user list from the coordinator.
4. The reconnected node receives a duplicate of the UserProposedMessage that has already been handled by all other nodes. It handles the message, sends UserManagementOperationFinishedMessage to the coordinator, and starts waiting for the UserAcceptedMessage from it. But the coordinator has already finished this operation, so the thread that is responsible for user management operations on the reconnected node becomes blocked (see IgniteAuthenticationProcessor.UserOperationWorker#body).
5. The client node starts the next operation, which requires all alive nodes to respond with UserManagementOperationFinishedMessage. But the authentication thread of the reconnected node is blocked, so this operation can never complete.
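
Steps 4 and 5 can be modeled with a single-threaded worker that waits for an acknowledgement that never arrives. This is only a sketch under stated assumptions: the names BlockedWorkerSketch and nextOperationCompletes are hypothetical, a CountDownLatch stands in for the awaited UserAcceptedMessage, and the real UserOperationWorker logic is not reproduced here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of a single operation thread stuck on a missing acknowledgement. Not Ignite code. */
public class BlockedWorkerSketch {
    /**
     * Returns true if the operation queued after the duplicated one completes within the timeout.
     *
     * @param coordinatorAccepts whether the (modeled) coordinator ever sends the acceptance.
     * @param timeoutMs how long to wait for the next operation.
     */
    static boolean nextOperationCompletes(boolean coordinatorAccepts, long timeoutMs) {
        // One thread handles all user management operations on the node, in order.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch accepted = new CountDownLatch(1);

        try {
            // Duplicated operation: the worker has reported "finished" and now waits for acceptance.
            worker.submit(() -> {
                accepted.await();
                return null;
            });

            if (coordinatorAccepts)
                accepted.countDown();

            // Next operation from the client: it can only run after the previous task returns.
            Future<String> next = worker.submit(() -> "finished");

            try {
                next.get(timeoutMs, TimeUnit.MILLISECONDS);
                return true;
            }
            catch (TimeoutException e) {
                return false;
            }
            catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        finally {
            // Interrupt the stuck task so the demo itself does not hang.
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(nextOperationCompletes(false, 200));  // acceptance never arrives
        System.out.println(nextOperationCompletes(true, 2000));  // acceptance arrives
    }
}
```

When the acceptance never arrives, the first task occupies the only worker thread indefinitely and the next operation times out, which mirrors the hang described in step 5.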
