Posted to dev@lucene.apache.org by "Varun Thacker (JIRA)" <ji...@apache.org> on 2017/11/01 07:11:00 UTC

[jira] [Created] (SOLR-11590) Synchronize ZK connect/disconnect handling

Varun Thacker created SOLR-11590:
------------------------------------

             Summary: Synchronize ZK connect/disconnect handling
                 Key: SOLR-11590
                 URL: https://issues.apache.org/jira/browse/SOLR-11590
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Varun Thacker
            Priority: Major


Here are two disconnect/reconnect sequences:

{code}
1. 2017-10-31T08:34:23.106-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:Disconnected type:None path:null path:null type:None
2. 2017-10-31T08:34:23.106-0700 zkClient has disconnected
3. 2017-10-31T08:34:23.107-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
{code}

{code}
1. 2017-10-31T08:36:46.541-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:Disconnected type:None path:null path:null type:None
2. 2017-10-31T08:36:46.549-0700 Watcher org.apache.solr.common.cloud.ConnectionManager@1579ca20 name:ZooKeeperConnection Watcher:host:port got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
3. 2017-10-31T08:36:46.563-0700 zkClient has disconnected
{code}

In the first disconnect the sequence is: get the disconnect watcher event, execute the disconnect code, execute the connect code.
In the second disconnect the sequence is: get the disconnect watcher event, execute the connect code, execute the disconnect code.

In the second sequence of events, if the JVM has leader replicas then all updates start failing with "Cannot talk to ZooKeeper - Updates are disabled." This starts exactly 27 seconds later (the ZK client timeout is 30s and 90% of 30 is 27, which is when the code considers the session likely expired). There are no leadership changes, since there was no session expiry. Unless you restart the node, all updates to the system continue to fail.
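
To make the 27-second figure concrete, here is a minimal sketch, assuming the likely-expired threshold is 90% of the ZK client timeout as described above. The class and method names are hypothetical and are not taken from Solr.

```java
// Hypothetical helper for illustration only; not Solr's actual implementation.
final class LikelyExpiredCheck {

    /** Threshold after which a disconnected client is treated as likely expired: 90% of the timeout. */
    static long likelyExpiredAfterMillis(long zkClientTimeoutMillis) {
        // Integer arithmetic avoids floating-point rounding.
        return zkClientTimeoutMillis * 9 / 10;
    }

    /**
     * Updates would be rejected once the connected flag has been false for at
     * least the threshold, regardless of whether the ZK session is really alive.
     */
    static boolean isLikelyExpired(boolean connectedFlag,
                                   long millisSinceDisconnect,
                                   long zkClientTimeoutMillis) {
        return !connectedFlag
                && millisSinceDisconnect >= likelyExpiredAfterMillis(zkClientTimeoutMillis);
    }
}
```

With a 30s timeout, {{LikelyExpiredCheck.likelyExpiredAfterMillis(30_000)}} gives 27000 ms, which matches the delay seen before updates start failing.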

These log lines are from Solr 5.3, where the WatchedEvent was still being logged at INFO.

Based on the log ordering, we process the connect code first and then the disconnect code, i.e. out of order. The connection is active but the flag is not set, so after 27 seconds {{zkCheck}} starts complaining that the connection is likely expired.
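
The effect of the ordering can be shown with a small model. This is only a sketch: {{ConnectionFlagModel}} and its methods are hypothetical stand-ins for the connect and disconnect code paths, not Solr's classes.

```java
// Hypothetical model of the connection flag; names are illustrative, not Solr's.
final class ConnectionFlagModel {
    private boolean connected = true;

    void runConnectCode() { connected = true; }

    void runDisconnectCode() { connected = false; }

    boolean isConnected() { return connected; }

    /** First log excerpt: the disconnect code runs, then the connect code. */
    static boolean replayInOrder() {
        ConnectionFlagModel model = new ConnectionFlagModel();
        model.runDisconnectCode();
        model.runConnectCode();
        return model.isConnected();
    }

    /** Second log excerpt: the connect code runs, then the disconnect code. */
    static boolean replayOutOfOrder() {
        ConnectionFlagModel model = new ConnectionFlagModel();
        model.runConnectCode();
        model.runDisconnectCode();
        return model.isConnected();
    }
}
```

In this model {{replayInOrder()}} leaves the flag set, while {{replayOutOfOrder()}} leaves it cleared even though the last event ZooKeeper delivered was SyncConnected.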

A related Jira is SOLR-5721

ZK gives us ordered watch events ( https://zookeeper.apache.org/doc/r3.4.8/zookeeperProgrammers.html#sc_WatchGuarantees ), but from what I understand Solr can still process them out of order. We could take a lock and synchronize {{ConnectionManager#connected}} and {{ConnectionManager#disconnected}}.

Would that be the right approach to take?
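
A rough sketch of what the locking proposal could look like, under stated assumptions: {{SynchronizedConnectionState}} is a hypothetical class, not the real {{ConnectionManager}}, and it only models the flag plus the time of the last disconnect.

```java
// Hypothetical sketch; class and field names are illustrative, not Solr's ConnectionManager.
final class SynchronizedConnectionState {
    private boolean connected;
    private long disconnectedAtMillis = -1; // -1 means "not currently disconnected"

    /** Connect handler: runs under the same monitor as disconnected(). */
    synchronized void connected() {
        connected = true;
        disconnectedAtMillis = -1;
    }

    /** Disconnect handler: updates the flag and the timestamp atomically. */
    synchronized void disconnected(long nowMillis) {
        connected = false;
        disconnectedAtMillis = nowMillis;
    }

    synchronized boolean isConnected() {
        return connected;
    }

    /** Milliseconds spent disconnected, or 0 while connected. */
    synchronized long millisDisconnected(long nowMillis) {
        return connected ? 0 : nowMillis - disconnectedAtMillis;
    }
}
```

Note that the lock only makes each handler atomic with respect to the other and to readers. It does not reorder calls, so the handlers would still need to be invoked in the order in which ZooKeeper delivered the events.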


