Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/02/27 18:08:04 UTC

[GitHub] [lucene-solr] madrob opened a new pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

madrob opened a new pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297
 
 
   # Description
   
   There are lots of places in the code where we poll and sleep rather than using proper ZK callbacks.
   
   # Solution
   
   Replace the poll-and-sleep loops with `waitForState` calls on the cluster state.
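   
   For illustration only, a minimal sketch of the before/after shape of these changes (the class below is a placeholder, the 320-second timeout mirrors the existing retry count, and the real changes are in ZkController and OverseerCollectionMessageHandler):
   
```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrException.ErrorCode;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.ZkStateReader;

class WaitForStateSketch {
  private final ZkStateReader zkStateReader;

  WaitForStateSketch(ZkStateReader zkStateReader) {
    this.zkStateReader = zkStateReader;
  }

  // Before: poll the cached cluster state and sleep between checks.
  void waitForSlicePolling(String collectionName, String sliceName) throws InterruptedException {
    int retryCount = 320;
    while (retryCount-- > 0) {
      DocCollection c = zkStateReader.getClusterState().getCollectionOrNull(collectionName);
      if (c != null && c.getSlice(sliceName) != null) {
        return;
      }
      Thread.sleep(1000);
    }
    throw new SolrException(ErrorCode.SERVER_ERROR, "Could not find slice " + sliceName);
  }

  // After: block on a cluster-state predicate; ZkStateReader's ZK watch wakes us up
  // as soon as the collection changes, instead of waking every second to re-check.
  void waitForSlice(String collectionName, String sliceName) {
    try {
      zkStateReader.waitForState(collectionName, 320, TimeUnit.SECONDS,
          c -> c != null && c.getSlice(sliceName) != null);
    } catch (TimeoutException | InterruptedException e) {
      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for new slice", e);
    }
  }
}
```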



[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385768968
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
 
 Review comment:
   This thread will block on the wait call, so I don't think we're introducing any new races. It was always possible that two threads could be trying to access the CoreDescriptor, I think.



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385440678
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -672,35 +636,35 @@ void cleanupCollection(String collectionName, NamedList results) throws Exceptio
     commandMap.get(DELETE).call(zkStateReader.getClusterState(), new ZkNodeProps(props), results);
   }
 
-  Map<String, Replica> waitToSeeReplicasInState(String collectionName, Collection<String> coreNames) throws InterruptedException {
-    assert coreNames.size() > 0;
-    Map<String, Replica> result = new HashMap<>();
-    TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
-    while (true) {
-      DocCollection coll = zkStateReader.getClusterState().getCollection(collectionName);
-      for (String coreName : coreNames) {
-        if (result.containsKey(coreName)) continue;
-        for (Slice slice : coll.getSlices()) {
-          for (Replica replica : slice.getReplicas()) {
-            if (coreName.equals(replica.getStr(ZkStateReader.CORE_NAME_PROP))) {
-              result.put(coreName, replica);
-              break;
+  Map<String, Replica> waitToSeeReplicasInState(String collectionName, Collection<String> coreNames) {
+    final Map<String, Replica> result = new HashMap<>();
+    int timeout = Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120); // could be a big cluster
+    try {
+      zkStateReader.waitForState(collectionName, timeout, TimeUnit.SECONDS, c -> {
+        // todo this is ugly, but I'm not sure there is a better way to fix it?
 
 Review comment:
   Can't we iterate the shards/replicas, and for each one check `coreNames.contains(replica.getStr(ZkStateReader.CORE_NAME_PROP))`? Maybe make a set with all the elements in coreNames, remove them as you find them, and break when it's empty? I guess it depends on how big coreNames will be.
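   
   Roughly what I have in mind, as an untested sketch (it assumes the surrounding `zkStateReader` field and the usual imports):
   
```java
// assumes: java.util.*, java.util.concurrent.TimeUnit, java.util.concurrent.TimeoutException,
// and org.apache.solr.common.cloud.{Replica, Slice, ZkStateReader}
Map<String, Replica> waitToSeeReplicasInState(String collectionName, Collection<String> coreNames)
    throws InterruptedException, TimeoutException {
  final Map<String, Replica> result = new HashMap<>();
  // Track which core names we still need to see; drain the set as replicas show up.
  final Set<String> remaining = new HashSet<>(coreNames);
  int timeout = Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120);
  zkStateReader.waitForState(collectionName, timeout, TimeUnit.SECONDS, c -> {
    if (c == null) {
      return false;
    }
    for (Slice slice : c.getSlices()) {
      for (Replica replica : slice.getReplicas()) {
        String coreName = replica.getStr(ZkStateReader.CORE_NAME_PROP);
        if (remaining.remove(coreName)) {
          result.put(coreName, replica);
        }
      }
    }
    return remaining.isEmpty();
  });
  return result;
}
```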



[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385779735
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -672,35 +636,35 @@ void cleanupCollection(String collectionName, NamedList results) throws Exceptio
     commandMap.get(DELETE).call(zkStateReader.getClusterState(), new ZkNodeProps(props), results);
   }
 
-  Map<String, Replica> waitToSeeReplicasInState(String collectionName, Collection<String> coreNames) throws InterruptedException {
-    assert coreNames.size() > 0;
-    Map<String, Replica> result = new HashMap<>();
-    TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
-    while (true) {
-      DocCollection coll = zkStateReader.getClusterState().getCollection(collectionName);
-      for (String coreName : coreNames) {
-        if (result.containsKey(coreName)) continue;
-        for (Slice slice : coll.getSlices()) {
-          for (Replica replica : slice.getReplicas()) {
-            if (coreName.equals(replica.getStr(ZkStateReader.CORE_NAME_PROP))) {
-              result.put(coreName, replica);
-              break;
+  Map<String, Replica> waitToSeeReplicasInState(String collectionName, Collection<String> coreNames) {
+    final Map<String, Replica> result = new HashMap<>();
+    int timeout = Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120); // could be a big cluster
+    try {
+      zkStateReader.waitForState(collectionName, timeout, TimeUnit.SECONDS, c -> {
+        // todo this is ugly, but I'm not sure there is a better way to fix it?
 
 Review comment:
   I think I found something that makes sense.



[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r386726758
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
 
 Review comment:
   Will do!



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385439336
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -630,34 +596,32 @@ private void modifyCollection(ClusterState clusterState, ZkNodeProps message, Na
 
     overseer.offerStateUpdate(Utils.toJSON(message));
 
-    TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS, timeSource);
-    boolean areChangesVisible = true;
-    while (!timeout.hasTimedOut()) {
-      DocCollection collection = cloudManager.getClusterStateProvider().getClusterState().getCollection(collectionName);
-      areChangesVisible = true;
-      for (Map.Entry<String,Object> updateEntry : message.getProperties().entrySet()) {
-        String updateKey = updateEntry.getKey();
-
-        if (!updateKey.equals(ZkStateReader.COLLECTION_PROP)
-            && !updateKey.equals(Overseer.QUEUE_OPERATION)
-            && updateEntry.getValue() != null // handled below in a separate conditional
-            && !updateEntry.getValue().equals(collection.get(updateKey))) {
-          areChangesVisible = false;
-          break;
+    try {
+      zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, c -> {
+        if (c == null) {
+          return false;
         }
+        for (Map.Entry<String,Object> updateEntry : message.getProperties().entrySet()) {
+          String updateKey = updateEntry.getKey();
+
+          if (!updateKey.equals(ZkStateReader.COLLECTION_PROP)
+                  && !updateKey.equals(Overseer.QUEUE_OPERATION)
+                  && updateEntry.getValue() != null // handled below in a separate conditional
+                  && !updateEntry.getValue().equals(c.get(updateKey))) {
+            return false;
+          }
 
-        if (updateEntry.getValue() == null && collection.containsKey(updateKey)) {
-          areChangesVisible = false;
-          break;
+          if (updateEntry.getValue() == null && c.containsKey(updateKey)) {
+            return false;
+          }
         }
-      }
-      if (areChangesVisible) break;
-      timeout.sleep(100);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      log.debug("modifyCollection(ClusterState=" + clusterState + ", ZkNodeProps=" + message + ", NamedList=" + results + ")", e);
 
 Review comment:
   reset interruption
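   
   i.e., split the multi-catch, something along these lines (sketch only, reusing the existing log line):
   
```java
} catch (InterruptedException e) {
  Thread.currentThread().interrupt(); // restore the interrupt status before moving on
  log.debug("modifyCollection(ClusterState=" + clusterState + ", ZkNodeProps=" + message + ", NamedList=" + results + ")", e);
} catch (TimeoutException e) {
  log.debug("modifyCollection(ClusterState=" + clusterState + ", ZkNodeProps=" + message + ", NamedList=" + results + ")", e);
}
```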



[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r386726709
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
 
 Review comment:
   Dug into this deeper, and I believe that the latch in `waitForState` will guarantee data visibility.
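   
   For reference, the guarantee relied on here is just the standard latch happens-before rule; a tiny standalone sketch (nothing Solr-specific):
   
```java
import java.util.concurrent.CountDownLatch;

// Writes made before latch.countDown() are visible to a thread returning from latch.await().
public class LatchVisibilitySketch {
  private static String coreNodeName; // plain field on purpose, no volatile
  private static final CountDownLatch latch = new CountDownLatch(1);

  public static void main(String[] args) throws InterruptedException {
    Thread watcher = new Thread(() -> {
      coreNodeName = "core_node1"; // write happens-before countDown()
      latch.countDown();
    });
    watcher.start();
    latch.await();                    // returns after countDown(), so the write above is visible
    System.out.println(coreNodeName); // prints "core_node1"
  }
}
```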



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385438210
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -526,57 +518,31 @@ static UpdateResponse softCommit(String url) throws SolrServerException, IOExcep
   }
 
   String waitForCoreNodeName(String collectionName, String msgNodeName, String msgCore) {
-    int retryCount = 320;
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState().getCollectionOrNull(collectionName);
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        Map<String,Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            if (nodeName.equals(msgNodeName) && core.equals(msgCore)) {
-              return replica.getName();
-            }
-          }
+    AtomicReference<String> coreNodeName = new AtomicReference<>();
+    try {
+      zkStateReader.waitForState(collectionName, 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, msgNodeName, msgCore);
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        coreNodeName.set(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
     }
-    throw new SolrException(ErrorCode.SERVER_ERROR, "Could not find coreNodeName");
+    return coreNodeName.get();
   }
 
-  ClusterState waitForNewShard(String collectionName, String sliceName) throws KeeperException, InterruptedException {
+  ClusterState waitForNewShard(String collectionName, String sliceName) {
     log.debug("Waiting for slice {} of collection {} to be available", sliceName, collectionName);
-    RTimer timer = new RTimer();
-    int retryCount = 320;
-    while (retryCount-- > 0) {
-      ClusterState clusterState = zkStateReader.getClusterState();
-      DocCollection collection = clusterState.getCollection(collectionName);
-
-      if (collection == null) {
-        throw new SolrException(ErrorCode.SERVER_ERROR,
-            "Unable to find collection: " + collectionName + " in clusterstate");
-      }
-      Slice slice = collection.getSlice(sliceName);
-      if (slice != null) {
-        log.debug("Waited for {}ms for slice {} of collection {} to be available",
-            timer.getTime(), sliceName, collectionName);
-        return clusterState;
-      }
-      Thread.sleep(1000);
+    try {
+      zkStateReader.waitForState(collectionName, 320, TimeUnit.SECONDS, c -> c != null && c.getSlice(sliceName) != null);
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for new slice", e);
     }
-    throw new SolrException(ErrorCode.SERVER_ERROR,
-        "Could not find new slice " + sliceName + " in collection " + collectionName
-            + " even after waiting for " + timer.getTime() + "ms"
-    );
+    // nocommit is there a race condition here since we're not returning the same clusterstate we inspected?
 
 Review comment:
   Isn't that the case with most of these methods? While the predicate is being executed, for example, there is no watch in ZooKeeper AFAICT, unless we go back and write to ZooKeeper and use the version.



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385823773
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
 
 Review comment:
   +1 for the changes. Just wanted to make sure they were intentional. Regarding the InterruptedException handling, this LGTM, but if you want to keep the multi-catch you could use `SolrZkClient.checkInterrupted(Throwable e)`.
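   
   i.e., roughly this (sketch; `checkInterrupted` re-interrupts the current thread when the cause is an InterruptedException and returns the throwable):
   
```java
} catch (TimeoutException | InterruptedException e) {
  throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state",
      SolrZkClient.checkInterrupted(e));
}
```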



[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385779370
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -630,34 +596,32 @@ private void modifyCollection(ClusterState clusterState, ZkNodeProps message, Na
 
     overseer.offerStateUpdate(Utils.toJSON(message));
 
-    TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS, timeSource);
-    boolean areChangesVisible = true;
-    while (!timeout.hasTimedOut()) {
-      DocCollection collection = cloudManager.getClusterStateProvider().getClusterState().getCollection(collectionName);
-      areChangesVisible = true;
-      for (Map.Entry<String,Object> updateEntry : message.getProperties().entrySet()) {
-        String updateKey = updateEntry.getKey();
-
-        if (!updateKey.equals(ZkStateReader.COLLECTION_PROP)
-            && !updateKey.equals(Overseer.QUEUE_OPERATION)
-            && updateEntry.getValue() != null // handled below in a separate conditional
-            && !updateEntry.getValue().equals(collection.get(updateKey))) {
-          areChangesVisible = false;
-          break;
+    try {
+      zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, c -> {
+        if (c == null) {
 
 Review comment:
   It could happen if there is a concurrent delete



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385434556
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
 
 Review comment:
   Do we need any synchronization, since this will now be running on a different thread?



[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r389193738
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,39 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
 
 Review comment:
   In general, I think it's good to have knobs, but there's definitely the possibility of having too many things available to configure and overwhelming operators. Can you describe what conditions would lead to wanting to tweak this?



[GitHub] [lucene-solr] madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385776252
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
 
 Review comment:
   A couple of logic changes here, yes. 1) Before, we would continue to retry on interrupt, i.e. the interruption would only count against the current attempt, not the whole method. That's probably wrong. 2) We wouldn't fail if we never saw the expected state. Also probably wrong, and I suspect we would just end up failing later when this was missing?



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385437194
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -526,57 +518,31 @@ static UpdateResponse softCommit(String url) throws SolrServerException, IOExcep
   }
 
   String waitForCoreNodeName(String collectionName, String msgNodeName, String msgCore) {
-    int retryCount = 320;
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState().getCollectionOrNull(collectionName);
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        Map<String,Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            if (nodeName.equals(msgNodeName) && core.equals(msgCore)) {
-              return replica.getName();
-            }
-          }
+    AtomicReference<String> coreNodeName = new AtomicReference<>();
+    try {
+      zkStateReader.waitForState(collectionName, 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, msgNodeName, msgCore);
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        coreNodeName.set(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
     }
-    throw new SolrException(ErrorCode.SERVER_ERROR, "Could not find coreNodeName");
+    return coreNodeName.get();
   }
 
-  ClusterState waitForNewShard(String collectionName, String sliceName) throws KeeperException, InterruptedException {
+  ClusterState waitForNewShard(String collectionName, String sliceName) {
     log.debug("Waiting for slice {} of collection {} to be available", sliceName, collectionName);
-    RTimer timer = new RTimer();
-    int retryCount = 320;
-    while (retryCount-- > 0) {
-      ClusterState clusterState = zkStateReader.getClusterState();
-      DocCollection collection = clusterState.getCollection(collectionName);
-
-      if (collection == null) {
-        throw new SolrException(ErrorCode.SERVER_ERROR,
-            "Unable to find collection: " + collectionName + " in clusterstate");
-      }
-      Slice slice = collection.getSlice(sliceName);
-      if (slice != null) {
-        log.debug("Waited for {}ms for slice {} of collection {} to be available",
-            timer.getTime(), sliceName, collectionName);
-        return clusterState;
-      }
-      Thread.sleep(1000);
+    try {
+      zkStateReader.waitForState(collectionName, 320, TimeUnit.SECONDS, c -> c != null && c.getSlice(sliceName) != null);
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for new slice", e);
 
 Review comment:
   interruption?



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385436856
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -526,57 +518,31 @@ static UpdateResponse softCommit(String url) throws SolrServerException, IOExcep
   }
 
   String waitForCoreNodeName(String collectionName, String msgNodeName, String msgCore) {
-    int retryCount = 320;
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState().getCollectionOrNull(collectionName);
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        Map<String,Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            if (nodeName.equals(msgNodeName) && core.equals(msgCore)) {
-              return replica.getName();
-            }
-          }
+    AtomicReference<String> coreNodeName = new AtomicReference<>();
+    try {
+      zkStateReader.waitForState(collectionName, 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, msgNodeName, msgCore);
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        coreNodeName.set(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
 
 Review comment:
   `Thread.currentThread().interrupt()`?



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385439085
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -630,34 +596,32 @@ private void modifyCollection(ClusterState clusterState, ZkNodeProps message, Na
 
     overseer.offerStateUpdate(Utils.toJSON(message));
 
-    TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS, timeSource);
-    boolean areChangesVisible = true;
-    while (!timeout.hasTimedOut()) {
-      DocCollection collection = cloudManager.getClusterStateProvider().getClusterState().getCollection(collectionName);
-      areChangesVisible = true;
-      for (Map.Entry<String,Object> updateEntry : message.getProperties().entrySet()) {
-        String updateKey = updateEntry.getKey();
-
-        if (!updateKey.equals(ZkStateReader.COLLECTION_PROP)
-            && !updateKey.equals(Overseer.QUEUE_OPERATION)
-            && updateEntry.getValue() != null // handled below in a separate conditional
-            && !updateEntry.getValue().equals(collection.get(updateKey))) {
-          areChangesVisible = false;
-          break;
+    try {
+      zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, c -> {
+        if (c == null) {
+          return false;
         }
+        for (Map.Entry<String,Object> updateEntry : message.getProperties().entrySet()) {
+          String updateKey = updateEntry.getKey();
+
+          if (!updateKey.equals(ZkStateReader.COLLECTION_PROP)
+                  && !updateKey.equals(Overseer.QUEUE_OPERATION)
+                  && updateEntry.getValue() != null // handled below in a separate conditional
+                  && !updateEntry.getValue().equals(c.get(updateKey))) {
+            return false;
+          }
 
-        if (updateEntry.getValue() == null && collection.containsKey(updateKey)) {
-          areChangesVisible = false;
-          break;
+          if (updateEntry.getValue() == null && c.containsKey(updateKey)) {
+            return false;
+          }
         }
-      }
-      if (areChangesVisible) break;
-      timeout.sleep(100);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      log.debug("modifyCollection(ClusterState=" + clusterState + ", ZkNodeProps=" + message + ", NamedList=" + results + ")", e);
 
 Review comment:
   Can we use parametrized logging here?
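   
   e.g., something like this (sketch; SLF4J treats a trailing throwable after the placeholders as the exception to log):
   
```java
log.debug("modifyCollection(ClusterState={}, ZkNodeProps={}, NamedList={})",
    clusterState, message, results, e);
```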



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385822483
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
 
 Review comment:
   OK. I was thinking more about data visibility than race conditions.



[GitHub] [lucene-solr] beettlle commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
beettlle commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r389132693
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,39 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
 
 Review comment:
   If this change is being made, should the number of retries be configurable?  This hardcoded value seems to be used a lot in the code.  



[GitHub] [lucene-solr] beettlle commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
beettlle commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r389208349
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,39 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
 
 Review comment:
   Agreed about having too many settings; we're already drowning in them.
   
   Looking back, it looks like the number was added as part of SOLR-9140, and there's no comment on where the "320" came from. As well, there's another retry number [here](https://github.com/apache/lucene-solr/pull/1297/files#diff-d5e1be02f6f0c397e18380598aa62b3dR476) of "30", but no idea why. So we already have two different retry numbers.
   
   If the numbers come from empirical experiments, then I agree with them being constants, but because they seem arbitrary, they look like good candidates for per-application tuning.
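   
   If we did make it tunable, the pattern already used elsewhere in this PR would look roughly like this (the property name below is hypothetical, not an agreed-upon setting):
   
```java
// Hypothetical property, mirroring solr.waitToSeeReplicasInStateTimeoutSeconds above.
int timeoutSeconds = Integer.getInteger("solr.waitForCoreNodeNameTimeoutSeconds", 320);
zkStateReader.waitForState(descriptor.getCollectionName(), timeoutSeconds, TimeUnit.SECONDS, c -> {
  String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
  if (name == null) {
    return false;
  }
  descriptor.getCloudDescriptor().setCoreNodeName(name);
  return true;
});
```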



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385436558
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -471,29 +471,21 @@ void checkResults(String label, NamedList<Object> results, boolean failureIsFata
   private void migrateStateFormat(ClusterState state, ZkNodeProps message, NamedList results) throws Exception {
     final String collectionName = message.getStr(COLLECTION_PROP);
 
-    boolean firstLoop = true;
-    // wait for a while until the state format changes
-    TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS, timeSource);
-    while (! timeout.hasTimedOut()) {
-      DocCollection collection = zkStateReader.getClusterState().getCollection(collectionName);
-      if (collection == null) {
-        throw new SolrException(ErrorCode.BAD_REQUEST, "Collection: " + collectionName + " not found");
-      }
-      if (collection.getStateFormat() == 2) {
-        // Done.
-        results.add("success", new SimpleOrderedMap<>());
-        return;
-      }
+    ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION, MIGRATESTATEFORMAT.toLower(), COLLECTION_PROP, collectionName);
+    overseer.offerStateUpdate(Utils.toJSON(m));
 
-      if (firstLoop) {
-        // Actually queue the migration command.
-        firstLoop = false;
-        ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION, MIGRATESTATEFORMAT.toLower(), COLLECTION_PROP, collectionName);
-        overseer.offerStateUpdate(Utils.toJSON(m));
-      }
-      timeout.sleep(100);
+    try {
+      zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, c -> {
 
 Review comment:
   Same comment/questions as with the other methods.
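
For reference, since the diff hunk is cut off at the lambda, the predicate here presumably only needs to check the state format. A sketch of how the whole replacement might read, with the error handling assumed (note that the old loop's immediate BAD_REQUEST for a missing collection becomes a timeout in this sketch):

```java
// Sketch based on the removed poll loop: queue the migration, then wait until
// the collection reports state format 2 before recording success.
ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION, MIGRATESTATEFORMAT.toLower(),
    COLLECTION_PROP, collectionName);
overseer.offerStateUpdate(Utils.toJSON(m));

try {
  zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS,
      c -> c != null && c.getStateFormat() == 2);
  results.add("success", new SimpleOrderedMap<>());
} catch (TimeoutException e) {
  // note: the old loop threw BAD_REQUEST when the collection was missing; here that case times out
  throw new SolrException(ErrorCode.SERVER_ERROR,
      "Could not migrate state format for collection: " + collectionName, e);
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
  throw new SolrException(ErrorCode.SERVER_ERROR,
      "Interrupted while migrating state format for collection: " + collectionName, e);
}
```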



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385434650
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
 
 Review comment:
   We still want to restore the interrupt status, right?
   Also, am I reading this right? Before, we'd just continue running normally even after a "timeout", while now we throw an exception. Sounds like a good change, but is that intended?
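
For illustration, a minimal sketch of splitting the catch so the interrupt status is restored while still failing fast on timeout (the message strings are just placeholders):

```java
try {
  // zkStateReader.waitForState(...) exactly as in the patch
} catch (TimeoutException e) {
  throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
} catch (InterruptedException e) {
  Thread.currentThread().interrupt(); // restore the interrupt status before propagating
  throw new SolrException(ErrorCode.SERVER_ERROR, "Interrupted waiting for collection state", e);
}
```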



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385435944
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
     }
+    getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
   }
 
-  private void waitForShardId(CoreDescriptor cd) {
+  private void waitForShardId(final CoreDescriptor cd) {
     log.debug("waiting to find shard id in clusterstate for " + cd.getName());
-    int retryCount = 320;
-    while (retryCount-- > 0) {
-      final String shardId = zkStateReader.getClusterState().getShardId(cd.getCollectionName(), getNodeName(), cd.getName());
-      if (shardId != null) {
-        cd.getCloudDescriptor().setShardId(shardId);
-        return;
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+    try {
+      zkStateReader.waitForState(cd.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        if (c == null) return false;
+        final String shardId = c.getShardId(getNodeName(), cd.getName());
+        if (shardId != null) {
+          cd.getCloudDescriptor().setShardId(shardId);
 
 Review comment:
   Do we need any synchronization?
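
One way to sidestep the synchronization question would be to keep the predicate side-effect free and apply the result on the calling thread once the wait returns. A sketch only, not what the patch currently does (it needs `java.util.concurrent.atomic.AtomicReference`; interrupt handling is elided here, as discussed earlier in this thread):

```java
// Sketch: compute the shard id inside the predicate, but only mutate the
// CoreDescriptor on the calling thread after waitForState returns.
AtomicReference<String> shardIdRef = new AtomicReference<>();
try {
  zkStateReader.waitForState(cd.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
    if (c == null) return false;
    String shardId = c.getShardId(getNodeName(), cd.getName());
    if (shardId == null) return false;
    shardIdRef.set(shardId);
    return true;
  });
} catch (TimeoutException | InterruptedException e) {
  throw new SolrException(ErrorCode.SERVER_ERROR,
      "Could not get shard id for core: " + cd.getName(), e);
}
cd.getCloudDescriptor().setShardId(shardIdRef.get());
```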



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385438713
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/api/collections/OverseerCollectionMessageHandler.java
 ##########
 @@ -630,34 +596,32 @@ private void modifyCollection(ClusterState clusterState, ZkNodeProps message, Na
 
     overseer.offerStateUpdate(Utils.toJSON(message));
 
-    TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS, timeSource);
-    boolean areChangesVisible = true;
-    while (!timeout.hasTimedOut()) {
-      DocCollection collection = cloudManager.getClusterStateProvider().getClusterState().getCollection(collectionName);
-      areChangesVisible = true;
-      for (Map.Entry<String,Object> updateEntry : message.getProperties().entrySet()) {
-        String updateKey = updateEntry.getKey();
-
-        if (!updateKey.equals(ZkStateReader.COLLECTION_PROP)
-            && !updateKey.equals(Overseer.QUEUE_OPERATION)
-            && updateEntry.getValue() != null // handled below in a separate conditional
-            && !updateEntry.getValue().equals(collection.get(updateKey))) {
-          areChangesVisible = false;
-          break;
+    try {
+      zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, c -> {
+        if (c == null) {
 
 Review comment:
   Can this happen?
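
If it can, keeping a null guard costs little; a sketch of a null-tolerant predicate, to be passed inside the existing try/catch (the null-value special case the old loop handled separately is omitted here for brevity):

```java
// Sketch only: returning false for a null DocCollection keeps waiting (and
// eventually times out) instead of risking an NPE if the collection briefly
// isn't visible in the local cluster state.
zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, c ->
    c != null && message.getProperties().entrySet().stream()
        .filter(e -> !e.getKey().equals(ZkStateReader.COLLECTION_PROP)
                  && !e.getKey().equals(Overseer.QUEUE_OPERATION)
                  && e.getValue() != null)
        .allMatch(e -> e.getValue().equals(c.get(e.getKey()))));
```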



[GitHub] [lucene-solr] tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits

Posted by GitBox <gi...@apache.org>.
tflobbe commented on a change in pull request #1297: SOLR-14253 Replace various sleep calls with ZK waits
URL: https://github.com/apache/lucene-solr/pull/1297#discussion_r385436094
 
 

 ##########
 File path: solr/core/src/java/org/apache/solr/cloud/ZkController.java
 ##########
 @@ -1684,58 +1685,37 @@ private void doGetShardIdAndNodeNameProcess(CoreDescriptor cd) {
   }
 
   private void waitForCoreNodeName(CoreDescriptor descriptor) {
-    int retryCount = 320;
-    log.debug("look for our core node name");
-    while (retryCount-- > 0) {
-      final DocCollection docCollection = zkStateReader.getClusterState()
-          .getCollectionOrNull(descriptor.getCloudDescriptor().getCollectionName());
-      if (docCollection != null && docCollection.getSlicesMap() != null) {
-        final Map<String, Slice> slicesMap = docCollection.getSlicesMap();
-        for (Slice slice : slicesMap.values()) {
-          for (Replica replica : slice.getReplicas()) {
-            // TODO: for really large clusters, we could 'index' on this
-
-            String nodeName = replica.getStr(ZkStateReader.NODE_NAME_PROP);
-            String core = replica.getStr(ZkStateReader.CORE_NAME_PROP);
-
-            String msgNodeName = getNodeName();
-            String msgCore = descriptor.getName();
-
-            if (msgNodeName.equals(nodeName) && core.equals(msgCore)) {
-              descriptor.getCloudDescriptor()
-                  .setCoreNodeName(replica.getName());
-              getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
-              return;
-            }
-          }
+    log.debug("waitForCoreNodeName >>> look for our core node name");
+    try {
+      zkStateReader.waitForState(descriptor.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        String name = ClusterStateMutator.getAssignedCoreNodeName(c, getNodeName(), descriptor.getName());
+        if (name == null) {
+          return false;
         }
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+        descriptor.getCloudDescriptor().setCoreNodeName(name);
+        return true;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Timeout waiting for collection state", e);
     }
+    getCoreContainer().getCoresLocator().persist(getCoreContainer(), descriptor);
   }
 
-  private void waitForShardId(CoreDescriptor cd) {
+  private void waitForShardId(final CoreDescriptor cd) {
     log.debug("waiting to find shard id in clusterstate for " + cd.getName());
-    int retryCount = 320;
-    while (retryCount-- > 0) {
-      final String shardId = zkStateReader.getClusterState().getShardId(cd.getCollectionName(), getNodeName(), cd.getName());
-      if (shardId != null) {
-        cd.getCloudDescriptor().setShardId(shardId);
-        return;
-      }
-      try {
-        Thread.sleep(1000);
-      } catch (InterruptedException e) {
-        Thread.currentThread().interrupt();
-      }
+    try {
+      zkStateReader.waitForState(cd.getCollectionName(), 320, TimeUnit.SECONDS, c -> {
+        if (c == null) return false;
+        final String shardId = c.getShardId(getNodeName(), cd.getName());
+        if (shardId != null) {
+          cd.getCloudDescriptor().setShardId(shardId);
+          return true;
+        }
+        return false;
+      });
+    } catch (TimeoutException | InterruptedException e) {
+      throw new SolrException(ErrorCode.SERVER_ERROR, "Could not get shard id for core: " + cd.getName());
 
 Review comment:
   Same as before, we should probably restore the interrupt status. Also, did you intentionally not pass the caught exception as the cause?
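
A sketch of that catch block with the caught exception attached as the cause and the interrupt status restored (message text illustrative):

```java
try {
  // zkStateReader.waitForState(...) as in the patch
} catch (TimeoutException e) {
  // attach the cause so the timeout shows up in the stack trace
  throw new SolrException(ErrorCode.SERVER_ERROR,
      "Could not get shard id for core: " + cd.getName(), e);
} catch (InterruptedException e) {
  Thread.currentThread().interrupt(); // restore the interrupt status
  throw new SolrException(ErrorCode.SERVER_ERROR,
      "Interrupted while waiting for shard id of core: " + cd.getName(), e);
}
```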
