You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/08 00:51:19 UTC

[GitHub] [lucene-solr] noblepaul opened a new pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

noblepaul opened a new pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318


   WIP


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] murblanc commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

murblanc commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r573680423



##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -264,8 +299,8 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
         log.info("Cleaned up artifacts for failed create collection for [{}]", collectionName);
         throw new SolrException(ErrorCode.BAD_REQUEST, "Underlying core creation failed while creating collection: " + collectionName);
       } else {
-        log.debug("Finished create command on all shards for collection: {}", collectionName);
 
+        log.debug("Finished create command on all shards for collection: {}", collectionName);

Review comment:
       When we get here for a PRS collection, the `ZkStateReader` cluster state for the collection is fine, as are the ZK structures, but the Overseer cluster state updater does not know about the collection. Any new operation on the collection through the Collection API would fail in the cluster state updater.

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -149,27 +143,38 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       }
 
       createCollectionZkNode(stateManager, collectionName, collectionParams);
-      
-      ocmh.overseer.offerStateUpdate(Utils.toJSON(message));
-
-      // wait for a while until we see the collection
-      TimeOut waitUntil = new TimeOut(30, TimeUnit.SECONDS, timeSource);
-      boolean created = false;
-      while (! waitUntil.hasTimedOut()) {
-        waitUntil.sleep(100);
-        created = ocmh.cloudManager.getClusterStateProvider().getClusterState().hasCollection(collectionName);
-        if(created) break;
-      }
-      if (!created) {
-        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Could not fully create collection: " + collectionName);
+
+      if(isPrs) {
+        ZkWriteCommand command = new ClusterStateMutator(ocmh.cloudManager).createCollection(clusterState, message);
+        byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, command.collection));
+        ocmh.zkStateReader.getZkClient().create(collectionPath, data, CreateMode.PERSISTENT, true);
+        clusterState = clusterState.copyWith(collectionName, command.collection);
+        newColl = command.collection;
+      } else {
+        ocmh.overseer.offerStateUpdate(Utils.toJSON(message));
+
+        // wait for a while until we see the collection
+        TimeOut waitUntil = new TimeOut(30, TimeUnit.SECONDS, timeSource);
+        boolean created = false;
+        while (!waitUntil.hasTimedOut()) {
+          waitUntil.sleep(100);
+          created = ocmh.cloudManager.getClusterStateProvider().getClusterState().hasCollection(collectionName);
+          if (created) break;
+        }
+        if (!created) {
+          throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Could not fully create collection: " + collectionName);
+        }
+
+        // refresh cluster state
+        clusterState = ocmh.cloudManager.getClusterStateProvider().getClusterState();

Review comment:
       This call does nothing and can be removed.
   The command already has the instance of `ClusterState` returned  by the call (it is passed to it as a `call` parameter by OCMH).
   The only way the clusterState is refreshed is when the watchers on the node get called and fetch the latest updates from ZK (that in this case were done by the Overseer, and we know the watchers fired and updated our local state because we've waited for it in the `while` loop ~10 lines above).

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -149,27 +143,38 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       }
 
       createCollectionZkNode(stateManager, collectionName, collectionParams);
-      
-      ocmh.overseer.offerStateUpdate(Utils.toJSON(message));
-
-      // wait for a while until we see the collection
-      TimeOut waitUntil = new TimeOut(30, TimeUnit.SECONDS, timeSource);
-      boolean created = false;
-      while (! waitUntil.hasTimedOut()) {
-        waitUntil.sleep(100);
-        created = ocmh.cloudManager.getClusterStateProvider().getClusterState().hasCollection(collectionName);
-        if(created) break;
-      }
-      if (!created) {
-        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Could not fully create collection: " + collectionName);
+
+      if(isPrs) {
+        ZkWriteCommand command = new ClusterStateMutator(ocmh.cloudManager).createCollection(clusterState, message);

Review comment:
       Calling this method here (or more specifically using its output in the `create()` call below) breaks the cluster state updater contract in Overseer.
   This call updates a `state.json` from a thread other than the cluster state updater thread in Overseer, the cluster state updater will not know about the new Json being written (the fact that this thread running this method is executing on the Overseer node is irrelevant, except passing stats, cluster state updater and Collection API execution are totally independent and could be running on different nodes).
   
   Overseer has two copies of the cluster state: the cluster state maintained in `ZkStateReader` like any node has (this is the state passed to Collection API calls), and another copy of the state maintained in `ZkStateWriter` which initially is obtained from `ZkStateReader` but quickly diverges and then used as the "source of truth" by the cluster state updater that keeps writing back that state to Zookeeper for the whole life of the Overseer on the node. The cluster updater state has no watches is only updated by the cluster state updater when processing messages and then the updated value written to ZK.
   
   To achieve the effect of dealing with the Per Replica States directly from this thread rather than from the ClusterStateUpdater single thread, the state.json would have to be created by the Overseer (i.e. through a classic `ocmh.overseer.offerStateUpdate(Utils.toJSON(message));`) and once done the PRS could be created from here. This is complicated to do because of the way `ZkStateWriter` is implemented to write the `state.json` as well as handle the PRS operations in the same method (when it calls `cmd.ops.persist` in `writePendingUpdates()`) and would require quite some refactoring.

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -213,7 +218,16 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
             ZkStateReader.NODE_NAME_PROP, nodeName,
             ZkStateReader.REPLICA_TYPE, replicaPosition.type.name(),
             CommonAdminParams.WAIT_FOR_FINAL_STATE, Boolean.toString(waitForFinalState));
-        ocmh.overseer.offerStateUpdate(Utils.toJSON(props));
+        if(isPrs) {
+          ZkWriteCommand command = new SliceMutator(ocmh.cloudManager).addReplica(clusterState, props);
+          byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, command.collection));
+//        log.info("collection updated : {}", new String(data, StandardCharsets.UTF_8));
+          ocmh.zkStateReader.getZkClient().setData(collectionPath, data, true);

Review comment:
       This call bypasses the cluster state updater to update the `state.json`. If it tried to do so through the Overseer cluster state updater, the update would fail as the collection is unknown there.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] chatman commented on pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

chatman commented on pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#issuecomment-776854325


   > I have very serious doubts about the proposed change.
   > Forcing a cluster state updater refresh in that way is going to force it to re-read all relevant ZK data through forciblyRefreshAllClusterStateSlow() as it does during Overseer start up, stalling processing of the state update queue in the meantime. And this will happen each time a new PRS collection is created.
   
   Agree, this is incredibly inefficient. I'm reverting the commit for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] madrob commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

madrob commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r573116634



##########
File path: solr/solrj/src/java/org/apache/solr/common/cloud/PerReplicaStates.java
##########
@@ -92,6 +94,17 @@ public PerReplicaStates(String path, int cversion, List<String> states) {
 
   }
 
+  /** Check and return if all replicas are ACTIVE
+   */
+  public boolean allActive() {
+    if (this.allActive != null) return allActive;
+    boolean[] result = new boolean[]{true};

Review comment:
       Agree.

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -264,8 +264,11 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
         log.info("Cleaned up artifacts for failed create collection for [{}]", collectionName);
         throw new SolrException(ErrorCode.BAD_REQUEST, "Underlying core creation failed while creating collection: " + collectionName);
       } else {
+        //we want to wait till all the replicas are ACTIVE for PRS collections because
+          ocmh.zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, (liveNodes, c) ->
+                  c.getPerReplicaStates() == null || // this is not a PRS collection

Review comment:
       I agree with Ilan here, let's skip the extra watcher and call to ZK.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] noblepaul commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

noblepaul commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r574240410



##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -213,7 +218,16 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
             ZkStateReader.NODE_NAME_PROP, nodeName,
             ZkStateReader.REPLICA_TYPE, replicaPosition.type.name(),
             CommonAdminParams.WAIT_FOR_FINAL_STATE, Boolean.toString(waitForFinalState));
-        ocmh.overseer.offerStateUpdate(Utils.toJSON(props));
+        if(isPrs) {
+          ZkWriteCommand command = new SliceMutator(ocmh.cloudManager).addReplica(clusterState, props);
+          byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, command.collection));
+//        log.info("collection updated : {}", new String(data, StandardCharsets.UTF_8));
+          ocmh.zkStateReader.getZkClient().setData(collectionPath, data, true);

Review comment:
       yes 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] murblanc commented on pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

murblanc commented on pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#issuecomment-775852524


   > Error is a timeout from `CollectionsHandler` having waited 45 seconds 
   
   I take that back @noblepaul . I did the test wrong and just did it again and it passes. Timing for the collection creation (11x11=121 replicas on 3 nodes) is similar with or without PRS at about 45 seconds.
   
   I can do more testing later (more concurrent threads, more and smaller collections). Note I did put out a few numbers on PRS (not with the patch in this PR though), see [this comment](https://issues.apache.org/jira/browse/SOLR-15146?focusedCommentId=17281460&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17281460).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] chatman commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

chatman commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r575312927



##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -256,6 +280,23 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
       @SuppressWarnings({"rawtypes"})
       boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+      if(isPrs) {
+        TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+        PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        while (!timeout.hasTimedOut()) {
+          if(prs.allActive()) break;
+          Thread.sleep(100);
+          prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        }
+        if (prs.allActive()) {
+          // we have successfully found all replicas to be ACTIVE
+          // Now ask Overseer to fetch the latest state of collection
+          // from ZK
+          ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
+        } else {
+          failure = true;
+        }
+      }
       if (failure) {
         // Let's cleanup as we hit an exception
         // We shouldn't be passing 'results' here for the cleanup as the response would then contain 'success'

Review comment:
       ^ My above comment was based on the 8x change, and seems like this change got missed when porting them over to this PR (for master). I'll update this branch and bring it up to sync with 8x soon.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] murblanc commented on pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

murblanc commented on pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#issuecomment-776840944

> Thanks for your review, @murblanc. I learned a lot from your detailed review comment.

Thank you!

> I've added a commit to let the overseer thread refresh the view of the cluster state immediately after the collection is created. Please review. CC @noblepaul.

I have very serious doubts about the proposed change.

Forcing a cluster state updater refresh in that way is going to force it to re-read all relevant ZK data through `forciblyRefreshAllClusterStateSlow()` as it does during Overseer start up, stalling processing of the state update queue in the meantime. And this will happen each time a new PRS collection is created.

And it is not even sufficient... If you look at `ClusterStateUpdater.run()`, that flag was really meant to be used once at the beginning. There are two processing loops, and the inner loop happens after the refresh and will not look at the flag, so some processing can take place before the flag is even considered.

So it may well be that the PRS collection is created, the flag is set on the Overseer, the creation call returns, client sends another Collection API related to the collection but `ClusterStateUpdater` hasn't updated yet from ZK (hasn't even started to update). This type of race would in theory be possible even without the inner loop given the batching done by the cluster state updater (+ the waiting time to see if new messages arrive in order to batch them together).

I really feel that a proper refactoring of what gets executed and where it gets executed is a better option (esp. for the master branch). Why not submit the `state.json` through the Overseer as happens for non PRS collections then create the PRS znodes in the Collection API call to not stall the cluster state updater for too long?

If you really have to force an Overseer `ClusterStateUpdater` refresh for that collection anyway, introducing a new type of cluster state updater message (in `OverseerAction`) called `LOADCOLLECTION` for example that triggers the loading of that single collection from ZK into the state used by the cluster state updater seems to me a better option. I don't like it either, but I hate it less. Note the client will see correct behavior only if it then submits the collection creation call and wait for it to return before submitting another Collection API call for that collection, to make sure that the `LOADCOLLECTION` gets executed first (with current Solr client can submit collection creation immediately followed by another Collection API call for that collection and they will be executed in order).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] madrob commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

madrob commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r573957903



##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -213,7 +218,16 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
             ZkStateReader.NODE_NAME_PROP, nodeName,
             ZkStateReader.REPLICA_TYPE, replicaPosition.type.name(),
             CommonAdminParams.WAIT_FOR_FINAL_STATE, Boolean.toString(waitForFinalState));
-        ocmh.overseer.offerStateUpdate(Utils.toJSON(props));
+        if(isPrs) {
+          ZkWriteCommand command = new SliceMutator(ocmh.cloudManager).addReplica(clusterState, props);
+          byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, command.collection));
+//        log.info("collection updated : {}", new String(data, StandardCharsets.UTF_8));
+          ocmh.zkStateReader.getZkClient().setData(collectionPath, data, true);

Review comment:
       We're doing this for each replica individually, which turns into a lot more writes. Previously the overseer would batch up the updates, I believe, and minimize the number of updates to state.json. Can we do a ZK multi-operation here instead of lots of small writes in a tight loop?

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -41,19 +42,11 @@
 import org.apache.solr.cloud.ZkController;
 import org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.ShardRequestTracker;
 import org.apache.solr.cloud.overseer.ClusterStateMutator;
+import org.apache.solr.cloud.overseer.SliceMutator;
+import org.apache.solr.cloud.overseer.ZkWriteCommand;
 import org.apache.solr.common.SolrException;
 import org.apache.solr.common.SolrException.ErrorCode;
-import org.apache.solr.common.cloud.Aliases;
-import org.apache.solr.common.cloud.ClusterState;
-import org.apache.solr.common.cloud.DocCollection;
-import org.apache.solr.common.cloud.DocRouter;
-import org.apache.solr.common.cloud.ImplicitDocRouter;
-import org.apache.solr.common.cloud.Replica;
-import org.apache.solr.common.cloud.ReplicaPosition;
-import org.apache.solr.common.cloud.ZkConfigManager;
-import org.apache.solr.common.cloud.ZkNodeProps;
-import org.apache.solr.common.cloud.ZkStateReader;
-import org.apache.solr.common.cloud.ZooKeeperException;
+import org.apache.solr.common.cloud.*;

Review comment:
       please please please configure your IDE to not do this. it's been an issue on other PRs as well.

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -213,7 +218,16 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
             ZkStateReader.NODE_NAME_PROP, nodeName,
             ZkStateReader.REPLICA_TYPE, replicaPosition.type.name(),
             CommonAdminParams.WAIT_FOR_FINAL_STATE, Boolean.toString(waitForFinalState));
-        ocmh.overseer.offerStateUpdate(Utils.toJSON(props));
+        if(isPrs) {

Review comment:
       This section looks like almost a duplicate of the other `if(isPrs)` block earlier. No concrete reason, but this doesn't seem right.

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -256,6 +279,18 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
       @SuppressWarnings({"rawtypes"})
       boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+      if(isPrs) {
+        TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+        PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        while (!timeout.hasTimedOut()) {
+          if(prs.allActive()) break;
+          Thread.sleep(100);

Review comment:
       Can we set a watch (maybe on an arbitrary down replica?) instead of doing a loop+sleep?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] noblepaul closed pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

noblepaul closed pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] chatman commented on pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

chatman commented on pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#issuecomment-776793606


   Thanks for your review, @murblanc. I learned a lot from your detailed review comment.
   
   I've added a commit to let the overseer thread refresh the view of the cluster state immediately after the collection is created. Please review. CC @noblepaul.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] chatman commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

chatman commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r575485111



##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -256,6 +280,23 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
       @SuppressWarnings({"rawtypes"})
       boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+      if(isPrs) {
+        TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+        PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        while (!timeout.hasTimedOut()) {
+          if(prs.allActive()) break;
+          Thread.sleep(100);
+          prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        }
+        if (prs.allActive()) {
+          // we have successfully found all replicas to be ACTIVE
+          // Now ask Overseer to fetch the latest state of collection
+          // from ZK
+          ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
+        } else {
+          failure = true;
+        }
+      }
       if (failure) {
         // Let's cleanup as we hit an exception
         // We shouldn't be passing 'results' here for the cleanup as the response would then contain 'success'

Review comment:
       This is resolved now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] chatman commented on pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

chatman commented on pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#issuecomment-778134747


   > Once the operational bugs are fixed, in my opinion we should consider this code as stop gap, and eventually redo things in a cleaner and more efficient way at some later time (ideally before Solr 9 is out). Maintaining this code as it is is not going to be fun.
   
   I agree. I think we should phase out non-PRS mode altogether soon, and then all of this won't be a burden to maintain.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] murblanc commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

murblanc commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r574498232



##########
File path: solr/core/src/java/org/apache/solr/cloud/RefreshCollectionMessage.java
##########
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.cloud;
+
+import org.apache.solr.common.cloud.ClusterState;
+import org.apache.solr.common.cloud.DocCollection;
+import org.apache.solr.common.cloud.ZkStateReader;
+import org.apache.zookeeper.KeeperException;
+import org.apache.zookeeper.data.Stat;
+
+/**Refresh the Cluster State for a given collection
+ *
+ */
+public class RefreshCollectionMessage implements Overseer.Message {
+    public final Operation operation;
+    public final String collection;
+
+    public RefreshCollectionMessage(String collection) {
+        this.operation = Operation.REFRESH_COLL;
+        this.collection = collection;
+    }
+
+    ClusterState run(ClusterState clusterState, Overseer overseer) throws InterruptedException, KeeperException {

Review comment:
       Can you please another name than `run()` that is usually related to a `Runnable` which is not the case here.

##########
File path: solr/core/src/java/org/apache/solr/cloud/Overseer.java
##########
@@ -1064,4 +1073,19 @@ public void offerStateUpdate(byte[] data) throws KeeperException, InterruptedExc
     getStateUpdateQueue().offer(data);
   }
 
+  /**Submit an intra-process message
+   * This will be picked up and executed when clusterstate updater thread runs
+   */
+  public void submit(Message message) {
+    unprocessedMessages.add(message);
+  }
+
+  public interface Message {

Review comment:
       No attributes of `Message` are used in the code. The queue could well be defined as `CopyOnWriteArrayList<Object>` and not a single line of code would have to change.
   
   I suggest we define the processing method in Message (`processMessage`? the one currently called `run`) and when getting a new message from the queue above in `Message m = unprocessedMessages.remove(0);` simply call `m.processMessage(clusterState, Overseer.this);`

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -213,7 +219,16 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
             ZkStateReader.NODE_NAME_PROP, nodeName,
             ZkStateReader.REPLICA_TYPE, replicaPosition.type.name(),
             CommonAdminParams.WAIT_FOR_FINAL_STATE, Boolean.toString(waitForFinalState));
-        ocmh.overseer.offerStateUpdate(Utils.toJSON(props));
+        if(isPrs) {
+          ZkWriteCommand command = new SliceMutator(ocmh.cloudManager).addReplica(clusterState, props);

Review comment:
       Note that each call to `addReplica` here will read back all existing znodes under `state.json`. This is therefore running in `n^2` (`n*(n+1)/2`) with `n` the number of replicas.
   (`fetch` in `PerReplicaStates` called from `SliceMutator.addReplica` will always see a different `cversion` value since the previous replica was added on the previous iteration and will therefore `getChildren`)

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -256,6 +280,23 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
       @SuppressWarnings({"rawtypes"})
       boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+      if(isPrs) {
+        TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+        PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        while (!timeout.hasTimedOut()) {
+          if(prs.allActive()) break;
+          Thread.sleep(100);
+          prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        }
+        if (prs.allActive()) {
+          // we have successfully found all replicas to be ACTIVE
+          // Now ask Overseer to fetch the latest state of collection
+          // from ZK
+          ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
+        } else {
+          failure = true;
+        }
+      }
       if (failure) {
         // Let's cleanup as we hit an exception
         // We shouldn't be passing 'results' here for the cleanup as the response would then contain 'success'

Review comment:
       In case of PRS collection (and failure, so we're executing here), the collection delete called from two lines below (cannot attach comment to the actual line, thanks GitHub) will fail because the `ClusterStateUpdater` will not know about the collection, it wasn't refreshed. The failure will happen while waiting for the state after the call to `ocmh.overseer.offerStateUpdate(Utils.toJSON(m));` in `DeleteCollectionCmd`.

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -246,7 +261,16 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       }
 
       // wait for all replica entries to be created

Review comment:
       move comment into the `else` bloc.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] chatman commented on pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

chatman commented on pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#issuecomment-778148443


   I tested this using the stress test suite, and performance for collection creations are at par with non-PRS collections and I saw no timeouts/failures etc. I think this change is good to go with.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] murblanc commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

murblanc commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r572075066



##########
File path: solr/solrj/src/java/org/apache/solr/common/cloud/PerReplicaStates.java
##########
@@ -92,6 +94,17 @@ public PerReplicaStates(String path, int cversion, List<String> states) {
 
   }
 
+  /** Check and return if all replicas are ACTIVE
+   */
+  public boolean allActive() {
+    if (this.allActive != null) return allActive;
+    boolean[] result = new boolean[]{true};

Review comment:
       `AtomicBoolean` would likely make code easier to read.

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -264,8 +264,11 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
         log.info("Cleaned up artifacts for failed create collection for [{}]", collectionName);
         throw new SolrException(ErrorCode.BAD_REQUEST, "Underlying core creation failed while creating collection: " + collectionName);
       } else {
+        //we want to wait till all the replicas are ACTIVE for PRS collections because
+          ocmh.zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, (liveNodes, c) ->
+                  c.getPerReplicaStates() == null || // this is not a PRS collection

Review comment:
       Shouldn't that test about collection not being PRS be made before the call to `waitForState` to just skip that call?

##########
File path: solr/solrj/src/java/org/apache/solr/common/cloud/PerReplicaStates.java
##########
@@ -92,6 +94,17 @@ public PerReplicaStates(String path, int cversion, List<String> states) {
 
   }
 
+  /** Check and return if all replicas are ACTIVE
+   */
+  public boolean allActive() {
+    if (this.allActive != null) return allActive;

Review comment:
       This method caches the value computed the first time it's called, then returns that value on each subsequent call.
   When used as part of the predicate in `CollectionStateWatcher`, even if the watch fires multiple times wouldn't it always return the first computed state (which is likely `false` as all replicas are not active initially)?

##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -264,8 +264,11 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
         log.info("Cleaned up artifacts for failed create collection for [{}]", collectionName);
         throw new SolrException(ErrorCode.BAD_REQUEST, "Underlying core creation failed while creating collection: " + collectionName);
       } else {
+        //we want to wait till all the replicas are ACTIVE for PRS collections because

Review comment:
       Truncated comment




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] noblepaul commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

noblepaul commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r572556450



##########
File path: solr/solrj/src/java/org/apache/solr/common/cloud/PerReplicaStates.java
##########
@@ -92,6 +94,17 @@ public PerReplicaStates(String path, int cversion, List<String> states) {
 
   }
 
+  /** Check and return if all replicas are ACTIVE
+   */
+  public boolean allActive() {
+    if (this.allActive != null) return allActive;

Review comment:
       This object is immutable. Everytime the state is modified, a new instance is created




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] noblepaul commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

noblepaul commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r572557297



##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -264,8 +264,11 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
         log.info("Cleaned up artifacts for failed create collection for [{}]", collectionName);
         throw new SolrException(ErrorCode.BAD_REQUEST, "Underlying core creation failed while creating collection: " + collectionName);
       } else {
+        //we want to wait till all the replicas are ACTIVE for PRS collections because
+          ocmh.zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, (liveNodes, c) ->
+                  c.getPerReplicaStates() == null || // this is not a PRS collection

Review comment:
       could have. But, it will return as soon as the cal is made.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] murblanc commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

murblanc commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r574498232



##########
File path: solr/core/src/java/org/apache/solr/cloud/RefreshCollectionMessage.java
##########
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.cloud;
+
+import org.apache.solr.common.cloud.ClusterState;
+import org.apache.solr.common.cloud.DocCollection;
+import org.apache.solr.common.cloud.ZkStateReader;
+import org.apache.zookeeper.KeeperException;
+import org.apache.zookeeper.data.Stat;
+
+/**Refresh the Cluster State for a given collection
+ *
+ */
+public class RefreshCollectionMessage implements Overseer.Message {
+    public final Operation operation;
+    public final String collection;
+
+    public RefreshCollectionMessage(String collection) {
+        this.operation = Operation.REFRESH_COLL;
+        this.collection = collection;
+    }
+
+    ClusterState run(ClusterState clusterState, Overseer overseer) throws InterruptedException, KeeperException {

Review comment:
       Can you please use another name than `run()` that is usually related to a `Runnable` which is not the case here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] chatman commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

Posted by GitBox <gi...@apache.org>.

chatman commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r575282512



##########
File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##########
@@ -256,6 +280,23 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
       @SuppressWarnings({"rawtypes"})
       boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+      if(isPrs) {
+        TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+        PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        while (!timeout.hasTimedOut()) {
+          if(prs.allActive()) break;
+          Thread.sleep(100);
+          prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        }
+        if (prs.allActive()) {
+          // we have successfully found all replicas to be ACTIVE
+          // Now ask Overseer to fetch the latest state of collection
+          // from ZK
+          ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
+        } else {
+          failure = true;
+        }
+      }
       if (failure) {
         // Let's cleanup as we hit an exception
         // We shouldn't be passing 'results' here for the cleanup as the response would then contain 'success'

Review comment:
       I can confirm that this is no longer an issue after the latest commits. I ran CreateCollectionCleanupTest with the following patch [0] and it passed consistently. While it ran, I checked the coverage report to verify that these lines were covered.
   
   [0] - https://paste.centos.org/view/09e3434d




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org