Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2020/11/06 21:53:57 UTC

[GitHub] [lucene-solr] thelabdude opened a new pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

thelabdude opened a new pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067


   # Description
   
   For collections with many shards (or aliases with many collections and some shards), `CloudSolrStream` will end up creating a new `HttpSolrClient` for every `SolrStream` it opens because the cache key is the full core URL, such as: `http://127.0.0.1:63460/solr/collection4_shard4_replica_n6/`
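   As an illustrative sketch (not the actual SolrJ code), the fix amounts to keying the client cache on the node URL rather than the full core URL. The helper below shows the intended key derivation, assuming core URLs always end with the core name:

   ```java
   public class NodeUrlKey {
       /**
        * Strip the trailing core name from a core URL such as
        * http://127.0.0.1:63460/solr/collection4_shard4_replica_n6/
        * leaving the per-node base URL http://127.0.0.1:63460/solr/
        * which can then serve as a shared client cache key.
        */
       public static String nodeBaseUrl(String coreUrl) {
           // Drop a trailing slash so lastIndexOf finds the core segment
           String u = coreUrl.endsWith("/")
               ? coreUrl.substring(0, coreUrl.length() - 1)
               : coreUrl;
           return u.substring(0, u.lastIndexOf('/') + 1);
       }
   }
   ```

   With this key, all cores hosted on the same node map to one cached client.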
   
   In addition, `CloudSolrStream#getSlices` was calling `clusterState.getCollectionsMap()`, which preemptively loads all `LazyCollectionRef` instances from ZK unnecessarily. On clusters with many collections, this could slow down streaming expression execution.
   
   # Solution
   
   In this PR, I've introduced a new ctor in `SolrStream` that accepts the `core`, which is then used to determine the URL of the node hosting the core to be queried when the stream is opened. This leads to reusing the same `HttpSolrClient` for the same node because the cache key is now `http://127.0.0.1:63460/solr/`. I chose this new ctor approach because `CloudSolrStream` is not the only consumer of `SolrStream`, and it knows how the list of URLs was constructed from cluster state, so it can safely decide to pass the core and reuse clients. We could also check whether `distrib=false` is set in the params passed to the ctor, but then we'd still have to parse out the core name, which `CloudSolrStream` can do with confidence; I'm not so sure about the other paths in the code.
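   A minimal sketch of the constructor split described above (class and field names are simplified stand-ins; the real `SolrStream` carries much more state):

   ```java
   public class StreamTarget {
       private final String baseUrl; // node URL, e.g. http://host:port/solr/
       private final String core;    // core name, or null if baseUrl already includes it

       // Existing-style usage: the full core URL, one client per core.
       public StreamTarget(String coreUrl) {
           this(coreUrl, null);
       }

       // New-style usage: the caller passes the core separately, so the
       // cached-client key collapses to the node URL alone.
       public StreamTarget(String baseUrl, String core) {
           this.baseUrl = baseUrl;
           this.core = core;
       }

       public String clientCacheKey() {
           return baseUrl; // one client per node, not per core
       }

       public String coreName() {
           return core;
       }
   }
   ```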
   
   When the request is sent to the remote core, we need to add the core name back to the path (since we stripped it from the baseUrl used by the `HttpSolrClient`). This happens in `SolrStream#constructParser`. That method is public and takes a `SolrClient` (even though `SolrStream` already has an `HttpSolrClient` created in the open method); hence, I've had to be a little more careful, checking the type of `SolrClient` (`server instanceof HttpSolrClient`) when determining whether the core needs to be added to the path. Perhaps we can change this method's signature in Solr 9?
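   The path adjustment can be sketched as follows (a simplified stand-in, assuming the core simply needs to be appended to the client's base path when the client is keyed by the node URL):

   ```java
   public class CorePath {
       /**
        * Re-append the core name to the request path when the client is
        * keyed by the node URL; a null or empty core means the client
        * already targets the core directly, so the path is left untouched.
        */
       public static String requestPath(String basePath, String core) {
           if (core == null || core.isEmpty()) {
               return basePath;
           }
           String sep = basePath.endsWith("/") ? "" : "/";
           return basePath + sep + core;
       }
   }
   ```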
   
   # Tests
   
   Tests are still under construction ... I have a manual test that creates 10 collections with 100 shards, and with this fix, a simple query streaming expression's execution time is cut by at least half (~500 ms vs. >1000 ms) just for the `openStreams` operation.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request title.
   - [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] thelabdude commented on pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#issuecomment-730490848


   Hi @joel-bernstein, I wasn't able to assign you as a reviewer on this, but I'd love for you to take a look when convenient.




[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527029201



##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/SolrStream.java
##########
@@ -268,8 +275,7 @@ private Map mapFields(Map fields, Map<String,String> mappings) {
     return fields;
   }
 
-  // temporary...
-  public TupleStreamParser constructParser(SolrClient server, SolrParams requestParams) throws IOException, SolrServerException {
+  private TupleStreamParser constructParser(SolrParams requestParams) throws IOException, SolrServerException {

Review comment:
       Didn't seem like this method needed to be public, and we already get a `SolrClient` in the open method, so there's no need to pass it. However, this breaks a public method signature, so it's only for Solr 9.x and shouldn't be back-ported to 8.x.






[GitHub] [lucene-solr] thelabdude commented on pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#issuecomment-738084590


   Plan to merge this into master later today.




[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527027557



##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##########
@@ -334,11 +334,6 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
 
-    Map<String, DocCollection> collectionsMap = clusterState.getCollectionsMap();
-
-    //TODO we should probably split collection by comma to query more than one
-    //  which is something already supported in other parts of Solr
-
     // check for alias or collection

Review comment:
       Moved the call to getAliases out of the for loop






[GitHub] [lucene-solr] thelabdude merged pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude merged pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067


   




[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r525335246



##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##########
@@ -334,11 +334,6 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
 
-    Map<String, DocCollection> collectionsMap = clusterState.getCollectionsMap();

Review comment:
       good idea
   






[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527028199



##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##########
@@ -334,88 +330,76 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
 
-    Map<String, DocCollection> collectionsMap = clusterState.getCollectionsMap();
-
-    //TODO we should probably split collection by comma to query more than one
-    //  which is something already supported in other parts of Solr
-
     // check for alias or collection
 
     List<String> allCollections = new ArrayList<>();
     String[] collectionNames = collectionName.split(",");
+    Aliases aliases = checkAlias ? zkStateReader.getAliases() : null;
+
     for(String col : collectionNames) {
-      List<String> collections = checkAlias
-          ? zkStateReader.getAliases().resolveAliases(col)  // if not an alias, returns collectionName
+      List<String> collections = (aliases != null)
+          ? aliases.resolveAliases(col)  // if not an alias, returns collectionName
           : Collections.singletonList(collectionName);
       allCollections.addAll(collections);
     }
 
     // Lookup all actives slices for these collections
     List<Slice> slices = allCollections.stream()
-        .map(collectionsMap::get)
+        .map(c -> clusterState.getCollectionOrNull(c, true))
         .filter(Objects::nonNull)
         .flatMap(docCol -> Arrays.stream(docCol.getActiveSlicesArr()))
         .collect(Collectors.toList());
     if (!slices.isEmpty()) {
-      return slices.toArray(new Slice[slices.size()]);
-    }
-
-    // Check collection case insensitive
-    for(Entry<String, DocCollection> entry : collectionsMap.entrySet()) {
-      if(entry.getKey().equalsIgnoreCase(collectionName)) {
-        return entry.getValue().getActiveSlicesArr();
-      }
+      return slices.toArray(new Slice[0]);
     }
 
     throw new IOException("Slices not found for " + collectionName);
   }
 
   protected void constructStreams() throws IOException {
+    final ModifiableSolrParams mParams = adjustParams(new ModifiableSolrParams(params));
+    mParams.set(DISTRIB, "false"); // We are the aggregator.
     try {
+      final Stream<SolrStream> streamOfSolrStream;
+      if (streamContext != null && streamContext.get("shards") != null) {
+        // stream of shard url with core
+        streamOfSolrStream = getShards(this.zkHost, this.collection, this.streamContext, mParams).stream()
+            .map(s -> new SolrStream(s, mParams));
+      } else {
+        // stream of replicas to reuse the same SolrHttpClient per baseUrl
+        // avoids re-parsing data we already have in the replicas
+        streamOfSolrStream = getReplicas(this.zkHost, this.collection, this.streamContext, mParams).stream()

Review comment:
       Here we're keeping the Replica so we have direct access to its baseUrl and core name instead of parsing those out of the shardUrl
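       The idea can be sketched with a stand-in record (not the real `org.apache.solr.common.cloud.Replica`, which reads these values from cluster state):

       ```java
       import java.util.List;
       import java.util.stream.Collectors;

       public class ReplicaTargets {
           // Stand-in for the cluster-state Replica: baseUrl and core kept separate.
           public record Replica(String baseUrl, String coreName) {}

           /**
            * Map each replica to a (baseUrl, core) pair directly; because the
            * two pieces are never joined into a single shard URL, nothing has
            * to be parsed back apart when the stream is opened.
            */
           public static List<String[]> toTargets(List<Replica> replicas) {
               return replicas.stream()
                   .map(r -> new String[]{r.baseUrl(), r.coreName()})
                   .collect(Collectors.toList());
           }
       }
       ```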






[GitHub] [lucene-solr] thelabdude commented on pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#issuecomment-723636169


   I don't think the current solution solves the problem for clusters with many nodes; you'll still get a bunch of client objects. Still investigating what can be done about that.
   




[GitHub] [lucene-solr] thelabdude commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
thelabdude commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r527027234



##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##########
@@ -334,88 +330,76 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
 
-    Map<String, DocCollection> collectionsMap = clusterState.getCollectionsMap();
-
-    //TODO we should probably split collection by comma to query more than one
-    //  which is something already supported in other parts of Solr
-
     // check for alias or collection
 
     List<String> allCollections = new ArrayList<>();
     String[] collectionNames = collectionName.split(",");
+    Aliases aliases = checkAlias ? zkStateReader.getAliases() : null;
+
     for(String col : collectionNames) {
-      List<String> collections = checkAlias
-          ? zkStateReader.getAliases().resolveAliases(col)  // if not an alias, returns collectionName
+      List<String> collections = (aliases != null)
+          ? aliases.resolveAliases(col)  // if not an alias, returns collectionName
           : Collections.singletonList(collectionName);
       allCollections.addAll(collections);
     }
 
     // Lookup all actives slices for these collections
     List<Slice> slices = allCollections.stream()
-        .map(collectionsMap::get)
+        .map(c -> clusterState.getCollectionOrNull(c, true))
         .filter(Objects::nonNull)
         .flatMap(docCol -> Arrays.stream(docCol.getActiveSlicesArr()))
         .collect(Collectors.toList());
     if (!slices.isEmpty()) {
-      return slices.toArray(new Slice[slices.size()]);
-    }
-
-    // Check collection case insensitive
-    for(Entry<String, DocCollection> entry : collectionsMap.entrySet()) {

Review comment:
       I removed this because I don't think we should try to accommodate improperly cased collection names. No tests broke, but let me know if we need this for some reason I don't understand.






[GitHub] [lucene-solr] madrob commented on a change in pull request #2067: SOLR-14987: Reuse HttpSolrClient per node vs. one per Solr core when using CloudSolrStream

Posted by GitBox <gi...@apache.org>.
madrob commented on a change in pull request #2067:
URL: https://github.com/apache/lucene-solr/pull/2067#discussion_r520198395



##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##########
@@ -334,11 +334,6 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
 
-    Map<String, DocCollection> collectionsMap = clusterState.getCollectionsMap();

Review comment:
       related: can we update the javadoc on clusterState.getCollectionsMap to be more explicit that it _will_ make a call to zk, instead of the current _may_

##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/CloudSolrStream.java
##########
@@ -334,11 +334,6 @@ private StreamComparator parseComp(String sort, String fl) throws IOException {
   public static Slice[] getSlices(String collectionName, ZkStateReader zkStateReader, boolean checkAlias) throws IOException {
     ClusterState clusterState = zkStateReader.getClusterState();
 
-    Map<String, DocCollection> collectionsMap = clusterState.getCollectionsMap();
-
-    //TODO we should probably split collection by comma to query more than one
-    //  which is something already supported in other parts of Solr
-
     // check for alias or collection

Review comment:
       Should we cache the value of `zkStateReader.getAliases` below to avoid volatile reads?

##########
File path: solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/SolrStream.java
##########
@@ -126,6 +135,17 @@ public void open() throws IOException {
     }
   }
 
+  private String getNodeUrl() {

Review comment:
       Can we precompute this in the constructor?
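       The suggestion might look roughly like this (an illustrative sketch only; the class and field names are hypothetical, and the real derivation in `SolrStream` involves more state):

       ```java
       public class NodeUrlHolder {
           private final String nodeUrl; // computed once, not per call

           /**
            * Precompute the derived URL in the constructor: when a core name
            * is supplied, join it onto the base URL; otherwise use the base
            * URL as-is. Callers then read a final field instead of rebuilding
            * the string on every request.
            */
           public NodeUrlHolder(String baseUrl, String core) {
               String sep = baseUrl.endsWith("/") ? "" : "/";
               this.nodeUrl = (core != null) ? baseUrl + sep + core : baseUrl;
           }

           public String getNodeUrl() {
               return nodeUrl;
           }
       }
       ```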



