Posted to solr-user@lucene.apache.org by Russell Taylor <Ru...@theice.com> on 2019/05/22 12:38:47 UTC

CloudSolrClient (any version). Find the node your query has connected to.

Hi,
Using CloudSolrClient, how do I find which node (I have 3 nodes for this collection on our 6-node cluster) the query has connected to?
I'm hoping to get the full URL if possible.


Regards

Russell Taylor




RE: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Russell Taylor <Ru...@theice.com>.
Thanks Erick,
I'm pretty stuck with the delete-by-query, as it can be deleting a million docs.

I'll work through what you have said and also try to find the root cause of the recovery.




Regards

Russell Taylor



-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 22 May 2019 20:17
To: solr-user@lucene.apache.org
Subject: Re: CloudSolrClient (any version). Find the node your query has connected to.

You have to be a little careful here, one thing I learned relatively recently is that there are in-memory structures that hold pointers to _all_ un-searchable docs (i.e. no new searchers have been opened since the doc was added/updated) to support real-time get. So if you’re indexing a _lot_ of docs that internal structure can grow quite large….

FWIW, delete-by-query is painful. Each one has to lock all indexing on all replicas while it completes. If you can use delete-by-id it’d be better.
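
Something along these lines in SolrJ is what I mean — only a rough sketch, where the collection name, batch size and the id list are placeholders, and "client" is a CloudSolrClient already built against the ZooKeeper ensemble:

    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    public class BatchDelete {
        // Replace one huge delete-by-query with batched delete-by-id calls.
        static void deleteInBatches(CloudSolrClient client, List<String> idsToDelete) throws Exception {
            final int batchSize = 1000;  // assumption: tune to your setup
            for (int from = 0; from < idsToDelete.size(); from += batchSize) {
                int to = Math.min(from + batchSize, idsToDelete.size());
                client.deleteById("collection1", idsToDelete.subList(from, to));
            }
            client.commit("collection1");  // keep the single explicit commit at the end
        }
    }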

Let’s back up a bit and look at _why_ your nodes go into recovery…. Leave the replicas on if you can and look for “Leader Initiated Recovery” (not sure that’s the exact phrase, but you’ll see something very like that). If that’s the case, then one situation we’ve seen is that a request takes too long to return from a follower. So the sequence looks like this:

- leader gets update
- leader indexes locally _and_ forwards to follower
- follower is busy (and the delete-by-query could be why) and takes too long to respond so the request times out
- leader says “hmmm, I don’t know what happened so I’ll tell the follower to recover”.

Given your heavy update rate, there’ll be no chance for “peer sync” to fully recover so it’ll go into full recovery. That can sometimes be fixed by simply lengthening the timeout.

Otherwise also take a look at the logs and see if you can find a root cause for the replica going into recovery and we should see if we can fix that.

I didn’t ask what versions of Solr you’re using, but in the 7x code line (7.3 IIRC) significant work was done to make recovery less likely.

Best,
Erick

> On May 22, 2019, at 10:27 AM, Shawn Heisey <ap...@elyograg.org> wrote:
>
> On 5/22/2019 10:47 AM, Russell Taylor wrote:
>> I will add that we have set commits to be only called by the loading program. We have turned off soft and autoCommits in the solrconfig.xml.
>
> Don't turn off autoCommit.  Regular hard commits, typically with openSearcher set to false so they don't interfere with change visibility, are extremely important for good Solr operation.  Without it, the transaction logs will grow out of control.  In addition to taking a lot of disk space, that will cause a Solr restart to happen VERY slowly.  Note that a hard commit with openSearcher set to false will be VERY fast -- doing them frequently is usually not a problem for performance.  Sample configs in recent Solr versions ship with autoCommit set to 15 seconds and openSearcher set to false.
>
> Not using autoSoftCommit is a reasonable thing to do if you do not need that functionality ... but don't disable autoCommit.
>
> Thanks,
> Shawn



RE: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Russell Taylor <Ru...@theice.com>.
Hi Erick/Shawn,
I went for the deleteById option, but under heavy load on my test machine I still see issues with the nodes going into recovery. I've also noticed in my testing that the leader changed and one node was down for a period before going into recovery. The two errors I see in the logs are "No registered leader was found" and "Cannot talk to ZooKeeper - Updates are disabled".

2019-05-28 16:53:21,651 [qtp1068824137-29348888] ERROR [c:bob s:shard1 r:core_node6 x:bob_shard1_replica2] org.apache.solr.common.SolrException (SolrException.java:148) - org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
        at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1472)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:670)
        at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
        at org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:97)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:179)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:135)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:274)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:239)
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:157)
        at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:186)
        at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:107)
        at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:54)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:94)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:745)

and also this error

2019-05-28 16:54:04,503 [qtp1068824137-29348888] ERROR [c:bob s:shard1 r:core_node6 x:bob_shard1_replica2] org.apache.solr.common.SolrException (SolrException.java:148) - org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: bob slice: shard1
        at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:626)
        at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:612)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:367)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:1142)
        at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:124)
        at org.apache.solr.handler.loader.JavabinLoader.delete(JavabinLoader.java:151)
        at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:112)
        at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:54)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:94)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2102)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:745)

Regards

Russell Taylor – Developer


-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 22 May 2019 20:17
To: solr-user@lucene.apache.org
Subject: Re: CloudSolrClient (any version). Find the node your query has connected to.

You have to be a little careful here, one thing I learned relatively recently is that there are in-memory structures that hold pointers to _all_ un-searchable docs (i.e. no new searchers have been opened since the doc was added/updated) to support real-time get. So if you’re indexing a _lot_ of docs that internal structure can grow quite large….

FWIW, delete-by-query is painful. Each one has to lock all indexing on all replicas while it completes. If you can use delete-by-id it’d be better.

Let’s back up a bit and look at _why_ your nodes go into recovery…. Leave the replicas on if you can and look for “Leader Initiated Recovery” (not sure that’s the exact phrase, but you’ll see something very like that). If that’s the case, then one situation we’ve seen is that a request takes too long to return from a follower. So the sequence looks like this:

- leader gets update
- leader indexes locally _and_ forwards to follower
- follower is busy (and the delete-by-query could be why) and takes too long to respond so the request times out
- leader says “hmmm, I don’t know what happened so I’ll tell the follower to recover”.

Given your heavy update rate, there’ll be no chance for “peer sync” to fully recover so it’ll go into full recovery. That can sometimes be fixed by simply lengthening the timeout.

Otherwise also take a look at the logs and see if you can find a root cause for the replica going into recovery and we should see if we can fix that.

I didn’t ask what versions of Solr you’re using, but in the 7x code line (7.3 IIRC) significant work was done to make recovery less likely.

Best,
Erick

> On May 22, 2019, at 10:27 AM, Shawn Heisey <ap...@elyograg.org> wrote:
>
> On 5/22/2019 10:47 AM, Russell Taylor wrote:
>> I will add that we have set commits to be only called by the loading program. We have turned off soft and autoCommits in the solrconfig.xml.
>
> Don't turn off autoCommit.  Regular hard commits, typically with openSearcher set to false so they don't interfere with change visibility, are extremely important for good Solr operation.  Without it, the transaction logs will grow out of control.  In addition to taking a lot of disk space, that will cause a Solr restart to happen VERY slowly.  Note that a hard commit with openSearcher set to false will be VERY fast -- doing them frequently is usually not a problem for performance.  Sample configs in recent Solr versions ship with autoCommit set to 15 seconds and openSearcher set to false.
>
> Not using autoSoftCommit is a reasonable thing to do if you do not need that functionality ... but don't disable autoCommit.
>
> Thanks,
> Shawn



Re: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Jan Høydahl <ja...@cominvent.com>.
Try to add &shards.info=true to your request. It will return a section telling exactly what shards/replicas served that request with counts and all :)
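
In SolrJ it's just an extra parameter on the query — a rough sketch, assuming you already have a CloudSolrClient ("client") and a collection called collection1 (adjust the names to yours):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.NamedList;

    // Ask Solr to report which shards/replicas served this request.
    SolrQuery q = new SolrQuery("*:*");
    q.set("shards.info", "true");
    QueryResponse rsp = client.query("collection1", q);
    // Per-replica details (shard URL, numFound, QTime) come back in the
    // "shards.info" section of the response.
    NamedList<?> shardsInfo = (NamedList<?>) rsp.getResponse().get("shards.info");
    System.out.println(shardsInfo);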

Jan Høydahl

> 22. mai 2019 kl. 21:17 skrev Erick Erickson <er...@gmail.com>:
> 
> You have to be a little careful here, one thing I learned relatively recently is that there are in-memory structures that hold pointers to _all_ un-searchable docs (i.e. no new searchers have been opened since the doc was added/updated) to support real-time get. So if you’re indexing a _lot_ of docs that internal structure can grow quite large….
> 
> FWIW, delete-by-query is painful. Each one has to lock all indexing on all replicas while it completes. If you can use delete-by-id it’d be better.
> 
> Let’s back up a bit and look at _why_ your nodes go into recovery…. Leave the replicas on if you can and look for “Leader Initiated Recovery” (not sure that’s the exact phrase, but you’ll see something very like that). If that’s the case, then one situation we’ve seen is that a request takes too long to return from a follower. So the sequence looks like this:
> 
> - leader gets update
> - leader indexes locally _and_ forwards to follower
> - follower is busy (and the delete-by-query could be why) and takes too long to respond so the request times out
> - leader says “hmmm, I don’t know what happened so I’ll tell the follower to recover”.
> 
> Given your heavy update rate, there’ll be no chance for “peer sync” to fully recover so it’ll go into full recovery. That can sometimes be fixed by simply lengthening the timeout.
> 
> Otherwise also take a look at the logs and see if you can find a root cause for the replica going into recovery and we should see if we can fix that.
> 
> I didn’t ask what versions of Solr you’re using, but in the 7x code line (7.3 IIRC) significant work was done to make recovery less likely.
> 
> Best,
> Erick
> 
>> On May 22, 2019, at 10:27 AM, Shawn Heisey <ap...@elyograg.org> wrote:
>> 
>> On 5/22/2019 10:47 AM, Russell Taylor wrote:
>>> I will add that we have set commits to be only called by the loading program. We have turned off soft and autoCommits in the solrconfig.xml.
>> 
>> Don't turn off autoCommit.  Regular hard commits, typically with openSearcher set to false so they don't interfere with change visibility, are extremely important for good Solr operation.  Without it, the transaction logs will grow out of control.  In addition to taking a lot of disk space, that will cause a Solr restart to happen VERY slowly.  Note that a hard commit with openSearcher set to false will be VERY fast -- doing them frequently is usually not a problem for performance.  Sample configs in recent Solr versions ship with autoCommit set to 15 seconds and openSearcher set to false.
>> 
>> Not using autoSoftCommit is a reasonable thing to do if you do not need that functionality ... but don't disable autoCommit.
>> 
>> Thanks,
>> Shawn
> 

Re: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Erick Erickson <er...@gmail.com>.
You have to be a little careful here, one thing I learned relatively recently is that there are in-memory structures that hold pointers to _all_ un-searchable docs (i.e. no new searchers have been opened since the doc was added/updated) to support real-time get. So if you’re indexing a _lot_ of docs that internal structure can grow quite large….

FWIW, delete-by-query is painful. Each one has to lock all indexing on all replicas while it completes. If you can use delete-by-id it’d be better.

Let’s back up a bit and look at _why_ your nodes go into recovery…. Leave the replicas on if you can and look for “Leader Initiated Recovery” (not sure that’s the exact phrase, but you’ll see something very like that). If that’s the case, then one situation we’ve seen is that a request takes too long to return from a follower. So the sequence looks like this:

- leader gets update
- leader indexes locally _and_ forwards to follower
- follower is busy (and the delete-by-query could be why) and takes too long to respond so the request times out
- leader says “hmmm, I don’t know what happened so I’ll tell the follower to recover”.

Given your heavy update rate, there’ll be no chance for “peer sync” to fully recover so it’ll go into full recovery. That can sometimes be fixed by simply lengthening the timeout.

Otherwise also take a look at the logs and see if you can find a root cause for the replica going into recovery and we should see if we can fix that.

I didn’t ask what versions of Solr you’re using, but in the 7x code line (7.3 IIRC) significant work was done to make recovery less likely.

Best,
Erick

> On May 22, 2019, at 10:27 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> 
> On 5/22/2019 10:47 AM, Russell Taylor wrote:
>> I will add that we have set commits to be only called by the loading program. We have turned off soft and autoCommits in the solrconfig.xml.
> 
> Don't turn off autoCommit.  Regular hard commits, typically with openSearcher set to false so they don't interfere with change visibility, are extremely important for good Solr operation.  Without it, the transaction logs will grow out of control.  In addition to taking a lot of disk space, that will cause a Solr restart to happen VERY slowly.  Note that a hard commit with openSearcher set to false will be VERY fast -- doing them frequently is usually not a problem for performance.  Sample configs in recent Solr versions ship with autoCommit set to 15 seconds and openSearcher set to false.
> 
> Not using autoSoftCommit is a reasonable thing to do if you do not need that functionality ... but don't disable autoCommit.
> 
> Thanks,
> Shawn


Re: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/22/2019 10:47 AM, Russell Taylor wrote:
> I will add that we have set commits to be only called by the loading program. We have turned off soft and autoCommits in the solrconfig.xml.

Don't turn off autoCommit.  Regular hard commits, typically with 
openSearcher set to false so they don't interfere with change 
visibility, are extremely important for good Solr operation.  Without 
it, the transaction logs will grow out of control.  In addition to 
taking a lot of disk space, that will cause a Solr restart to happen 
VERY slowly.  Note that a hard commit with openSearcher set to false 
will be VERY fast -- doing them frequently is usually not a problem for 
performance.  Sample configs in recent Solr versions ship with 
autoCommit set to 15 seconds and openSearcher set to false.

Not using autoSoftCommit is a reasonable thing to do if you do not need 
that functionality ... but don't disable autoCommit.

Thanks,
Shawn

RE: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Russell Taylor <Ru...@theice.com>.
Thanks Erick,
I will add that we have set commits to be only called by the loading program. We have turned off soft and autoCommits in the solrconfig.xml.
This is so when we upload, we move from one list of docs to the new list in one atomic operation (delete, add and then commit).
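
Roughly, the loading program does the equivalent of this in SolrJ (the collection and variable names below are just placeholders):

    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    // Swap from the old list of docs to the new one, made visible by a single
    // explicit commit (auto soft/hard commits are disabled in solrconfig.xml).
    static void reload(CloudSolrClient client, List<String> oldIds,
                       List<SolrInputDocument> newDocs) throws Exception {
        client.deleteById("collection1", oldIds);   // remove the previous list
        client.add("collection1", newDocs);         // load the replacement docs
        client.commit("collection1");               // one commit: both changes become searchable together
    }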

I'll also add: this index holds 500,000,000 docs, and under heavy uploading we get the nodes going into recovery. I'm presuming it's down to the commits being too far apart and causing the replication nodes to falter. This heavy upload happens in a small window of time, and to get around the issue I remove the replicas during this period and then add them back afterwards. The new recovery issue looks like it was down to heavy upload, but outside the designated period.

So the most likely scenario is that I've created the issue with my tweaking, hope you can point me in the right direction.


<autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

Regards

Russell Taylor



-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 22 May 2019 16:45
To: solr-user@lucene.apache.org
Subject: Re: CloudSolrClient (any version). Find the node your query has connected to.

OK, now we’re cooking with oil.

First, nodes in recovery shouldn’t make any difference to a query. They should not serve any part of a query so I think/hope that’s a red herring. At worst a node in recovery should pass the query on to another replica that is _not_ recovering.

When you’re looking at this, be aware that as long as _Solr_ is up and running on a node, it’ll accept queries. For simplicity let's say Solr1 hosts _only_ collection1_shard1_replica1 (cs1r1).

Now you fire a query at Solr1. It has the topology from ZooKeeper as well as its own internal knowledge of hosted replicas. For a top-level query it should send sub-queries out only to healthy replicas, bypassing its own recovering replica.

Let’s claim you fire the query at Solr2. First if there’s been time to propagate the down state of cs1r1 to ZooKeeper and Solr2 has the state, it shouldn’t even send a subrequest to cs1r1.

Now let’s say Solr2 hasn’t gotten the message yet and does send a query to cs1r1. cs1r1 should know its state is recovering and either return an error to Solr2 (which will pick a new replica to send that subrequest to) or forward it on to another healthy replica, I’m not quite sure which. In any case the request should _not_ be serviced by cs1r1.

If you do prove that a node that is really in recovery is serving requests, that’s a fairly serious bug and we need to know lots of details.


Second, even if you did have the URL Solr sends the query to, it wouldn’t help. Once a Solr node receives a query, it does its _own_ round robin for a subrequest to one replica of each shard, gets the replies back, then goes back out to the same replicas for the final documents. So you still wouldn’t know which replica served the queries.

The fact that you say things come back into sync after a commit points to autocommit times. I’m assuming you have an autocommit setting that opens a new searcher (<openSearcher>true in the “autocommit” section or any positive time in the autoSoftCommit section of solrconfig.xml). These commit points will fire at different wall-clock times, resulting in replicas temporarily having different searchable documents. BTW, the same thing applies if you send “commitWithin” in a SolrJ cloudSolrClient.add command…

Anyway, if you just fire a query at a specific replica and add &distrib=false, the replica will bring back only documents from that replica. We’re talking the replica, so part of the URL will be the complete replica name like "…./solr/collection1_shard1_replica_n1/query?q=*:*&distrib=false”

A very quick test would be, when you have a replica in recovery, stop indexing and wait for your autocommit interval to expire (one that opens a new searcher) or issue a commit to the collection. My bet/hope is that your counts will be just fine. You can use the &distrib=false parameter to query each replica of the relevant shard directly…
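
A rough sketch of that per-replica check (the host, port and core name below are only placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    // Count docs on one specific replica core, bypassing the distributed fan-out.
    static long replicaCount(String coreUrl) throws Exception {
        try (HttpSolrClient replica = new HttpSolrClient.Builder(coreUrl).build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false");   // only this core answers
            q.setRows(0);                // we just want numFound
            return replica.query(q).getResults().getNumFound();
        }
    }

    // e.g. replicaCount("http://solr1:8983/solr/collection1_shard1_replica_n1")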

Best,
Erick

> On May 22, 2019, at 8:09 AM, Russell Taylor <Ru...@theice.com> wrote:
>
> Hi Erick,
> Every time any of the replication nodes goes into recovery mode we start seeing queries which don't match the correct count. I'm being told zookeeper will give me the correct node (Not one in recovery), but I want to prove it as the query issue only comes up when any of the nodes are in recovery mode. The application loading the data shows the correct counts and after committing we check the results and they look correct.
>
> If I can get the URL I can prove that the problem is due to doing the query against a node in recovery mode.
>
> I hope that explains the problem, thanks for your time.
>
> Regards
>
> Russell Taylor
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: 22 May 2019 15:50
> To: solr-user@lucene.apache.org
> Subject: Re: CloudSolrClient (any version). Find the node your query has connected to.
>
> Why do you want to know? You’ve asked how to do X without telling us what problem Y you’re trying to solve (the XY problem), and frequently that leads to a lot of wasted time…..
>
> Under the covers CloudSolrClient uses a pretty simple round-robin load balancer to pick a Solr node to send the query to so “it depends”…..
>
>> On May 22, 2019, at 5:51 AM, Jörn Franke <jo...@gmail.com> wrote:
>>
>> You have to provide the addresses of the zookeeper ensemble - it will figure it out on its own based on information in Zookeeper.
>>
>>> Am 22.05.2019 um 14:38 schrieb Russell Taylor <Ru...@theice.com>:
>>>
>>> Hi,
>>> Using CloudSolrClient, how do I find the node (I have 3 nodes for this collection on our 6 node cluster) the query has connected to.
>>> I'm hoping to get the full URL if possible.
>>>
>>>
>>> Regards
>>>
>>> Russell Taylor
>>>
>>>
>>>
>
>



Re: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Erick Erickson <er...@gmail.com>.
OK, now we’re cooking with oil.

First, nodes in recovery shouldn’t make any difference to a query. They should not serve any part of a query so I think/hope that’s a red herring. At worst a node in recovery should pass the query on to another replica that is _not_ recovering.

When you’re looking at this, be aware that as long as _Solr_ is up and running on a node, it’ll accept queries. For simplicity let's say Solr1 hosts _only_ collection1_shard1_replica1 (cs1r1).

Now you fire a query at Solr1. It has the topology from ZooKeeper as well as its own internal knowledge of hosted replicas. For a top-level query it should send sub-queries out only to healthy replicas, bypassing its own recovering replica.

Let’s claim you fire the query at Solr2. First if there’s been time to propagate the down state of cs1r1 to ZooKeeper and Solr2 has the state, it shouldn’t even send a subrequest to cs1r1.

Now let’s say Solr2 hasn’t gotten the message yet and does send a query to cs1r1. cs1r1 should know its state is recovering and either return an error to Solr2 (which will pick a new replica to send that subrequest to) or forward it on to another healthy replica, I’m not quite sure which. In any case the request should _not_ be serviced by cs1r1.

If you do prove that a node that is really in recovery is serving requests, that’s a fairly serious bug and we need to know lots of details.


Second, even if you did have the URL Solr sends the query to, it wouldn’t help. Once a Solr node receives a query, it does its _own_ round robin for a subrequest to one replica of each shard, gets the replies back, then goes back out to the same replicas for the final documents. So you still wouldn’t know which replica served the queries.

The fact that you say things come back into sync after a commit points to autocommit times. I’m assuming you have an autocommit setting that opens a new searcher (<openSearcher>true in the “autocommit” section or any positive time in the autoSoftCommit section of solrconfig.xml). These commit points will fire at different wall-clock times, resulting in replicas temporarily having different searchable documents. BTW, the same thing applies if you send “commitWithin” in a SolrJ cloudSolrClient.add command…

Anyway, if you just fire a query at a specific replica and add &distrib=false, the replica will bring back only documents from that replica. We’re talking the replica, so part of the URL will be the complete replica name like "…./solr/collection1_shard1_replica_n1/query?q=*:*&distrib=false”

A very quick test would be, when you have a replica in recovery, stop indexing and wait for your autocommit interval to expire (one that opens a new searcher) or issue a commit to the collection. My bet/hope is that your counts will be just fine. You can use the &distrib=false parameter to query each replica of the relevant shard directly…

Best,
Erick

> On May 22, 2019, at 8:09 AM, Russell Taylor <Ru...@theice.com> wrote:
> 
> Hi Erick,
> Every time any of the replication nodes goes into recovery mode we start seeing queries which don't match the correct count. I'm being told zookeeper will give me the correct node (Not one in recovery), but I want to prove it as the query issue only comes up when any of the nodes are in recovery mode. The application loading the data shows the correct counts and after committing we check the results and they look correct.
> 
> If I can get the URL I can prove that the problem is due to doing the query against a node in recovery mode.
> 
> I hope that explains the problem, thanks for your time.
> 
> Regards
> 
> Russell Taylor
> 
> 
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: 22 May 2019 15:50
> To: solr-user@lucene.apache.org
> Subject: Re: CloudSolrClient (any version). Find the node your query has connected to.
> 
> Why do you want to know? You’ve asked how to do X without telling us what problem Y you’re trying to solve (the XY problem), and frequently that leads to a lot of wasted time…..
> 
> Under the covers CloudSolrClient uses a pretty simple round-robin load balancer to pick a Solr node to send the query to so “it depends”…..
> 
>> On May 22, 2019, at 5:51 AM, Jörn Franke <jo...@gmail.com> wrote:
>> 
>> You have to provide the addresses of the zookeeper ensemble - it will figure it out on its own based on information in Zookeeper.
>> 
>>> Am 22.05.2019 um 14:38 schrieb Russell Taylor <Ru...@theice.com>:
>>> 
>>> Hi,
>>> Using CloudSolrClient, how do I find the node (I have 3 nodes for this collection on our 6 node cluster) the query has connected to.
>>> I'm hoping to get the full URL if possible.
>>> 
>>> 
>>> Regards
>>> 
>>> Russell Taylor
>>> 
>>> 
>>> 
> 
> 


RE: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Russell Taylor <Ru...@theice.com>.
Hi Erick,
 Every time any of the replication nodes goes into recovery mode we start seeing queries which don't match the correct count. I'm being told zookeeper will give me the correct node (Not one in recovery), but I want to prove it as the query issue only comes up when any of the nodes are in recovery mode. The application loading the data shows the correct counts and after committing we check the results and they look correct.

If I can get the URL I can prove that the problem is due to doing the query against a node in recovery mode.

I hope that explains the problem, thanks for your time.

Regards

Russell Taylor



-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: 22 May 2019 15:50
To: solr-user@lucene.apache.org
Subject: Re: CloudSolrClient (any version). Find the node your query has connected to.

Why do you want to know? You’ve asked how to do X without telling us what problem Y you’re trying to solve (the XY problem), and frequently that leads to a lot of wasted time…..

Under the covers CloudSolrClient uses a pretty simple round-robin load balancer to pick a Solr node to send the query to so “it depends”…..

> On May 22, 2019, at 5:51 AM, Jörn Franke <jo...@gmail.com> wrote:
>
> You have to provide the addresses of the zookeeper ensemble - it will figure it out on its own based on information in Zookeeper.
>
>> Am 22.05.2019 um 14:38 schrieb Russell Taylor <Ru...@theice.com>:
>>
>> Hi,
>> Using CloudSolrClient, how do I find the node (I have 3 nodes for this collection on our 6 node cluster) the query has connected to.
>> I'm hoping to get the full URL if possible.
>>
>>
>> Regards
>>
>> Russell Taylor
>>
>>
>>



Re: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Erick Erickson <er...@gmail.com>.
Why do you want to know? You’ve asked how to do X without telling us what problem Y you’re trying to solve (the XY problem), and frequently that leads to a lot of wasted time…..

Under the covers CloudSolrClient uses a pretty simple round-robin load balancer to pick a Solr node to send the query to so “it depends”…..
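
If you want to see what the client actually knows, you can dump the cluster state it reads from ZooKeeper — a rough sketch, where "collection1" is a placeholder, "client" is your CloudSolrClient, and getZkStateReader() is the accessor in the 7x/8x SolrJ:

    import java.util.Set;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;

    // Which nodes are live, and what state each replica of the collection is in.
    ClusterState state = client.getZkStateReader().getClusterState();
    Set<String> liveNodes = state.getLiveNodes();
    System.out.println("live nodes: " + liveNodes);
    for (Replica r : state.getCollection("collection1").getReplicas()) {
        System.out.println(r.getName() + " on " + r.getNodeName() + " -> " + r.getState());
    }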

> On May 22, 2019, at 5:51 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> You have to provide the addresses of the zookeeper ensemble - it will figure it out on its own based on information in Zookeeper.
> 
>> Am 22.05.2019 um 14:38 schrieb Russell Taylor <Ru...@theice.com>:
>> 
>> Hi,
>> Using CloudSolrClient, how do I find the node (I have 3 nodes for this collection on our 6 node cluster) the query has connected to.
>> I'm hoping to get the full URL if possible.
>> 
>> 
>> Regards
>> 
>> Russell Taylor
>> 
>> 
>> 


Re: CloudSolrClient (any version). Find the node your query has connected to.

Posted by Jörn Franke <jo...@gmail.com>.
You have to provide the addresses of the zookeeper ensemble - it will figure it out on its own based on information in Zookeeper.
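
For example (the hosts, chroot and collection name are placeholders; this is the SolrJ 7.x+ Builder form — older versions use a slightly different builder):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    // Point the client at the ZooKeeper ensemble; it discovers the collection's
    // nodes from the cluster state and load-balances requests across them.
    List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
    CloudSolrClient client = new CloudSolrClient.Builder(zkHosts, Optional.empty()).build();
    client.setDefaultCollection("collection1");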

> Am 22.05.2019 um 14:38 schrieb Russell Taylor <Ru...@theice.com>:
> 
> Hi,
> Using CloudSolrClient, how do I find the node (I have 3 nodes for this collection on our 6 node cluster) the query has connected to.
> I'm hoping to get the full URL if possible.
> 
> 
> Regards
> 
> Russell Taylor
> 
> 
> 