Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2021/03/03 22:08:00 UTC

[jira] [Commented] (SOLR-15045) Commit through curl command is causing delay in issuing commit

    [ https://issues.apache.org/jira/browse/SOLR-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294853#comment-17294853 ] 

Michael Gibney commented on SOLR-15045:
---------------------------------------

I ran into this independently while optimizing {{openSearcher}} latency on v7.7. The delay appears to originate in {{DistributedZkUpdateProcessor.processCommit(...)}} (in 7.7, {{DistributedUpdateProcessor.processCommit(...)}}).

Each SolrCloud commit first gets distributed to leaders that are not the leader core associated with the request; then {{cmdDistrib.blockAndDoRetries()}} blocks waiting for distributed commits to complete, _after which_ {{doLocalCommit(...)}} is called for the core that's locally associated with the request.
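
As a rough illustration of the ordering described above, the following Java sketch contrasts blocking on the remote commits before committing locally with running the local commit while the remote ones are still in flight. It is not Solr code: {{CommitOrderingSketch}}, {{simulateCommit}}, the core count, and the sleep durations are hypothetical stand-ins chosen for the example, and the actual change in the PR may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CommitOrderingSketch {

    // Hypothetical stand-in for a commit on one core: it only sleeps.
    static void simulateCommit(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Starts one simulated commit per remote core on the given pool.
    static List<CompletableFuture<Void>> submitRemote(ExecutorService pool, int remoteCores, long commitMillis) {
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (int i = 0; i < remoteCores; i++) {
            futures.add(CompletableFuture.runAsync(() -> simulateCommit(commitMillis), pool));
        }
        return futures;
    }

    // Ordering described above: block on all remote commits, then commit locally.
    static long remoteThenLocal(int remoteCores, long commitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(remoteCores);
        try {
            long start = System.nanoTime();
            List<CompletableFuture<Void>> remote = submitRemote(pool, remoteCores, commitMillis);
            CompletableFuture.allOf(remote.toArray(new CompletableFuture[0])).join();
            simulateCommit(commitMillis); // local commit starts only now
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    // Alternative ordering: the local commit overlaps with the remote commits.
    static long remoteAndLocalInParallel(int remoteCores, long commitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(remoteCores);
        try {
            long start = System.nanoTime();
            List<CompletableFuture<Void>> remote = submitRemote(pool, remoteCores, commitMillis);
            simulateCommit(commitMillis); // local commit runs while remotes are in flight
            CompletableFuture.allOf(remote.toArray(new CompletableFuture[0])).join();
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("remote then local: ~" + remoteThenLocal(5, 200) + " ms");
        System.out.println("in parallel:       ~" + remoteAndLocalInParallel(5, 200) + " ms");
    }
}
```

With equal commit times on every core, the first ordering takes roughly two commit durations and the second roughly one, which matches the doubling reported in the issue.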

[PR #2449|https://github.com/apache/lucene-solr/pull/2449] seeks to address this problem. Notably, in many cases this should effectively cut the user-perceived commit/openSearcher latency in half, even for those who weren't observing consequences as dramatic as the Timeout errors described by [~raj.yadav1].

There are no additional tests at the moment. This change should definitely be on the "hot path" of the test suite, and the full suite passes with it applied. That said, I'd like to add tests that describe the desired behavior and guard against regressions. I hope to do this soon, but wanted to put the fix up as soon as possible for feedback and discussion.

This fix causes "local" commits to be executed in parallel with distributed commits that are either TOLEADER _or_ FROMLEADER. I haven't investigated, but it looks to me as though, in the (probably not infrequent) event that there are both TOLEADER _and_ FROMLEADER distrib commits, the FROMLEADER commits would be executed in parallel with the "local" commit, but only _after_ issuing and blocking on all the TOLEADER commits. Am I reading this right? I'm happy to investigate further, and it would probably make sense to address that together with this change. That change would probably be a _little_ more invasive, though, so I wanted to mention it first and check my assumption that we'd ideally want all requests (TOLEADER, FROMLEADER, and local) to be executed in parallel.
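
To make the three orderings concrete, here is a small back-of-the-envelope model in Java. It is only a sketch under stated assumptions: {{CommitLatencyModel}} and its method names are hypothetical, commits within a phase are assumed to run in parallel, network overhead is ignored, the 11-minute figure is an illustrative value taken from the 10 to 12 minute range in the issue description, and the "TOLEADER first, then overlap" case reflects the unverified reading stated above.

```java
public class CommitLatencyModel {

    // Original ordering: the local commit starts only after the distributed commits finish.
    static long distributedThenLocal(long distributed, long local) {
        return distributed + local;
    }

    // Assumed (unverified) ordering with the fix when both phases exist:
    // block on TOLEADER, then run FROMLEADER and the local commit together.
    static long toLeaderThenOverlap(long toLeader, long fromLeader, long local) {
        return toLeader + Math.max(fromLeader, local);
    }

    // Ordering asked about above: TOLEADER, FROMLEADER, and local all run together.
    static long allInParallel(long toLeader, long fromLeader, long local) {
        return Math.max(toLeader, Math.max(fromLeader, local));
    }

    public static void main(String[] args) {
        long commit = 11; // minutes, illustrative value only
        System.out.println("distributed then local: " + distributedThenLocal(commit, commit));
        System.out.println("TOLEADER, then overlap: " + toLeaderThenOverlap(commit, commit, commit));
        System.out.println("all in parallel:        " + allInParallel(commit, commit, commit));
    }
}
```

Under these assumptions, a request that involves both TOLEADER and FROMLEADER commits would still pay for two commit durations with the fix, while running all three in parallel would bring it down to one.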

> Commit through curl command is causing delay in issuing commit
> --------------------------------------------------------------
>
>                 Key: SOLR-15045
>                 URL: https://issues.apache.org/jira/browse/SOLR-15045
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 8.5.2
>         Environment: Operating system: Linux (centos 7.7.1908)
>            Reporter: Raj Yadav
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hi All,
> When we issue a commit through a curl command, not all the shards receive the `start commit` request at the same time.
> *Solr Setup Detail : (Running in solrCloud mode)*
>  It has 6 shards, and each shard has only one replica (which is also a
>  leader) and the replica type is NRT.
>  Each shard is hosted on a separate physical host.
> Zookeeper => We are using external zookeeper ensemble (3 separate node
>  cluster)
> *Shard and Host name*
>  shard1_0=>solr_199
>  shard1_1=>solr_200
>  shard2_0=> solr_254
>  shard2_1=> solr_132
>  shard3_0=>solr_133
>  shard3_1=>solr_198
> *Request rate on the system is currently zero and only hourly indexing*
>  *is running on it.*
> We are using curl command to issue commit.
> {code:java}
> curl "http://solr_254:8389/solr/my_collection/update?openSearcher=true&commit=true&wt=json"{code}
> (Using solr_254 host to issue commit)
> With the above command, all the shards start processing the commit (i.e.
>  they receive the `start commit` request) except the one used in the curl
>  command (i.e. shard2_0, which is hosted on solr_254). Individually, each
>  shard takes around 10 to 12 minutes to process a hard commit (most of this
>  time is spent reloading external files).
>  As per the logs, shard2_0 receives the `start commit` request after
>  approximately 10 minutes. This leads to the following timeout error.
> {code:java}
> 2020-12-06 18:47:47.013 ERROR
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> waiting response from server at:
> http://solr_132:9744/solr/my_collection_shard2_1_replica_n21/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fsolr_254%3A9744%2Fsolr%2Fmy_collection_shard2_0_replica_n11%2F
>       at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:407)
>       at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:753)
>       at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:369)
>       at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
>       at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:344)
>       at org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:333)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
>       at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>     Caused by: java.util.concurrent.TimeoutException
>       at org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:216)
>       at org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:398)
>       ... 13 more{code}
> The timeout error above is between solr_254 and solr_132. Similar errors
>  occur between solr_254 and the other 4 shards.
> Since the query load is zero, CPU utilization is mostly around 3%.
>  After issuing the curl commit command, CPU goes up to 14% on all shards except
>  shard2_0 (host: solr_254, the one used in the curl command).
>  After 10 minutes (i.e. after it receives the `start commit` request), CPU on
>  shard2_0 also goes up to 14%.
> As I mentioned earlier, each shard takes around 10-12 minutes to process a
>  commit, and due to the delay in starting the commit on one shard (shard2_0),
>  our overall commit time has doubled (to approximately 22-24 minutes).
> *We are observing this delay in both hard and soft commit.*
> In our solr-5.4.0 deployment (which has a similar setup), we use a similar curl command to issue the commit, and there all the shards receive the `start commit` request at the same time, including the one used in the curl command.
>  
> *Impact After deleting external files:*
> To rule out the impact of external files, I deleted the external
> files from all the shards and issued a commit through the curl command. The
> commit operation completed in 3 seconds. Individual shards took 1.5 seconds
> to complete the commit operation, but there was a delay of around 1.5 seconds
> on the shard whose hostname was used to issue the commit. Hence the overall
> commit time was 3 seconds.
> During this operation, there was no timeout or any other kind of error
> (except `external file not found` error which is expected).


