Posted to users@solr.apache.org by emmanuel Gosse <em...@gmail.com> on 2021/06/28 19:53:18 UTC
How do Solr searchers take index changes into account on replicas?

 Hi all,

>
> I am currently facing a serious issue on our Solr 8.8.0 cluster.
> The symptoms are:
> - an increase in the number of threads
> - above all, Solr failing to actually release the index files.
>
> This eventually fills the disk and crashes the server.
>
>
> First analysis:
> - File leak:
> Solr considers the files deleted (lsof reports them as DEL), but the
> filesystem (as seen through lsof) still shows them present and owned by
> the user who started Solr.
> - Thread leak (and memory leak):
> the qtp pool keeps growing and its threads stay in TIMED_WAITING.
>
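To put a number on the file leak described above, a minimal sketch that counts deleted-but-still-open index files in saved lsof output. It assumes the default column layout of `lsof -p <pid>` on Linux (FD in the fourth column, file name last, no spaces in paths); the sample paths and the `/index/` marker are hypothetical.

```python
from collections import Counter

def leaked_index_files(lsof_text, index_marker="/index/"):
    """Count deleted-but-still-open files per path in `lsof -p <pid>` text."""
    leaked = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "COMMAND":
            continue  # skip blank lines and the header row
        # Open descriptors on unlinked files carry a trailing "(deleted)";
        # deleted memory-mapped files show "DEL" in the FD column instead.
        has_deleted_suffix = fields[-1] == "(deleted)"
        if fields[3] != "DEL" and not has_deleted_suffix:
            continue
        path = fields[-2] if has_deleted_suffix else fields[-1]
        if index_marker in path:
            leaked[path] += 1
    return leaked

# Hypothetical sample; real input would come from `lsof -p <solr pid>`.
SAMPLE = """\
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
java    4242 solr  DEL    REG    8,1           131 /data/solr/core1/data/index/_a.fdt
java    4242 solr  DEL    REG    8,1           132 /data/solr/core1/data/index/_a.tim
java    4242 solr  45r    REG    8,1     2048  140 /data/solr/core1/data/index/_b.fdt
java    4242 solr  46r    REG    8,1     4096  141 /data/solr/core1/data/index/_c.fdt (deleted)
java    4242 solr  mem    REG    8,1    10240  200 /usr/lib/jvm/lib/libjava.so
"""

for path, count in sorted(leaked_index_files(SAMPLE).items()):
    print(count, path)
```

Running this at regular intervals during the load test would show whether the count keeps growing or levels off.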
>
> Context:
> We are a large French e-commerce site.
> We use Solr for the product engine (~100-120 million products).
> Indexing runs continuously (even at night) and is intense: 500 products
> per second per instance.
> Each Solr instance serves between 50 (thanks to bots at night) and 400
> requests per second (on a normal day).
> So, and this is one factor in the problem, the instances never get a
> moment to breathe (even at night).
>
> Architecture:
> This happens on both of our architectures: Tlog - Tlog and Tlog - Pull
> (we are migrating from the first to the second).
> Another factor in the problem: the bug occurs on the replication client,
> that is, on a Pull replica or a Tlog follower. Never on an indexer.
>
> The bug has been occurring on all our production clusters for 2 months,
> since we installed Solr 8.8.0 (replacing Solr 7.7.2).
>
> Reproducibility: on demand, on a sandbox, in about ten minutes.
>
> How to reproduce:
> - continuous full indexing, 24 hours a day
> - segment files change every minute
> - tlog-pull replication with a 00:00:10 delay
> - continuous search load (100 q/s): query + faceting with edismax (I
> removed all specific features: geolock, timeOut, caches, to isolate the bug)
> For the test, Gatling replays 10 000 search requests in a loop.
>
> After a few minutes of load, lsof shows leaked files held by qtp
> threads. That is, IndexFetcher has deleted the files and lsof shows them
> as DEL, but they cannot actually be removed because Solr still holds a
> reference to them.
> In the JProfiler heap walker, I can still find Strings containing the
> names of these deleted files. Following the reference trail leads me back
> to SolrCore, SolrIndexSearcher and a qtp thread.
> We also see too many qtp threads stuck like this one.
> ###
> "httpShardExecutor-7-thread-58586-processing-x:offers_TP_shard4_replica_p77
> r:core_node78 http:////
> xxxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_p87//|http:////xxxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_t17//
> n:xxxx.cdweb.biz:8983_solr c:offers_TP s:shard4 [http:////
> xxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_p87//, http:////
> xxxx.cdweb.biz:8983//solr//offers_TP_shard9_replica_t17//]" - Thread
> t@103812
>    java.lang.Thread.State: TIMED_WAITING
>         at jdk.internal.misc.Unsafe.park(Native Method)
>         - parking to wait for <5e18a5a8> (a java.util.concurrent.SynchronousQueue$TransferStack)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
>         at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:462)
>         at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:361)
>         at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:937)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1053)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1114)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(Thread.java:829)
>    Locked ownable synchronizers:
>         - None
> ###
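To track the thread growth over time, a minimal sketch that counts threads per pool and state in a raw thread dump laid out like the one above (a quoted thread name on one line, followed by a `java.lang.Thread.State:` line). It assumes an unwrapped, unquoted dump file; the sample thread names are hypothetical, and the "pool" is simply the leading alphabetic part of the thread name.

```python
import re
from collections import Counter

THREAD_NAME = re.compile(r'^"([A-Za-z]+)')  # leading letters of the thread name
THREAD_STATE = re.compile(r'java\.lang\.Thread\.State:\s+(\w+)')

def threads_by_pool_and_state(dump_text):
    """Return a Counter keyed by (pool prefix, thread state)."""
    counts = Counter()
    pool = None
    for line in dump_text.splitlines():
        name = THREAD_NAME.match(line)
        if name:
            pool = name.group(1)  # remember the pool until its state line
            continue
        state = THREAD_STATE.search(line)
        if state and pool is not None:
            counts[(pool, state.group(1))] += 1
            pool = None
    return counts

# Hypothetical sample; real input would be a saved thread dump.
SAMPLE_DUMP = '''\
"qtp1758386724-21" - Thread t@21
   java.lang.Thread.State: TIMED_WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)

"qtp1758386724-22" - Thread t@22
   java.lang.Thread.State: RUNNABLE

"httpShardExecutor-7-thread-58586" - Thread t@103812
   java.lang.Thread.State: TIMED_WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
'''

for (pool, state), count in sorted(threads_by_pool_and_state(SAMPLE_DUMP).items()):
    print(pool, state, count)
```

Comparing these counts between dumps taken a few minutes apart would show which pool is actually growing.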
>
> If I stop the load, the number of qtp threads decreases and the locks on
> the index files disappear.
>
>
> My understanding so far:
> With continuous index changes, replication and searches,
> SolrIndexSearchers are created and stay in use until they are no longer
> needed, which never happens.
> No index-change trigger tells the 'search part' (qtp,
> SolrIndexSearcher, SolrCore ...) that it is working on an outdated index.
> The searchers' reference counters always have requests in flight, so they
> never release anything.
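The hypothesis above can be illustrated with a toy model of reference-counted searchers. This is only a sketch of the general technique, not Solr's actual code: the class names, the `register`/`acquire` methods and the file names are all hypothetical.

```python
class RefCountedSearcher:
    """Toy searcher that pins a set of index files until its refcount hits 0."""

    def __init__(self, generation, files):
        self.generation = generation
        self.files = set(files)
        self.refcount = 1   # reference held by the core while this is current
        self.closed = False

    def incref(self):
        if self.closed:
            raise RuntimeError("searcher already closed")
        self.refcount += 1
        return self

    def decref(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.closed = True  # only now can the files really be released


class ToyCore:
    """Toy core: hands out the current searcher and retires old ones."""

    def __init__(self):
        self.current = None
        self.retired = []

    def register(self, searcher):
        old, self.current = self.current, searcher
        if old is not None:
            self.retired.append(old)
            old.decref()  # drop the core's own reference to the old searcher

    def acquire(self):
        return self.current.incref()  # each request takes one reference

    def pinned_files(self):
        """Files still held by retired searchers that have not closed yet."""
        pinned = set()
        for searcher in self.retired:
            if not searcher.closed:
                pinned |= searcher.files
        return pinned


core = ToyCore()
core.register(RefCountedSearcher(1, ["_a.fdt"]))
in_flight = core.acquire()                        # a request starts on generation 1
core.register(RefCountedSearcher(2, ["_b.fdt"]))  # replication installs a new index
print(core.pinned_files())                        # generation 1 files are still pinned
in_flight.decref()                                # the request finishes
print(core.pinned_files())                        # nothing is pinned any more
```

In this toy model a retired searcher is pinned only by requests that started before the swap, or by a reference that is never released. Whether Solr 8.8.0 behaves the same way under this load is exactly the open question.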
>
>
> So my questions are:
> - Does anyone know how the SolrIndexSearcher lifecycle works and could
> explain it to me?
> - Have I understood correctly how these two mechanisms (replication and
> searching) interact?
> - Does this mean that using replication requires some downtime to give the
> qtp thread pool time to clean up?
> - What alternatives do we have? (We tried NRT; it cannot handle the
> load.)
>
> NB:
> If you use lsof, you will see a few small, permanent locks beyond those I
> described, right from the start, because IndexFetcher releases index
> files lazily in its 'fetchLatestIndex' method.
> To see the main bug more clearly, I force 'solrCore.closeSearcher()' in
> that method so that my lsof count is at 0 when the test begins.
>
> Thank you for reading this far.
> I hope this rings a bell for someone.
>
> Thanks
>
> Emmanuel
>
>