Posted to dev@lucene.apache.org by "Joshua Humphries (JIRA)" <ji...@apache.org> on 2017/03/15 13:52:41 UTC

[jira] [Comment Edited] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

    [ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926204#comment-15926204 ] 

Joshua Humphries edited comment on SOLR-7191 at 3/15/17 1:51 PM:
-----------------------------------------------------------------

Our cluster has many thousands of collections, most of which have only a single shard and a single replica. Restarting a single node takes over two minutes in good circumstances (an expected restart, such as during upgrades of Solr or deployment of new/updated plugins). In bad circumstances, such as when machines appear wedged and leader election issues have already caused the overseer queue to grow large, restarting a server can take over 10 minutes!

While watching the overseer queue size during our latest observation of this slowness, I saw that the down-node messages take *way* too long to process. I tracked that down to an issue where processing such a message results in a ZK write for *every* collection, not just the collections that had shard replicas on that node. In our case, it was processing about 40 times too many collections, making a rolling restart of the whole cluster effectively O(n^2) instead of O(n) in terms of writes to ZK.

See SOLR-10277.
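
To make the distinction concrete, here is a minimal Java sketch of the two behaviors (hypothetical names and types, not the actual Overseer code): one ZK write per collection in the cluster versus one write only per collection that actually hosts a replica on the downed node.

    import java.util.List;
    import java.util.Map;

    class DownNodeSketch {

        // Stand-in for a collection's state; 'nodes' lists the nodes hosting its replicas.
        record CollectionState(String name, List<String> nodes) {}

        // Stand-in for the ZooKeeper client; each call is one ZK write.
        interface ZkWriter {
            void writeState(CollectionState state);
        }

        // What we observed: one ZK write per collection in the cluster, regardless
        // of whether the downed node hosts any of its replicas. Across a rolling
        // restart of n nodes this is effectively O(n^2) writes.
        static void markNodeDownEverywhere(String downNode,
                                           Map<String, CollectionState> clusterState,
                                           ZkWriter zk) {
            for (CollectionState c : clusterState.values()) {
                zk.writeState(c); // touches every collection
            }
        }

        // What SOLR-10277 asks for: skip collections with no replica on the downed
        // node, so the work is proportional to the collections actually affected.
        static void markNodeDownFiltered(String downNode,
                                         Map<String, CollectionState> clusterState,
                                         ZkWriter zk) {
            for (CollectionState c : clusterState.values()) {
                if (c.nodes().contains(downNode)) {
                    zk.writeState(c);
                }
            }
        }
    }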



> Improve stability and startup performance of SolrCloud with thousands of collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Noble Paul
>              Labels: performance, scalability
>             Fix For: 6.3
>
>         Attachments: lots-of-zkstatereader-updates-branch_5x.log, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user's setup, but I ran into many problems myself even before I was able to get 4000 collections created on a 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability.  It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud') and that ZK is embedded.


