Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2016/07/07 20:04:11 UTC

[jira] [Comment Edited] (SOLR-7280) Load cores in sorted order and tweak coreLoadThread counts to improve cluster stability on restarts

    [ https://issues.apache.org/jira/browse/SOLR-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366690#comment-15366690 ] 

Erick Erickson edited comment on SOLR-7280 at 7/7/16 8:03 PM:
--------------------------------------------------------------

bq: I don't think it takes a weird topology - just more replicas than thread to load them in a shard.

OK, I think I see what you're saying. You're talking about a "deep" topology, i.e. one with many replicas of a particular shard on a particular instance, whereas I was looking at a "wide" topology: many collections per instance, but each shard with only a few replicas. I've seen both in the field, as I'm sure you have.

How much of both situations would be handled by creating an ordered list of all replicas that are leaders and loading those first, then loading an ordered list of all replicas that aren't labeled as leader? There's still the case of a zillion leaders on a single instance, so some heuristic like the one you suggest seems to be in order.
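To make the "leaders first, then everything else" ordering concrete, here is a minimal sketch. It is not taken from the patch: the Replica class and the loadOrder method are hypothetical stand-ins, and it assumes the leader flag is already known for each local core at startup.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names are Solr classes.
class CoreLoadOrder {

    /** Stand-in for a core descriptor: a core name plus a "was leader" flag. */
    static final class Replica {
        final String coreName;
        final boolean leader;

        Replica(String coreName, boolean leader) {
            this.coreName = coreName;
            this.leader = leader;
        }
    }

    /** Returns core names with leaders first, each group sorted by name. */
    static List<String> loadOrder(List<Replica> replicas) {
        List<Replica> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator
                // Boolean false sorts before true, so leaders (key = false) come first.
                .comparing((Replica r) -> !r.leader)
                // Within each group, fall back to plain name order.
                .thenComparing(r -> r.coreName));

        List<String> names = new ArrayList<>();
        for (Replica r : sorted) {
            names.add(r.coreName);
        }
        return names;
    }
}
```

This does nothing for the "zillion leaders on one instance" case; it only fixes the order in which cores are handed to the loader threads.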

I'll emphasize, though, that the current code (without this patch) can prevent a cluster from coming up at _all_. With this patch the cluster at least comes up, albeit slowly if leaderVoteWait comes into play. Bumping the number of threads above the max number of replicas for a shard can handle the case you mentioned, while keeping it "reasonable" can deal with the one I'm seeing.
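As a rough sketch of the "more threads than the max replicas for a shard" rule of thumb: the helper below counts how many local replicas belong to each shard and suggests a thread count one above the largest count, never going below a configured floor. The class, the method names, and the "collection/shard" key format are all hypothetical; this is not how Solr computes coreLoadThreads.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Solr classes.
class CoreLoadThreadHeuristic {

    /**
     * Takes one entry per local replica, naming the shard it belongs to
     * (for example "coll1/shard1"), and returns the largest number of
     * local replicas that share a single shard.
     */
    static int maxReplicasPerShard(List<String> shardOfEachLocalReplica) {
        Map<String, Integer> counts = new HashMap<>();
        int max = 0;
        for (String shard : shardOfEachLocalReplica) {
            int n = counts.merge(shard, 1, Integer::sum);
            max = Math.max(max, n);
        }
        return max;
    }

    /** Suggests one more thread than the deepest shard, but at least the floor. */
    static int suggestedThreads(List<String> shardOfEachLocalReplica, int floor) {
        return Math.max(floor, maxReplicasPerShard(shardOfEachLocalReplica) + 1);
    }
}
```

For a "wide" layout the floor wins and the count stays "reasonable"; for a "deep" layout the count grows with the deepest shard.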

That said, I think the default should be quite high in the cloud case so we don't change the current behavior, and leave it to people in situations like the one I'm seeing to configure this explicitly. I think it defaults to 8 currently; perhaps 100 (or unlimited) instead in cloud mode?

Does all of the above make this patch "good enough for now", perhaps with follow-ons for more sophisticated approaches?


was (Author: erickerickson):
bq: I don't think it takes a weird topology - just more replicas than thread to load them in a shard.

OK, I think I see what you're saying. You're talking about a "deep" topology, i.e. one with many replicas on a particular shard on a particular instance and I was looking at a "wide" topology, many collections per instance but each shard had only a few replicas. I've seen both in the field as I'm sure you have....

How much of both situations would be handled by creating an ordered list of all replicas that were leaders and loading those first then loading an ordered list of all replicas that weren't labeled as leader? There's still the case of a zillion leaders on a single instance, so some heuristic like you suggest seems to be in order.

I'll emphasize though that the current code (without this patch) can prevent a cluster from coming up at _all_. With this patch the cluster at least comes up, albeit slowly if the leaderVoteWait comes into play. Bumping the number of threads can to > the max replicas for a shard can handle the case you mentioned while keeping it "reasonable" can deal with the one I'm seeing.

That said, I think the default should be quite high in the cloud case so we don't change the current behavior and let situations like I'm seeing deal with configuring this. I think it defaults to 8 currently, perhaps 100 (or unlimited) instead in cloud mode?

How much of all of the above makes this patch "good enough for now" with perhaps follow-ons on more sophisticated approaches?

> Load cores in sorted order and tweak coreLoadThread counts to improve cluster stability on restarts
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7280
>                 URL: https://issues.apache.org/jira/browse/SOLR-7280
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Noble Paul
>             Fix For: 5.2, 6.0
>
>         Attachments: SOLR-7280.patch
>
>
> In SOLR-7191, Damien mentioned that by loading solr cores in a sorted order and tweaking some of the coreLoadThread counts, he was able to improve the stability of a cluster with thousands of collections. We should explore some of these changes and fold them into Solr.


