Posted to issues@lucene.apache.org by "Ilan Ginzburg (Jira)" <ji...@apache.org> on 2021/02/09 00:05:00 UTC
[jira] [Commented] (SOLR-15146) Distribute Collection API command execution

    [ https://issues.apache.org/jira/browse/SOLR-15146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281460#comment-17281460 ] 

Ilan Ginzburg commented on SOLR-15146:
--------------------------------------

I've hacked a PoC distributing only the Collection creation API call (and taking shortcuts, but basically implementing the happy path reasonably) to get an idea of the implementation effort and collect a few numbers.


 The branch is at [github.com/murblanc/lucene-solr/tree/Distributing_Collection_API_PoC|https://github.com/murblanc/lucene-solr/tree/Distributing_Collection_API_PoC] and is based on the code from [PR 2285|https://github.com/apache/lucene-solr/pull/2285] from SOLR-14928.

Here are a few timing values based on runs on my laptop (3-node cluster). I ran each test twice and kept the set of values with the lowest average.
 Don't take these numbers too literally when they're close, as they can go either way (for example, the same tests gave slightly different values in a comment on SOLR-14928), but major differences do show that certain strategies are a better fit for the use case. Times are in ms.

*Create 100 collections (10 concurrent threads, 10 collections each) of 2 shards of 2 replicas each collection:*

Overseer state + Overseer collection API + json replica state: *Avg 11728*, min 8307, max 15391
 Overseer state + Overseer collection API + *PerReplicaState*: *Avg 11718*, min 5615, max 14565 
 *Distributed state* + Overseer collection API + json replica state: *Avg 7880*, min 6298, max 10986 
 *Distributed state* + Overseer collection API + *PerReplicaState*: *Avg 7768*, min 6902, max 8939
 *Distributed state* + *distributed Collection API* + json replica state: *Avg 8322*, min 6443, max 12285 
 *Distributed state* + *distributed Collection API* + *PerReplicaState*: *Avg 8702*, min 6831, max 13803

*Create 50 collections by 50 concurrent threads (1 collection each), 2 shards 2 replicas each collection:*

Overseer state + Overseer collection API + json replica state: *Avg 45315*, min 40708, max 50431 
 Overseer state + Overseer collection API + *PerReplicaState*: *Avg 46174*, min 43431, max 50025
 *Distributed state* + Overseer collection API + json replica state: *Avg 22365*, min 20591, max 23708 
 *Distributed state* + Overseer collection API + *PerReplicaState*: *Avg 22525*, min 18067, max 24049 
 *Distributed state* + *distributed Collection API* + json replica state: *Avg 18421*, min 16670, max 18968 
 *Distributed state* + *distributed Collection API* + *PerReplicaState*: *Avg 18342*, min 16137, max 18912
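
To put the averages above in relative terms, here is a minimal sketch that computes the percentage reduction in average time between two configurations. The class and method names are hypothetical and not part of the PoC branch; the input values are the "json replica state" averages copied from the lists above.

```java
import java.util.Locale;

/** Illustrative helper, not part of Solr: compares two average timings. */
final class TimingComparison {

    /** Percentage by which candidateMs is lower than baselineMs. */
    static double reductionPercent(double baselineMs, double candidateMs) {
        return 100.0 * (baselineMs - candidateMs) / baselineMs;
    }

    public static void main(String[] args) {
        // Averages in ms: Overseer state + Overseer collection API
        // versus distributed state + distributed Collection API (json replica state).
        double tenThreads = reductionPercent(11728, 8322);
        double fiftyThreads = reductionPercent(45315, 18421);
        System.out.printf(Locale.ROOT, "100 collections, 10 threads: %.1f%% lower average%n", tenThreads);
        System.out.printf(Locale.ROOT, "50 collections, 50 threads: %.1f%% lower average%n", fiftyThreads);
    }
}
```

On these numbers the fully distributed setup averages roughly 29% lower in the 10-thread test and roughly 59% lower in the 50-thread test, subject to the run-to-run variation mentioned above.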

> Distribute Collection API command execution
> -------------------------------------------
>
>                 Key: SOLR-15146
>                 URL: https://issues.apache.org/jira/browse/SOLR-15146
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (9.0)
>            Reporter: Ilan Ginzburg
>            Assignee: Ilan Ginzburg
>            Priority: Major
>              Labels: collection-api, overseer
>
> Building on the distributed cluster state update changes (SOLR-14928), this ticket will distribute the Collection API so that commands can execute on any node (i.e. the node handling the request through {{CollectionsHandler}}) without having to go through a Zookeeper queue and the Overseer.
> This is the second step (the first was SOLR-14928) after which the Overseer could be removed (the code keeps the existing execution options, so completing this ticket by no means implies the Overseer is gone, but it could be removed in a future release).
> There is a dependency on the distributed cluster state changes because the Overseer locking protecting same collection (or same shard) Collection API commands from executing concurrently will be replaced by optimistic locking of the collection {{state.json}} znodes (or other znodes that will eventually replace/augment {{state.json}}).
> The goal of this ticket is threefold:
> * Simplify the code (running synchronously and not going through the Zookeeper queues and the Overseer dequeue logic is much simpler),
> * Lead to improved performance for most/all use cases (although this is a secondary goal, as long as performance is not degraded) and
> * Allow a future change (in a separate Jira) to the way cluster state is cached on the nodes of the cluster (keep less information, be less dependent on Zookeeper watches, do not care about collections not present on the node). This future work will aim to significantly increase the scale (number of collections) supported by SolrCloud.
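
The optimistic locking of {{state.json}} mentioned in the description can be illustrated with a minimal in-memory sketch. This is not the Solr implementation: {{VersionedNode}} and its methods are hypothetical stand-ins for a versioned znode, and the sketch only shows the general pattern of reading a version, attempting a conditional write, and retrying on conflict. Zookeeper itself offers a comparable conditional write through {{setData(path, data, version)}}, which fails with a {{BadVersionException}} when the expected version is stale.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a versioned znode such as state.json. */
final class VersionedNode {

    /** Immutable snapshot of the content and the version it was read at. */
    static final class Snapshot {
        final String data;
        final int version;

        Snapshot(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    private final AtomicReference<Snapshot> current;

    VersionedNode(String initialData) {
        current = new AtomicReference<>(new Snapshot(initialData, 0));
    }

    Snapshot read() {
        return current.get();
    }

    /** Conditional write: succeeds only if the node is still at expectedVersion. */
    boolean writeIfVersion(String newData, int expectedVersion) {
        Snapshot seen = current.get();
        if (seen.version != expectedVersion) {
            return false; // another writer got there first
        }
        return current.compareAndSet(seen, new Snapshot(newData, expectedVersion + 1));
    }

    /** Read-modify-write loop: on conflict, re-read and re-apply the change. Returns attempts used. */
    int update(UnaryOperator<String> change) {
        int attempts = 0;
        while (true) {
            attempts++;
            Snapshot seen = read();
            if (writeIfVersion(change.apply(seen.data), seen.version)) {
                return attempts;
            }
        }
    }
}

final class OptimisticLockDemo {
    public static void main(String[] args) throws InterruptedException {
        VersionedNode state = new VersionedNode("");
        // Two concurrent writers, standing in for two nodes running Collection API commands.
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                state.update(d -> d + "x");
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("version=" + state.read().version + " length=" + state.read().data.length());
    }
}
```

No writer holds a lock while computing its change; a conflicting update is detected at write time and only the loser retries, which is the property that lets commands run on any node without a central queue.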


