Posted to issues@lucene.apache.org by "Andrzej Bialecki (Jira)" <ji...@apache.org> on 2020/05/12 09:13:00 UTC
[jira] [Reopened] (SOLR-14347) Autoscaling placement wrong when concurrent replica placements are calculated
[ https://issues.apache.org/jira/browse/SOLR-14347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Bialecki reopened SOLR-14347:
-------------------------------------
> Autoscaling placement wrong when concurrent replica placements are calculated
> -----------------------------------------------------------------------------
>
> Key: SOLR-14347
> URL: https://issues.apache.org/jira/browse/SOLR-14347
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling
> Affects Versions: 8.5
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Major
> Fix For: 8.6
>
> Attachments: SOLR-14347.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Steps to reproduce:
> * create a cluster of a few nodes (tested with 7 nodes)
> * define per-collection policies, each of which places its replicas exclusively on a different set of nodes
> * concurrently create a few collections, each using a different policy
> * resulting replica placement will be seriously wrong, causing many policy violations
> Running the same scenario but instead creating collections sequentially results in no violations.
> I suspect this is caused by an incorrect locking level for all collection operations (as defined in {{CollectionParams.CollectionAction}}) that create new replica placements, i.e. CREATE, ADDREPLICA, MOVEREPLICA, DELETENODE, REPLACENODE, SPLITSHARD, RESTORE, REINDEXCOLLECTION. All of these operations use the policy engine to create new replica placements, and as a result they change the cluster state. However, these operations are currently locked (in {{OverseerCollectionMessageHandler.lockTask}}) using {{LockLevel.COLLECTION}}. In practice this means that the lock is held only for the particular collection that is being modified, so operations on different collections can still compute placements at the same time.
> A straightforward fix for this issue is to change the locking level to CLUSTER (and I confirm this fixes the scenario described above). However, this effectively serializes all of the collection operations listed above, which will slow them down in general.
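The suspected effect of the two locking levels can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions and not Solr code: the class {{PlacementLockDemo}}, its methods, the node names, and the "least-loaded node" rule are all hypothetical stand-ins for the policy engine. The interleaving is written out explicitly rather than produced by real threads, so the result is reproducible.

```java
import java.util.Map;
import java.util.TreeMap;

public class PlacementLockDemo {

    // Hypothetical stand-in for a placement policy: choose the node that
    // hosts the fewest replicas in the given snapshot (ties go to the
    // lexicographically smallest node name, since TreeMap iterates in order).
    static String computePlacement(Map<String, Integer> snapshot) {
        String best = null;
        for (Map.Entry<String, Integer> e : snapshot.entrySet()) {
            if (best == null || e.getValue() < snapshot.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    // Three empty nodes; node names are illustrative only.
    static Map<String, Integer> emptyCluster() {
        Map<String, Integer> state = new TreeMap<>();
        state.put("node1", 0);
        state.put("node2", 0);
        state.put("node3", 0);
        return state;
    }

    // Models per-collection locking: the two operations hold different locks,
    // so both compute a placement from the same snapshot before either
    // placement has been applied to the cluster state.
    static Map<String, Integer> placeWithCollectionLocks() {
        Map<String, Integer> state = emptyCluster();
        String a = computePlacement(new TreeMap<>(state));
        String b = computePlacement(new TreeMap<>(state));
        state.merge(a, 1, Integer::sum);
        state.merge(b, 1, Integer::sum);
        return state;
    }

    // Models cluster-wide locking: the second operation computes its
    // placement only after the first one has been applied.
    static Map<String, Integer> placeWithClusterLock() {
        Map<String, Integer> state = emptyCluster();
        String a = computePlacement(new TreeMap<>(state));
        state.merge(a, 1, Integer::sum);
        String b = computePlacement(new TreeMap<>(state));
        state.merge(b, 1, Integer::sum);
        return state;
    }

    public static void main(String[] args) {
        System.out.println("collection-level: " + placeWithCollectionLocks());
        System.out.println("cluster-level:    " + placeWithClusterLock());
    }
}
```

In this simplified model, the collection-level case puts both replicas on node1 because both computations see the same empty snapshot, while the cluster-level case spreads them across node1 and node2. The trade-off matches the description above: the cluster-wide lock gives correct placements at the cost of running the two operations one after the other.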
--
This message was sent by Atlassian Jira
(v8.3.4#803005)