Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2021/01/26 20:10:00 UTC

[jira] [Assigned] (SOLR-15109) Optimize shard splitByPrefix logic to reduce number of splits required

     [ https://issues.apache.org/jira/browse/SOLR-15109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley reassigned SOLR-15109:
-----------------------------------

    Component/s: SolrCloud
       Assignee: David Smiley
        Summary: Optimize shard splitByPrefix logic to reduce number of splits required  (was: Optimize splitByPrefix logic to reduce number of splits required)

> Optimize shard splitByPrefix logic to reduce number of splits required
> ----------------------------------------------------------------------
>
>                 Key: SOLR-15109
>                 URL: https://issues.apache.org/jira/browse/SOLR-15109
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Megan Carey
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: Split 1 (1).png, Split 2 (1).png, Split 3 (1).png
>
>
> The goal of the splitByPrefix logic is to identify "buckets" within a shard that contain documents that should be co-located (according to their doc prefix), and to split the shard such that those buckets are preserved. One issue that we have found with splitByPrefix in practice is that it often takes several splits to isolate a particularly large bucket within the hash range.
> [~dsmiley] came up with a simple optimization that will reduce the number of splits needed to isolate such a bucket:
> {quote}Loop over all RangeCounts... does it intersect the middle third of the input?  If not, move on.  If so, track the biggest.  When this loop finishes, you will have the biggest that also intersects the middle third.  Then simply choose the side of this biggest RangeCount that is closest to the middle of the input range.{quote}
> The attached diagrams (Split 1 (1).png, Split 2 (1).png, Split 3 (1).png) illustrate this.
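
The heuristic quoted in the description can be sketched roughly as follows. This is only an illustration in Python, not the Solr implementation: the RangeCount class here is a hypothetical stand-in for an inclusive hash range plus its document count, choose_split_edge is an invented name, and the exact boundaries of the "middle third" are an assumption. It also only picks the edge to split at; turning that edge into two sub-ranges and handling degenerate cases (for example, an edge that coincides with the end of the input range) is left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RangeCount:
    """Hypothetical stand-in: inclusive hash range [lo, hi] and its doc count."""
    lo: int
    hi: int
    count: int


def choose_split_edge(input_lo: int, input_hi: int,
                      buckets: Sequence[RangeCount]) -> Optional[int]:
    """Return the edge of the chosen bucket to split at, or None if no
    bucket intersects the middle third of [input_lo, input_hi]."""
    width = input_hi - input_lo
    # Assumed definition of the middle third (integer arithmetic).
    third_lo = input_lo + width // 3
    third_hi = input_hi - width // 3
    middle = input_lo + width // 2

    biggest = None
    for rc in buckets:
        # Does this bucket intersect the middle third? If not, move on.
        if rc.hi < third_lo or rc.lo > third_hi:
            continue
        # If so, track the biggest one seen so far.
        if biggest is None or rc.count > biggest.count:
            biggest = rc

    if biggest is None:
        return None

    # Choose the side of the biggest bucket closest to the middle of the input.
    if abs(biggest.lo - middle) <= abs(biggest.hi - middle):
        return biggest.lo
    return biggest.hi


buckets = [
    RangeCount(0, 99, 10),
    RangeCount(100, 499, 500),   # the large bucket to isolate
    RangeCount(500, 599, 50),
    RangeCount(600, 899, 40),
]
edge = choose_split_edge(0, 899, buckets)
```

With these made-up counts the large bucket [100, 499] intersects the middle third, and its upper edge is the one nearer the midpoint of the input range, so the split lands on a boundary of the large bucket rather than at the plain midpoint.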


