Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2019/07/18 14:53:00 UTC

[jira] [Commented] (SOLR-13399) compositeId support for shard splitting

    [ https://issues.apache.org/jira/browse/SOLR-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16888048#comment-16888048 ] 

Yonik Seeley commented on SOLR-13399:
-------------------------------------

Final patch attached; I plan on committing soon. Some implementation notes:
- This only takes two-level prefix keys into account, not tri-level yet (that can be a follow-up JIRA.)
- We currently only split into 2 ranges (again, this can be extended in a follow-up JIRA.)
- If the "id_prefix" field has no values/data, then no "ranges" split recommendation is returned and the split proceeds as if splitByPrefix had not been specified.
  - In the future we could fall back to the "id" field as a slower alternative.
- Splitting within a prefix is only done when the shard does not contain multiple prefix buckets (i.e., no allowedSizeDifference is implemented in this issue.)
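The split-point selection described in the notes above can be sketched as follows. This is an illustrative model only, not the actual patch code: it assumes each prefix bucket owns a contiguous hash sub-range within the shard, represented as (lo, hi, docCount) tuples, and the function name is hypothetical.

```python
def recommend_split(buckets):
    """Given a list of (lo, hi, doc_count) hash-range buckets, one per
    compositeId prefix and sorted by hash range, return two sub-ranges
    covering the shard, chosen so the halves hold roughly equal doc
    counts. A split *within* a prefix's range is only made when the
    shard contains a single prefix bucket, mirroring the note above."""
    total = sum(count for _, _, count in buckets)

    if len(buckets) == 1:
        # Only one prefix in the shard: split its range down the middle.
        lo, hi, _ = buckets[0]
        mid = (lo + hi) // 2
        return (lo, mid), (mid + 1, hi)

    # Otherwise, pick the bucket boundary whose cumulative doc count
    # is closest to half the total, so we never cut inside a prefix.
    best, best_diff, running = 0, total, 0
    for i in range(len(buckets) - 1):
        running += buckets[i][2]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_diff, best = diff, i

    mid = buckets[best][1]
    return (buckets[0][0], mid), (mid + 1, buckets[-1][1])
```

For example, buckets with doc counts 10/40/50 would be split at the boundary after the second bucket (10+40 docs vs. 50 docs), rather than at the midpoint of the hash range.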

> compositeId support for shard splitting
> ---------------------------------------
>
>                 Key: SOLR-13399
>                 URL: https://issues.apache.org/jira/browse/SOLR-13399
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Yonik Seeley
>            Priority: Major
>         Attachments: SOLR-13399.patch, SOLR-13399.patch
>
>
> Shard splitting does not currently have a way to automatically take into account the actual distribution (number of documents) in each hash bucket created by compositeId hashing.
> We should probably add a parameter *splitByPrefix* to the *SPLITSHARD* command that would look at the number of docs sharing each compositeId prefix and use that to create roughly equal-sized buckets by document count, rather than just assuming an equal distribution across the entire hash range.
> Like normal shard splitting, we should bias against splitting within hash buckets unless necessary, since that leads to larger query fanout. Perhaps this warrants a parameter that would control how much of a size mismatch is tolerable before resorting to splitting within a bucket. *allowedSizeDifference*?
> To more quickly calculate the number of docs in each bucket, we could index the prefix in a different field.  Iterating over the terms for this field would quickly give us the number of docs in each (i.e., Lucene already keeps track of the doc count for each term.)  Perhaps the implementation could be a flag on the *id* field... something like *indexPrefixes* and poly-fields that would cause the indexing to be done automatically, alleviating the need to pass in an additional field during indexing and during the call to *SPLITSHARD*.  This whole part is an optimization, though, and could be split off into its own issue if desired.
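The term-count optimization in the quoted description (indexing the compositeId prefix into its own field so that per-term doc counts are available for free) can be sketched in miniature. This is a toy stand-in, not Lucene code: in the real implementation one would iterate the prefix field's terms and read each term's stored doc count instead of scanning ids; the function name is hypothetical.

```python
from collections import Counter

def prefix_doc_counts(doc_ids):
    """Count documents per compositeId prefix (the part up to and
    including the '!' separator). Lucene maintains a doc count per
    indexed term, so with a dedicated prefix field these counts come
    from the term dictionary without visiting any documents; here we
    simulate that result by scanning raw ids."""
    counts = Counter()
    for doc_id in doc_ids:
        if "!" in doc_id:
            prefix = doc_id.split("!", 1)[0] + "!"
            counts[prefix] += 1
    return dict(counts)
```

Ids without a `!` separator have no compositeId prefix and contribute nothing to the counts.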



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org