Posted to issues@solr.apache.org by "Marc Brette (Jira)" <ji...@apache.org> on 2022/03/21 10:56:00 UTC

[jira] [Commented] (SOLR-16108) Incorrect distribution of records in shards after a split with splitByKeyprefix,when using the CompositeId router with a router field defined

    [ https://issues.apache.org/jira/browse/SOLR-16108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509773#comment-17509773 ] 

Marc Brette commented on SOLR-16108:
------------------------------------

I spent some time analyzing this but did not have the bandwidth to fix it (and finally decided to use id-based routing instead of router.field-based routing). These notes could be helpful if anyone would like to tackle it.
There are actually 2 issues, and maybe more down the line.
 * First, in org.apache.solr.handler.admin.SplitOp#getHashHistogramFromId:
 ** this code computes the hash ranges that the shards should have after the split.
 ** the code does not use router.field. It also assumes that each term in the field used for computing the hash occurs in only one document (a simple fix is to use termsEnum.docFreq() there).

 * Second, in org.apache.solr.update.SolrIndexSplitter#split:
 ** this code actually performs the split, based on the hash ranges computed above.
 ** here again, even though it looks up router.field, the logic that finds the hash of each document is incorrect.
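
To illustrate the docFreq() point in the first item, here is a minimal, self-contained sketch. It is not the Solr code: HashHistogramSketch, toyHash and the term counts are hypothetical, and String.hashCode() merely stands in for the router's real hash function. It only shows how counting each term once hides a heavily used routing key, while weighting each term by its document frequency does not.

```java
import java.util.Map;
import java.util.TreeMap;

public class HashHistogramSketch {

    // Stand-in for the router's hash function; NOT the hash Solr actually uses.
    static int toyHash(String routeKey) {
        return routeKey.hashCode();
    }

    // Builds a hash -> document count histogram from term -> docFreq data.
    // When weightByDocFreq is false, every distinct term counts as a single
    // document, which is the uniqueness assumption described above.
    static Map<Integer, Long> histogram(Map<String, Integer> docFreqByTerm,
                                        boolean weightByDocFreq) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> e : docFreqByTerm.entrySet()) {
            long weight = weightByDocFreq ? e.getValue() : 1L;
            counts.merge(toyHash(e.getKey()), weight, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical route_t terms and the number of documents carrying each.
        Map<String, Integer> docFreqByTerm = Map.of("france", 101, "spain", 3);
        System.out.println("unique-term assumption: " + histogram(docFreqByTerm, false));
        System.out.println("weighted by docFreq:    " + histogram(docFreqByTerm, true));
    }
}
```

In this sketch the unweighted histogram gives "france" and "spain" the same weight of 1, whereas the weighted one gives "france" a weight of 101 and "spain" a weight of 3, so only the weighted histogram reflects where the documents actually are.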

> Incorrect distribution of records in shards after a split with splitByKeyprefix,when using the CompositeId router with a router field defined
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-16108
>                 URL: https://issues.apache.org/jira/browse/SOLR-16108
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 8.4
>            Reporter: Marc Brette
>            Priority: Major
>
> When a collection is created using the CompositeId router with a router field defined, and one of its shards contains records that all have the same routing key, and that shard is split with the splitByPrefix parameter, we expect the records to be uniformly distributed between the two resulting shards.
> Instead, one shard contains no records and the other contains all of them.
> Steps to reproduce:
> {code:java}
> docker network create solr-network
> # run in one terminal
> docker run -it -h solr1 --name solr1 --net solr-network -p 18983:8983 solr:8.4 /opt/solr/bin/solr -c -f
> # run in another terminal
> docker run -it -h solr2 --name solr2 --net solr-network -p 28983:8983 solr:8.4 /opt/solr/bin/solr -c -f -z solr1:9983
> #-----------------------------------------------------------------------------------------------
> # Works, documents are split between the 2 shards
> # Create collection with default compositeId router, routing key in the id, only one shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=CREATE&name=routing_by_id&numShards=1'
> # Create enough documents, they all have the same routing key (france!)
> for i in {0..100}
> do
>   curl --request POST \
>   --url 'http://localhost:18983/solr/routing_by_id/update/json/docs?commit=true' \
>   --header 'Content-Type: application/json' \
>   --data "[{
>     \"id\": \"france\!${i}0\",
>     \"title_t\": \"hi\"
> }]"
> done
> # Check it is indexed correctly
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_id/select?q=*%3A*'
> # Split the shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=SPLITSHARD&collection=routing_by_id&shard=shard1&splitByPrefix=true'
> # Check records in shard1_0 (~half of the documents there)
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_id/select?q=*%3A*&shards=shard1_0'
> # Check records in shard1_1 (~half of the documents there)
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_id/select?q=*%3A*&shards=shard1_1'
> #-----------------------------------------------------------------------------------------------
> # Fails, does not split documents in both shards
> # Create collection with default compositeId router, routing key in the field "route_t", only one shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=CREATE&name=routing_by_field&numShards=1&router.field=route_t'
> # Create enough documents, they all have the same routing key (france) in route_t
> for i in {0..100}
> do
>   curl --request POST \
>   --url 'http://localhost:18983/solr/routing_by_field/update/json/docs?commit=true' \
>   --header 'Content-Type: application/json' \
>   --data "[{
>     \"id\": \"${i}0\",
>     \"title_t\": \"hi\",
>     \"route_t\": \"france\"
> }]"
> done
> # Check it is indexed correctly
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_field/select?q=*%3A*'
> # Split the shard
> curl --request GET \
>   --url 'http://localhost:18983/solr/admin/collections?action=SPLITSHARD&collection=routing_by_field&shard=shard1&splitByPrefix=true'
> # Check records in shard1_0: no document!
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_field/select?q=*%3A*&shards=shard1_0'
> # Check records in shard1_1: all documents!
> curl --request GET \
>   --url 'http://localhost:18983/solr/routing_by_field/select?q=*%3A*&shards=shard1_1'
>    {code}


