Posted to solr-user@lucene.apache.org by Sathyam <sa...@gmail.com> on 2017/06/01 06:22:54 UTC

Solr Document Routing

Hi,

I am indexing documents into a 10-shard collection (testcollection, with no
replicas) in a Solr 6 cluster using CloudSolrClient. Looking at the Solr
logs, I see a lot of peer-to-peer document distribution going on.

An example log statement is as follows:
2017-06-01 06:07:28.378 INFO  (qtp1358444045-3673692) [c:testcollection
s:shard8 r:core_node7 x:testcollection_shard8_replica1]
o.a.s.u.p.LogUpdateProcessorFactory [testcollection_shard8_replica1]
 webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=
http://10.199.42.29:8983/solr/testcollection_shard7_replica1/&wt=javabin&version=2}{add=[BQECDwZGTCEBHZZBBiIP
(1568981383488995328), BQEBBQZB2il3wGT/0/mB (1568981383490043904),
BQEBBQZFnhOJRj+m9RJC (1568981383491092480), BQEGBgZIeBE1klHS4fxk
(1568981383492141056), BQEBBQZFVTmRx2VuCgfV (1568981383493189632)]} 0 25
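
For readers decoding a log line like the one above, here is a small Python
sketch that pulls out the core that logged the update and the request
parameters. The dict keys (receiving_core, update_distrib, forwarded_from) are
names made up for illustration, and reading update.distrib=TOLEADER as "another
core forwarded this update to the shard leader" is an interpretation, not
something the log line itself states. The add list is shortened to one document.

```python
import re
from urllib.parse import parse_qs

# The log statement from above, joined onto one line (add list shortened).
log_line = (
    "2017-06-01 06:07:28.378 INFO  (qtp1358444045-3673692) [c:testcollection "
    "s:shard8 r:core_node7 x:testcollection_shard8_replica1] "
    "o.a.s.u.p.LogUpdateProcessorFactory [testcollection_shard8_replica1] "
    " webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from="
    "http://10.199.42.29:8983/solr/testcollection_shard7_replica1/"
    "&wt=javabin&version=2}"
    "{add=[BQECDwZGTCEBHZZBBiIP (1568981383488995328)]} 0 25"
)

def summarize(line):
    """Extract the logging core and the forwarding details from one log line."""
    # The x:... entry in the bracketed context names the core that wrote the log.
    receiving = re.search(r"\bx:(\S+?)\]", line).group(1)
    # params={...} holds the request parameters in query-string form.
    params = parse_qs(re.search(r"params=\{(.*?)\}", line).group(1))
    return {
        "receiving_core": receiving,
        "update_distrib": params.get("update.distrib", [None])[0],
        "forwarded_from": params.get("distrib.from", [None])[0],
    }

summary = summarize(log_line)
for key, value in summary.items():
    print(key, "=", value)
```

On this line the sketch reports the shard8 core as the receiver and the shard7
core as the value of distrib.from.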

When I went through the CloudSolrClient code on grepcode, I saw that the
client itself works out which server it needs to hit by hashing the message
ID and looking up the shard range information in state.json.
So I am confused about why data is being distributed between peers when
there is no replication and each shard is its own leader.

I would like to know why this is happening and how to avoid it, or whether
the above log statement means something else and I am misinterpreting it.

-- 
Sathyam Doraswamy

Re: Solr Document Routing

Posted by Erick Erickson <er...@gmail.com>.
Can you check if those IDs are on shard8? You can do this by pointing
the URL at the core and specifying &distrib=false...
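
A minimal sketch of that check in Python. It only builds the URL to open or
fetch. The base URL http://localhost:8983/solr and the unique key field name
id are placeholders (the thread gives neither); the core name and IDs are taken
from the log line. The {!terms} query parser is used so that IDs containing
/ or + need no hand escaping.

```python
from urllib.parse import urlencode

def core_query_url(base_url, core, ids, id_field="id"):
    """Build a /select URL that asks a single core only (distrib=false)."""
    params = {
        # {!terms f=<field>} matches any of the comma-separated raw values.
        "q": "{!terms f=%s}%s" % (id_field, ",".join(ids)),
        "fl": id_field,
        "rows": len(ids),
        # Do not fan the query out to the other shards.
        "distrib": "false",
        "wt": "json",
    }
    return "%s/%s/select?%s" % (base_url.rstrip("/"), core, urlencode(params))

ids = [
    "BQECDwZGTCEBHZZBBiIP",
    "BQEBBQZB2il3wGT/0/mB",
    "BQEBBQZFnhOJRj+m9RJC",
    "BQEGBgZIeBE1klHS4fxk",
    "BQEBBQZFVTmRx2VuCgfV",
]
url = core_query_url(
    "http://localhost:8983/solr",      # placeholder host, adjust to the shard8 node
    "testcollection_shard8_replica1",  # core name from the log line
    ids,
)
print(url)
```

If the response lists all five IDs, the documents did end up on shard8.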

Best,
Erick

Re: Solr Document Routing

Posted by Amrit Sarkar <sa...@gmail.com>.
Sorry, the Confluence link:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

Re: Solr Document Routing

Posted by Amrit Sarkar <sa...@gmail.com>.
Sathyam,

It seems your interpretation is wrong: CloudSolrClient calculates which shard
an incoming document belongs to by hashing the document ID and determining
which hash range it falls into. As you have 10 shards, each document belongs
to exactly one of them; that shard is what gets calculated, and the document
is eventually pushed to the leader of that shard.
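
The hash-then-range lookup described above can be sketched in Python as
follows. This is an illustration only: CRC32 stands in for the hash function
Solr actually uses, and the equal split of the 32-bit space is an assumption
(the real ranges are whatever state.json lists for each shard), so the shard
names printed here will not match a real cluster.

```python
import zlib

MIN_HASH, MAX_HASH = -2**31, 2**31 - 1  # signed 32-bit hash space

def partition_ranges(num_shards):
    """Split the signed 32-bit space into num_shards contiguous ranges."""
    step = 2**32 // num_shards
    ranges = []
    start = MIN_HASH
    for i in range(num_shards):
        # The last shard absorbs the remainder so the whole space is covered.
        end = MAX_HASH if i == num_shards - 1 else start + step - 1
        ranges.append(("shard%d" % (i + 1), start, end))
        start = end + 1
    return ranges

def stand_in_hash(doc_id):
    """Stand-in 32-bit hash (CRC32). NOT the hash Solr uses."""
    unsigned = zlib.crc32(doc_id.encode("utf-8"))
    # Fold the unsigned value into the signed 32-bit range.
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned

def route(doc_id, ranges):
    """Return the name of the shard whose range contains the ID's hash."""
    h = stand_in_hash(doc_id)
    for name, lo, hi in ranges:
        if lo <= h <= hi:
            return name
    raise ValueError("hash outside all ranges")

ranges = partition_ranges(10)
for doc_id in ["BQECDwZGTCEBHZZBBiIP", "BQEBBQZB2il3wGT/0/mB"]:
    print(doc_id, "->", route(doc_id, ranges))
```

Because the ranges are contiguous and cover the whole space, every ID maps to
exactly one shard.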

The Confluence link covers this in much more detail:
https://lucidworks.com/2013/06/13/solr-cloud-document-routing/
Another useful link:
https://lucidworks.com/2013/06/13/solr-cloud-document-routing/

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
