Posted to solr-user@lucene.apache.org by Jamie Johnson <je...@gmail.com> on 2012/03/20 15:16:45 UTC

SolrCloud replica and leader out of Sync somehow

I'm trying to figure out how it's possible for 2 Solr instances (one
the leader, the other a replica) to be out of sync.  I've done commits to
the Solr instances and forced replication, but the instances still
have different info.  The relevant snippet from my clusterstate.json
is listed below.


    "shard3":{
      "host2:7577_solr_shard3-core2":{
        "shard":"shard3",
        "leader":"true",
        "state":"active",
        "core":"shard3-core2",
        "collection":"collection1",
        "node_name":"host2:7577_solr",
        "base_url":"http://host2:7577/solr"},
      "host1:7575_solr_shard3-core1":{
        "shard":"shard3",
        "state":"active",
        "core":"shard3-core1",
        "collection":"collection1",
        "node_name":"host1:7575_solr",
        "base_url":"http://host1:7575/solr"}},


Where can I look to see why this is happening?

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
Awesome, Yonik.  I'll indeed try this.  Thanks!

On Thu, Apr 5, 2012 at 10:20 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Thu, Apr 5, 2012 at 12:19 AM, Jamie Johnson <je...@gmail.com> wrote:
>> Not sure if this got lost in the shuffle, were there any thoughts on this?
>
> Sorting by "id" could be pretty expensive (memory-wise), so I don't
> think it should be default or anything.
> We also need a way for a client to hit the same set of servers again
> anyway (to handle other possible variations like commit time).
>
> To handle the tiebreak stuff, you could also sort by _version_ - that
> should be unique in an index and is already used under the covers and
> hence shouldn't add any extra memory overhead.  versions increase over
> time, so "_version desc" should give you newer documents first.
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10
>
>
>
>
>> On Wed, Mar 21, 2012 at 11:02 AM, Jamie Johnson <je...@gmail.com> wrote:
>>> Given that in a distributed environment the docids are not guaranteed
>>> to be the same across shards should the sorting use the uniqueId field
>>> as the tie breaker by default?
>>>
>>> On Tue, Mar 20, 2012 at 2:10 PM, Yonik Seeley
>>> <yo...@lucidimagination.com> wrote:
>>>> On Tue, Mar 20, 2012 at 2:02 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>>> I'll try to dig for the JIRA.  Also I'm assuming this could happen on
>>>>> any sort, not just score correct?  Meaning if we sorted by a date
>>>>> field and there were duplicates in that date field order wouldn't be
>>>>> guaranteed for the same reasons right?
>>>>
>>>> Correct - internal docid is the tiebreaker for all sorts.
>>>>
>>>> -Yonik
>>>> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
>>>> Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Apr 5, 2012 at 12:19 AM, Jamie Johnson <je...@gmail.com> wrote:
> Not sure if this got lost in the shuffle, were there any thoughts on this?

Sorting by "id" could be pretty expensive (memory-wise), so I don't
think it should be default or anything.
We also need a way for a client to hit the same set of servers again
anyway (to handle other possible variations like commit time).

To handle the tiebreak stuff, you could also sort by _version_ - that
should be unique in an index and is already used under the covers and
hence shouldn't add any extra memory overhead.  Versions increase over
time, so "_version_ desc" should give you newer documents first.
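
A minimal sketch of the request this suggests, assuming a Python client.
The base URL, the collection1 path segment and the select_url helper are
assumptions made for illustration and are not taken from any Solr client
API; only the sort value follows the suggestion above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def select_url(base_url, q, sort, start=0, rows=10):
    """Build a /select URL (hypothetical helper, not a Solr client API)."""
    params = {"q": q, "sort": sort, "start": start, "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"

# Assumed base URL; adjust host, port and core/collection to your cluster.
url = select_url("http://host1:7575/solr/collection1", "*:*",
                 "score desc,_version_ desc")

# Round-trip check: the sort parameter survives URL encoding intact.
decoded = parse_qs(urlsplit(url).query)
print(decoded["sort"][0])
```

The primary sort stays on score; _version_ only decides the order among
documents whose scores are equal.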

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10




> On Wed, Mar 21, 2012 at 11:02 AM, Jamie Johnson <je...@gmail.com> wrote:
>> Given that in a distributed environment the docids are not guaranteed
>> to be the same across shards should the sorting use the uniqueId field
>> as the tie breaker by default?
>>
>> On Tue, Mar 20, 2012 at 2:10 PM, Yonik Seeley
>> <yo...@lucidimagination.com> wrote:
>>> On Tue, Mar 20, 2012 at 2:02 PM, Jamie Johnson <je...@gmail.com> wrote:
>>>> I'll try to dig for the JIRA.  Also I'm assuming this could happen on
>>>> any sort, not just score correct?  Meaning if we sorted by a date
>>>> field and there were duplicates in that date field order wouldn't be
>>>> guaranteed for the same reasons right?
>>>
>>> Correct - internal docid is the tiebreaker for all sorts.
>>>
>>> -Yonik
>>> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
>>> Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
Not sure if this got lost in the shuffle.  Were there any thoughts on this?

On Wed, Mar 21, 2012 at 11:02 AM, Jamie Johnson <je...@gmail.com> wrote:
> Given that in a distributed environment the docids are not guaranteed
> to be the same across shards should the sorting use the uniqueId field
> as the tie breaker by default?
>
> On Tue, Mar 20, 2012 at 2:10 PM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> On Tue, Mar 20, 2012 at 2:02 PM, Jamie Johnson <je...@gmail.com> wrote:
>>> I'll try to dig for the JIRA.  Also I'm assuming this could happen on
>>> any sort, not just score correct?  Meaning if we sorted by a date
>>> field and there were duplicates in that date field order wouldn't be
>>> guaranteed for the same reasons right?
>>
>> Correct - internal docid is the tiebreaker for all sorts.
>>
>> -Yonik
>> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
>> Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
Given that in a distributed environment the docids are not guaranteed
to be the same across shards, should the sorting use the uniqueId field
as the tiebreaker by default?

On Tue, Mar 20, 2012 at 2:10 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Tue, Mar 20, 2012 at 2:02 PM, Jamie Johnson <je...@gmail.com> wrote:
>> I'll try to dig for the JIRA.  Also I'm assuming this could happen on
>> any sort, not just score correct?  Meaning if we sorted by a date
>> field and there were duplicates in that date field order wouldn't be
>> guaranteed for the same reasons right?
>
> Correct - internal docid is the tiebreaker for all sorts.
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
Thanks Yonik, I really appreciate the explanation.  It sounds like the
best way for me to solve this is to add the additional sort
parameter.  That being said, is there a significant memory increase
from doing this when sorting by score?  I don't see how I can avoid
doing this with SolrCloud, or how others wouldn't need to do the same thing.

On Tue, Mar 20, 2012 at 1:38 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Tue, Mar 20, 2012 at 1:07 PM, Jamie Johnson <je...@gmail.com> wrote:
>> I believe we're using replication to only duplicate the index
>> (standard SolrCloud nothing special on our end) so I don't see why the
>> docids wouldn't be the same....am I missing something that is
>> happening there that I am unaware of?
>
> Each document is pushed to the replicas (i.e. standard whole-index
> "replication" is only used in recovery scenarios).  If you're using
> multiple threads to index, then docA can be indexed before docB on one
> replica and vice-versa on a different replica (or on the leader).
> Although even if this were not the case, I don't believe Lucene is
> deterministic in this respect anyway (i.e. indexing identically on two
> different boxes is not guaranteed to result in the exact same internal
> document order).
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
I believe we're using replication only to duplicate the index
(standard SolrCloud, nothing special on our end), so I don't see why the
docids wouldn't be the same.  Am I missing something that is
happening there?

On Tue, Mar 20, 2012 at 11:50 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Tue, Mar 20, 2012 at 11:39 AM, Jamie Johnson <je...@gmail.com> wrote:
>> Hmmm....Ok, I don't see how it's possible for me to ensure that there
>> are no ties.  If a query were for *:* everything has a constant score,
>> if the user requested 1 page then requested the next the results on
>> the second page could be duplicates from what was on the first page.
>> I don't remember ever seeing this issue on older versions of
>> SolrCloud, although from what you're saying I should have.  What could
>> explain why I never saw this before?
>
> If you use replication only to duplicate an index (and avoid any
> merges), then you will have identical docids.
>
>> Another possible fix to ensure proper ordering couldn't we always
>> specify a sort order which contained the key?  So for instance the
>> user asks for score asc, we'd make this score asc,key asc so that
>> results would be order by score and then by key so the results across
>> pages would be consistent?
>
> Yep.
>
> And like I said, this is also an issue even on a single node.
> docid A can be before docid B, then a segment merge can cause these to
> be shuffled.
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Mar 20, 2012 at 11:39 AM, Jamie Johnson <je...@gmail.com> wrote:
> Hmmm....Ok, I don't see how it's possible for me to ensure that there
> are no ties.  If a query were for *:* everything has a constant score,
> if the user requested 1 page then requested the next the results on
> the second page could be duplicates from what was on the first page.
> I don't remember ever seeing this issue on older versions of
> SolrCloud, although from what you're saying I should have.  What could
> explain why I never saw this before?

If you use replication only to duplicate an index (and avoid any
merges), then you will have identical docids.

> Another possible fix to ensure proper ordering couldn't we always
> specify a sort order which contained the key?  So for instance the
> user asks for score asc, we'd make this score asc,key asc so that
> results would be order by score and then by key so the results across
> pages would be consistent?

Yep.

And like I said, this is also an issue even on a single node.
docid A can be before docid B, then a segment merge can cause these to
be shuffled.
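
The "score asc,key asc" idea confirmed above can be sketched as a small
client-side helper.  This is only an illustration: with_tiebreaker is a
hypothetical name, "key" stands in for whatever your uniqueKey field is
called, and the naive comma split assumes plain field sorts (it would
mangle function-query sorts that contain commas).

```python
def with_tiebreaker(sort, unique_field="key", direction="asc"):
    """Append `unique_field direction` to a Solr-style sort string,
    unless that field is already one of the sort clauses."""
    # Split "a desc, b asc" into clean clauses such as ["a desc", "b asc"].
    clauses = [c.strip() for c in sort.split(",") if c.strip()]
    # The field name is the first token of each clause.
    fields = [c.split()[0] for c in clauses]
    if unique_field not in fields:
        clauses.append(f"{unique_field} {direction}")
    return ",".join(clauses)

print(with_tiebreaker("score asc"))             # score asc,key asc
print(with_tiebreaker("score desc, key desc"))  # score desc,key desc
```

Because the key is unique, no two documents can tie on the full sort, so
the order no longer depends on internal docids.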

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
Hmmm... OK, I don't see how it's possible for me to ensure that there
are no ties.  If a query were for *:*, everything has a constant score;
if the user requested one page and then requested the next, the results on
the second page could be duplicates of what was on the first page.
I don't remember ever seeing this issue on older versions of
SolrCloud, although from what you're saying I should have.  What could
explain why I never saw this before?

Another possible fix to ensure proper ordering: couldn't we always
specify a sort order which contains the key?  So for instance, if the
user asks for score asc, we'd make this score asc,key asc so that
results would be ordered by score and then by key, and the results across
pages would be consistent?
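
The paging problem described here can be reproduced with a toy model.
This is a sketch under stated assumptions, not Solr code: the two lists
stand in for a leader and a replica that hold the same four documents
with different internal docids, and page is a hypothetical helper that
sorts by score and breaks ties on the named field.

```python
def page(index, start, rows, tiebreak):
    """Return the keys for one page, sorted by score desc, then `tiebreak`."""
    ordered = sorted(index, key=lambda d: (-d["score"], d[tiebreak]))
    return [d["key"] for d in ordered[start:start + rows]]

# A *:* query gives every document the same score, so everything ties.
leader = [{"key": k, "score": 1.0, "docid": i} for i, k in enumerate("abcd")]
replica = [{"key": k, "score": 1.0, "docid": i} for i, k in enumerate("dcba")]

# Page 1 is served by the leader and page 2 by the replica.
by_docid = page(leader, 0, 2, "docid") + page(replica, 2, 2, "docid")
by_key = page(leader, 0, 2, "key") + page(replica, 2, 2, "key")

print(by_docid)  # some documents repeat and others never appear
print(by_key)    # every document appears exactly once
```

With the docid tiebreak the second page repeats documents from the first
and two documents are never shown; with the key tiebreak both nodes agree
on one total order, so the pages line up.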


On Tue, Mar 20, 2012 at 11:30 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Tue, Mar 20, 2012 at 11:17 AM, Jamie Johnson <je...@gmail.com> wrote:
>> ok, with my custom component out of the picture I still have the same
>> issue.  Specifically, when sorting by score on a leader and replica I
>> am getting different doc orderings.  Is this something anyone has
>> seen?
>
> This is certainly possible and expected - sorting tiebreakers is by
> internal lucene docid, which can change (even on a single node!)
> If you need lists that don't shift around due to unrelated changes,
> make sure you don't have any ties!
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Mar 20, 2012 at 11:17 AM, Jamie Johnson <je...@gmail.com> wrote:
> ok, with my custom component out of the picture I still have the same
> issue.  Specifically, when sorting by score on a leader and replica I
> am getting different doc orderings.  Is this something anyone has
> seen?

This is certainly possible and expected: the sort tiebreaker is the
internal Lucene docid, which can change (even on a single node!).
If you need lists that don't shift around due to unrelated changes,
make sure you don't have any ties!
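
A toy illustration of the tiebreak behaviour, assuming nothing about
Lucene internals beyond what is said above.  The documents, scores and
docid values are made up, and rank is a hypothetical helper.

```python
def rank(index, tiebreak="docid"):
    """Sort by score descending; break ties on the given per-document field."""
    ordered = sorted(index, key=lambda d: (-d["score"], d[tiebreak]))
    return [d["key"] for d in ordered]

# Same documents and scores on both nodes; only the internal docids differ.
leader = [
    {"key": "a", "score": 1.0, "docid": 0},
    {"key": "b", "score": 1.0, "docid": 1},
    {"key": "c", "score": 0.5, "docid": 2},
]
replica = [
    {"key": "b", "score": 1.0, "docid": 0},
    {"key": "a", "score": 1.0, "docid": 1},
    {"key": "c", "score": 0.5, "docid": 2},
]

print(rank(leader), rank(replica))                # tied docs swap places
print(rank(leader, "key"), rank(replica, "key"))  # identical order
```

Only the documents with equal scores move; "c" keeps its position on both
nodes because its score is not tied with anything.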

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
OK, with my custom component out of the picture I still have the same
issue.  Specifically, when sorting by score on a leader and a replica I
am getting different doc orderings.  Is this something anyone has
seen?

On Tue, Mar 20, 2012 at 11:09 AM, Jamie Johnson <je...@gmail.com> wrote:
> DocCounts are the same.  I am going to disable my custom component to
> see if that is mucking with something but it seems to be working
> properly.
>
> After looking at the results a little closer (expanding the number of
> results coming back) it seems that the same information is in both but
> the order in which the items are being returned is not the same.  I'm
> sorting by score when they seem to be in different orders, if I sort
> by key then the results look the same.
>
> On Tue, Mar 20, 2012 at 10:52 AM, Mark Miller <ma...@gmail.com> wrote:
>> Do you have the logs for this? Either around startup or when you are forcing replication. Logs around both would be helpful.
>>
>> Also the doc counts for each shard?
>>
>> On Mar 20, 2012, at 10:16 AM, Jamie Johnson wrote:
>>
>>> I'm trying to figure out how it's possible for 2 solr instances (1
>>> which is leader 1 is replica) to be out of sync.  I've done commits to
>>> the solr instances, forced replication but still the solr instances
>>> have different info.  The relevant snippet from my clusterstate.json
>>> is listed below.
>>>
>>>
>>>    \"shard3\":{
>>>      \"host2:7577_solr_shard3-core2\":{
>>>        \"shard\":\"shard3\",
>>>        \"leader\":\"true\",
>>>        \"state\":\"active\",
>>>        \"core\":\"shard3-core2\",
>>>        \"collection\":\"collection1\",
>>>        \"node_name\":\"host2:7577_solr\",
>>>        \"base_url\":\"http://host2:7577/solr\"},
>>>      \"host1:7575_solr_shard3-core1\":{
>>>        \"shard\":\"shard3\",
>>>        \"state\":\"active\",
>>>        \"core\":\"shard3-core1\",
>>>        \"collection\":\"collection1\",
>>>        \"node_name\":\"host1:7575_solr\",
>>>        \"base_url\":\"http://host1:7575/solr\"}},
>>>
>>>
>>> Where can I look to see why this is happening?
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>

Re: SolrCloud replica and leader out of Sync somehow

Posted by Jamie Johnson <je...@gmail.com>.
Doc counts are the same.  I am going to disable my custom component to
see if that is mucking with something, but it seems to be working
properly.

After looking at the results a little closer (expanding the number of
results coming back), it seems that the same information is in both but
the order in which the items are being returned is not the same.  When
I sort by score they come back in different orders; if I sort
by key then the results look the same.

On Tue, Mar 20, 2012 at 10:52 AM, Mark Miller <ma...@gmail.com> wrote:
> Do you have the logs for this? Either around startup or when you are forcing replication. Logs around both would be helpful.
>
> Also the doc counts for each shard?
>
> On Mar 20, 2012, at 10:16 AM, Jamie Johnson wrote:
>
>> I'm trying to figure out how it's possible for 2 solr instances (1
>> which is leader 1 is replica) to be out of sync.  I've done commits to
>> the solr instances, forced replication but still the solr instances
>> have different info.  The relevant snippet from my clusterstate.json
>> is listed below.
>>
>>
>>    \"shard3\":{
>>      \"host2:7577_solr_shard3-core2\":{
>>        \"shard\":\"shard3\",
>>        \"leader\":\"true\",
>>        \"state\":\"active\",
>>        \"core\":\"shard3-core2\",
>>        \"collection\":\"collection1\",
>>        \"node_name\":\"host2:7577_solr\",
>>        \"base_url\":\"http://host2:7577/solr\"},
>>      \"host1:7575_solr_shard3-core1\":{
>>        \"shard\":\"shard3\",
>>        \"state\":\"active\",
>>        \"core\":\"shard3-core1\",
>>        \"collection\":\"collection1\",
>>        \"node_name\":\"host1:7575_solr\",
>>        \"base_url\":\"http://host1:7575/solr\"}},
>>
>>
>> Where can I look to see why this is happening?
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>

Re: SolrCloud replica and leader out of Sync somehow

Posted by Mark Miller <ma...@gmail.com>.
Do you have the logs for this? Either around startup or when you are forcing replication. Logs around both would be helpful.

Also the doc counts for each shard?

On Mar 20, 2012, at 10:16 AM, Jamie Johnson wrote:

> I'm trying to figure out how it's possible for 2 solr instances (1
> which is leader 1 is replica) to be out of sync.  I've done commits to
> the solr instances, forced replication but still the solr instances
> have different info.  The relevant snippet from my clusterstate.json
> is listed below.
> 
> 
>    \"shard3\":{
>      \"host2:7577_solr_shard3-core2\":{
>        \"shard\":\"shard3\",
>        \"leader\":\"true\",
>        \"state\":\"active\",
>        \"core\":\"shard3-core2\",
>        \"collection\":\"collection1\",
>        \"node_name\":\"host2:7577_solr\",
>        \"base_url\":\"http://host2:7577/solr\"},
>      \"host1:7575_solr_shard3-core1\":{
>        \"shard\":\"shard3\",
>        \"state\":\"active\",
>        \"core\":\"shard3-core1\",
>        \"collection\":\"collection1\",
>        \"node_name\":\"host1:7575_solr\",
>        \"base_url\":\"http://host1:7575/solr\"}},
> 
> 
> Where can I look to see why this is happening?

- Mark Miller
lucidimagination.com