Posted to dev@lucene.apache.org by Varun Thacker <va...@vthacker.in> on 2018/01/05 21:18:33 UTC

Indexing fingerprinting and PeerSync

Hi Everyone,

I was looking into a scenario where PeerSync failed even though we had a high
maxNumLogsToKeep (200) and numRecordsToKeep (200000).

The log excerpt is at
https://gist.github.com/vthacker/fb536c6f1146dd0d7513afb9960a10e3 and I am
still trying to pinpoint the actual cause. It looks to me like the replica
has more documents up to that version (numVersions) than the leader, and I
can't tell why. Does this look like a bug?
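
For context on what a numVersions mismatch means, here is a minimal,
self-contained sketch of the kind of fingerprint comparison involved. This is
not Solr's actual IndexFingerprint class; the class and field names are
assumptions modelled on the values printed in the linked log excerpt.

    // Hypothetical, simplified stand-in for an index fingerprint. The field
    // names are assumptions based on the values visible in the log excerpt;
    // Solr's real fingerprint also hashes the versions themselves.
    final class FingerprintSketch {
        final long maxVersionSpecified; // only versions <= this are fingerprinted
        final long numVersions;         // how many versions were found up to that point
        final long versionsHash;        // hash over those versions

        FingerprintSketch(long maxVersionSpecified, long numVersions, long versionsHash) {
            this.maxVersionSpecified = maxVersionSpecified;
            this.numVersions = numVersions;
            this.versionsHash = versionsHash;
        }

        /** Two cores only count as in sync if every component matches. */
        static boolean matches(FingerprintSketch leader, FingerprintSketch replica) {
            return leader.numVersions == replica.numVersions
                && leader.versionsHash == replica.versionsHash;
        }

        public static void main(String[] args) {
            // Values taken from the numbers quoted later in this thread: the replica
            // reports 16 more versions than the leader up to the same max version.
            FingerprintSketch leader  = new FingerprintSketch(Long.MAX_VALUE, 104904602L, 1L);
            FingerprintSketch replica = new FingerprintSketch(Long.MAX_VALUE, 104904618L, 2L);
            System.out.println("in sync? " + matches(leader, replica)); // prints: in sync? false
        }
    }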

While trying to reproduce it locally, here is one scenario that I ran into:

   1. I kept a very low numRecordsToKeep (5), indexed 3 or 4 docs while the
   replica was down, and then started it up. PeerSync failed because of
   https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/PeerSync.java#L655
   . Do we still need that threshold check when we are verifying via
   fingerprinting whether the indexes are the same? From my understanding we
   can skip this check when fingerprinting is enabled, but I wanted to check
   before filing a Jira (see the sketch below).
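
To make the question concrete, below is a rough sketch of the kind of
short-circuit version check being discussed and where a fingerprint-based
bypass could slot in. It is my own simplified reconstruction based on the
description later in this thread, not the code at the linked line; the method
and variable names are assumptions.

    import java.util.Collections;
    import java.util.List;

    // Simplified reconstruction of the short-circuit being discussed. Names
    // and structure are assumptions, not the actual PeerSync implementation.
    final class VersionShortCircuitSketch {

        /**
         * Returns true if PeerSync should keep going against this peer, false
         * if it should give up (and the node falls back to full replication).
         */
        static boolean canContinue(List<Long> ourVersions, List<Long> peerVersions,
                                   boolean fingerprintEnabled) {
            long ourHighest = Collections.max(ourVersions);
            long peerHighest = Collections.max(peerVersions);

            // The check in question: if our highest version is newer than the
            // peer's, bail out here -- before version ranges are computed and
            // before any fingerprint is requested.
            if (ourHighest > peerHighest) {
                // The question raised above: when fingerprinting is enabled,
                // the fingerprint comparison could arguably be trusted to
                // decide this instead of failing eagerly here.
                return fingerprintEnabled;
            }
            return true;
        }

        public static void main(String[] args) {
            List<Long> ours  = List.of(101L, 102L, 103L, 104L, 105L);
            List<Long> peers = List.of(101L, 102L, 103L);
            System.out.println(canContinue(ours, peers, false)); // false: we are ahead of the peer
            System.out.println(canContinue(ours, peers, true));  // true: defer to the fingerprint
        }
    }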

Re: Indexing fingerprinting and PeerSync

Posted by Pushkar Raste <pu...@gmail.com>.
Ah, so the replica ends up with a higher version count after the updates are
applied. One possible reason could be that the replica did not buffer the
updates that came in while it was recovering. In your test, by contrast, it
failed before the updates were applied, since the check you are pointing out
happens even before the version ranges are computed.
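
For reference, here is a minimal sketch of what "buffering updates during
recovery" refers to. The names are hypothetical; this is not Solr's UpdateLog
API, just an illustration of the expected buffer-then-replay behaviour.

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Illustration only: updates that arrive while a replica is recovering
    // should be buffered and replayed after the sync, not dropped.
    final class RecoveryBufferSketch {
        private final Queue<String> buffered = new ArrayDeque<>();
        private boolean recovering;

        void startRecovery() { recovering = true; }

        /** Called for every incoming update from the leader. */
        void onUpdate(String update) {
            if (recovering) {
                buffered.add(update); // hold on to it until the sync has finished
            } else {
                apply(update);
            }
        }

        /** Called once PeerSync (or replication) has caught the replica up. */
        void finishRecovery() {
            recovering = false;
            while (!buffered.isEmpty()) {
                apply(buffered.poll()); // replay everything that arrived mid-recovery
            }
        }

        private void apply(String update) {
            System.out.println("applying " + update);
        }
    }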

As mentioned, the only thing that pops out from the logs is that there are
too many version ranges. Ideally there should be only one version range,
with the lower version corresponding to the last version the replica
received before going down and the higher version corresponding to the last
version the replica received before it started buffering updates.
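
To illustrate why only one range (or a couple) is expected, here is a small
sketch of how a sorted list of missed versions collapses into contiguous
ranges. This is my own illustration, not the PeerSync code, and it uses
consecutive integers even though real Solr version numbers are not
consecutive.

    import java.util.ArrayList;
    import java.util.List;

    // Illustration only: collapse a sorted list of missed versions into
    // [low, high] ranges. A replica that went down once and then buffered
    // correctly on the way back up should leave a single gap, hence a single
    // range to request from the leader.
    final class VersionRangesSketch {
        record Range(long low, long high) {}

        static List<Range> collapse(List<Long> sortedMissedVersions) {
            List<Range> ranges = new ArrayList<>();
            long low = sortedMissedVersions.get(0);
            long prev = low;
            for (long v : sortedMissedVersions.subList(1, sortedMissedVersions.size())) {
                if (v != prev + 1) {              // gap: close the current range
                    ranges.add(new Range(low, prev));
                    low = v;
                }
                prev = v;
            }
            ranges.add(new Range(low, prev));
            return ranges;
        }

        public static void main(String[] args) {
            // One contiguous outage -> one range.
            System.out.println(collapse(List.of(10L, 11L, 12L, 13L)));
            // Intermittently missed versions -> many ranges, like the logs show.
            System.out.println(collapse(List.of(10L, 12L, 15L, 16L, 20L)));
        }
    }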

My apologies for not being able to pinpoint the exact issue.

There is a PeerSyncReplicationTest that can be useful to verify whether
PeerSync is broken. If you can send me your test, I can take a look at it.


Re: Indexing fingerprinting and PeerSync

Posted by Varun Thacker <va...@vthacker.in>.
Hi Pushkar,

So the shard had only 2 replicas, and the leader never changed.

Does anything look odd to you in the log snippet? For example, why would the
leader have numVersions=104904602 and the replica have numVersions=104904618
(16 more than the leader) after the updates were applied?


Re: Indexing fingerprinting and PeerSync

Posted by Pushkar Raste <pu...@gmail.com>.
Hi Varun,
- I noticed in your logs that there are multiple version ranges; this would
happen only if the replica missed versions intermittently. I would expect
only a single range, or a couple of ranges at best. One possible explanation
is that there was one leader when the recovering replica went down, and some
other replica had become the leader by the time the recovering replica came
back up.


- I am not sure which threshold you are referring to. The line you are
pointing to checks whether the recovering replica's highest version is newer
than the leader's highest version. This check happens before the version
diff (version ranges) is computed, and it happens irrespective of whether
the fingerprint check is enabled. The last time I looked at this code there
was a check to ensure the replica's versions and the leader's versions have
enough overlap (I think the heuristic was at least 20% overlap), which I
don't see anymore, though there are still comments lingering about the
overlap.

  While the check to ensure the leader has higher versions than the replica
happens early in PeerSync, the fingerprint check happens only after the
updates are applied to the replica. Think of the version check as a
short-circuit test.
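
For what it's worth, the overlap heuristic mentioned above might have looked
roughly like the sketch below. This is a hedged reconstruction from the
description in this email only; the names, the window-based formula, and the
exact use of the 20% figure are assumptions, and the check reportedly no
longer exists in the current code.

    import java.util.Collections;
    import java.util.List;

    // Rough reconstruction of the overlap heuristic described above: require
    // the replica's recent versions to overlap "enough" (e.g. at least 20%)
    // with the leader's recent versions before attempting PeerSync.
    final class OverlapHeuristicSketch {
        static boolean enoughOverlap(List<Long> ourVersions, List<Long> otherVersions,
                                     double minOverlapFraction) {
            long ourLow = Collections.min(ourVersions);
            long ourHigh = Collections.max(ourVersions);
            long otherLow = Collections.min(otherVersions);
            long otherHigh = Collections.max(otherVersions);

            long overlapLow = Math.max(ourLow, otherLow);
            long overlapHigh = Math.min(ourHigh, otherHigh);
            if (overlapHigh < overlapLow) return false;   // windows do not intersect at all

            double overlap = overlapHigh - overlapLow;
            double ourWindow = ourHigh - ourLow;
            return ourWindow == 0 || overlap / ourWindow >= minOverlapFraction;
        }

        public static void main(String[] args) {
            List<Long> replica = List.of(100L, 150L, 190L);
            List<Long> leader  = List.of(180L, 200L, 260L);
            // Only a thin slice [180, 190] of the replica's window [100, 190] overlaps.
            System.out.println(enoughOverlap(replica, leader, 0.20)); // prints: false
        }
    }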

I made a lot of changes to the PeerSync code some time ago, and I would be
happy to share the details of what I know.
