You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@solr.apache.org by Alex Deparvu <st...@apache.org> on 2023/03/29 17:21:28 UTC

ShardSplitTest flakiness investigation

Hi,

Following David's recent email about the failing tests (and obviously not
knowing any better) I started to look at the ShardSplitTest failures.

The easy part was turning the current NPEs into an actual assertion. The
hard part is that the documents seem to sometimes be missing the _version_
field which causes the test to fail.
If anyone has more pointers into where the cause might be, please let me
know I would be happy to dig further down this rabbit hole.

This is what I have so far https://github.com/apache/solr/pull/1504

best,
alex

Re: ShardSplitTest flakiness investigation

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.
Hi Alex,
Thank you for your investigation. I will be able to take a deeper look at
it by Thursday or Friday. This fix is potentially very important for my
client, Fullstory, so thank you for your work on this on their behalf too.
Regards,
Ishan

On Mon, 10 Apr, 2023, 23:54 Alex Deparvu, <st...@apache.org> wrote:

> Hi,
>
> Gentle reminder for anyone interested to take a look at the PR. I have
> added a lot of detail to this failure (and a fix proposal) and would
> appreciate some feedback, or at least let me know if you are planning on
> taking a look at some point in the near future.
> This would help both with validating some of the findings so far and
> building some more confidence to keep digging into the 'delete' scenario
> failures.
> PR https://github.com/apache/solr/pull/1504
>
> best
> alex
>
>
>
> On Mon, Apr 3, 2023 at 9:49 AM Alex Deparvu <st...@apache.org> wrote:
>
> > An update on the failing test.
> >
> > The failure comes from documents having null version field from the add
> > operations running in parallel with the split shard operation.
> >
> > Digging into this a bit more (and running the test a very large number of
> > times) I think I identified 3 problematic cases that can cause failures.
> > will list them in order of importance.
> >
> > First, documents persisted with null version. I think this is by far the
> > most important one to fix,and my PR is focused on closing this gap.
> > The race condition seems to be when the new shard slice will receive
> > pending updates from the old shard leader and persist them 'as is', but
> the
> > trouble is some of these have zero version. so the docs will get
> persisted
> > without version, breaking the test's consistency checks. My tentative fix
> > is to allow the new slice to act as a master and attach a version.
> > I also added a version integrity check that will not allow
> leader-received
> > documents to be persisted without a version. I am considering moving this
> > behind some sort of system property/flag to disable on demand if it turns
> > out this is too strict.
> > As far as testing is concerned: there was no need for more tests, the
> > current one reproduces this issue easily (unfortunately it did not apply
> > stricter assertions regarding errors on the parallel operations), so I
> made
> > it a tiny bit stricter, in the hope that the test fails early and with a
> > clearer message.
> > I am looking for some thoughts and feedback on the PR itself but also on
> > understanding the impact. This smells like a data corruption situation,
> but
> > being reduced to the split shard operation, so maybe the window is not
> very
> > big. please correct me if I am overstating the problem.
> >
> > Second. Another race condition happens on the deletes as well. but due to
> > the first exception it is invisible. this one is milder, in the sense
> where
> > deletes will be rejected, so less data corruption opportunity. here as
> well
> > there is some opportunity to close this gap. I tried to do it as a part
> of
> > this PR but I hit some weirdness, so instead of sitting on this more, I
> > would prefer to have it reviewed and approach deletes in a subsequent PR.
> >
> > Third, I keep seeing yet another exception "Request says it is coming
> from
> > parent shard leader but we are in active state", which seems to be more
> > buffered requests on the old leader being sent over. I did not get any
> > chance to look at this, so cannot comment on the impact yet.
> >
> >
> > If anyone has interest (or some knowledge) in this area, please take a
> > look at the PR and let me know your thoughts.
> > PR https://github.com/apache/solr/pull/1504
> >
> > best
> > alex
> >
> >
> >
> >
> >
> > On Thu, Mar 30, 2023 at 8:06 AM Kevin Risden <kr...@apache.org> wrote:
> >
> >> I don't have any helpful pointers as to why this might be happening -
> but
> >> I
> >> do want to say thanks for digging in and finding out the cause as well
> as
> >> finding some related old jiras. Its helpful to even just make small
> steps
> >> forward.
> >>
> >> Kevin Risden
> >>
> >>
> >> On Wed, Mar 29, 2023 at 1:21 PM Alex Deparvu <st...@apache.org>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > Following David's recent email about the failing tests (and obviously
> >> not
> >> > knowing any better) I started to look at the ShardSplitTest failures.
> >> >
> >> > The easy part was turning the current NPEs into an actual assertion.
> The
> >> > hard part is that the documents seem to sometimes be missing the
> >> _version_
> >> > field which causes the test to fail.
> >> > If anyone has more pointers into where the cause might be, please let
> me
> >> > know I would be happy to dig further down this rabbit hole.
> >> >
> >> > This is what I have so far https://github.com/apache/solr/pull/1504
> >> >
> >> > best,
> >> > alex
> >> >
> >>
> >
>

Re: ShardSplitTest flakiness investigation

Posted by Alex Deparvu <st...@apache.org>.
Hi,

Gentle reminder for anyone interested to take a look at the PR. I have
added a lot of detail to this failure (and a fix proposal) and would
appreciate some feedback, or at least let me know if you are planning on
taking a look at some point in the near future.
This would help both with validating some of the findings so far and
building some more confidence to keep digging into the 'delete' scenario
failures.
PR https://github.com/apache/solr/pull/1504

best
alex



On Mon, Apr 3, 2023 at 9:49 AM Alex Deparvu <st...@apache.org> wrote:

> An update on the failing test.
>
> The failure comes from documents having null version field from the add
> operations running in parallel with the split shard operation.
>
> Digging into this a bit more (and running the test a very large number of
> times) I think I identified 3 problematic cases that can cause failures.
> will list them in order of importance.
>
> First, documents persisted with null version. I think this is by far the
> most important one to fix,and my PR is focused on closing this gap.
> The race condition seems to be when the new shard slice will receive
> pending updates from the old shard leader and persist them 'as is', but the
> trouble is some of these have zero version. so the docs will get persisted
> without version, breaking the test's consistency checks. My tentative fix
> is to allow the new slice to act as a master and attach a version.
> I also added a version integrity check that will not allow leader-received
> documents to be persisted without a version. I am considering moving this
> behind some sort of system property/flag to disable on demand if it turns
> out this is too strict.
> As far as testing is concerned: there was no need for more tests, the
> current one reproduces this issue easily (unfortunately it did not apply
> stricter assertions regarding errors on the parallel operations), so I made
> it a tiny bit stricter, in the hope that the test fails early and with a
> clearer message.
> I am looking for some thoughts and feedback on the PR itself but also on
> understanding the impact. This smells like a data corruption situation, but
> being reduced to the split shard operation, so maybe the window is not very
> big. please correct me if I am overstating the problem.
>
> Second. Another race condition happens on the deletes as well. but due to
> the first exception it is invisible. this one is milder, in the sense where
> deletes will be rejected, so less data corruption opportunity. here as well
> there is some opportunity to close this gap. I tried to do it as a part of
> this PR but I hit some weirdness, so instead of sitting on this more, I
> would prefer to have it reviewed and approach deletes in a subsequent PR.
>
> Third, I keep seeing yet another exception "Request says it is coming from
> parent shard leader but we are in active state", which seems to be more
> buffered requests on the old leader being sent over. I did not get any
> chance to look at this, so cannot comment on the impact yet.
>
>
> If anyone has interest (or some knowledge) in this area, please take a
> look at the PR and let me know your thoughts.
> PR https://github.com/apache/solr/pull/1504
>
> best
> alex
>
>
>
>
>
> On Thu, Mar 30, 2023 at 8:06 AM Kevin Risden <kr...@apache.org> wrote:
>
>> I don't have any helpful pointers as to why this might be happening - but
>> I
>> do want to say thanks for digging in and finding out the cause as well as
>> finding some related old jiras. Its helpful to even just make small steps
>> forward.
>>
>> Kevin Risden
>>
>>
>> On Wed, Mar 29, 2023 at 1:21 PM Alex Deparvu <st...@apache.org>
>> wrote:
>>
>> > Hi,
>> >
>> > Following David's recent email about the failing tests (and obviously
>> not
>> > knowing any better) I started to look at the ShardSplitTest failures.
>> >
>> > The easy part was turning the current NPEs into an actual assertion. The
>> > hard part is that the documents seem to sometimes be missing the
>> _version_
>> > field which causes the test to fail.
>> > If anyone has more pointers into where the cause might be, please let me
>> > know I would be happy to dig further down this rabbit hole.
>> >
>> > This is what I have so far https://github.com/apache/solr/pull/1504
>> >
>> > best,
>> > alex
>> >
>>
>

Re: ShardSplitTest flakiness investigation

Posted by Alex Deparvu <st...@apache.org>.
An update on the failing test.

The failure comes from documents having null version field from the add
operations running in parallel with the split shard operation.

Digging into this a bit more (and running the test a very large number of
times) I think I identified 3 problematic cases that can cause failures.
will list them in order of importance.

First, documents persisted with null version. I think this is by far the
most important one to fix,and my PR is focused on closing this gap.
The race condition seems to be when the new shard slice will receive
pending updates from the old shard leader and persist them 'as is', but the
trouble is some of these have zero version. so the docs will get persisted
without version, breaking the test's consistency checks. My tentative fix
is to allow the new slice to act as a master and attach a version.
I also added a version integrity check that will not allow leader-received
documents to be persisted without a version. I am considering moving this
behind some sort of system property/flag to disable on demand if it turns
out this is too strict.
As far as testing is concerned: there was no need for more tests, the
current one reproduces this issue easily (unfortunately it did not apply
stricter assertions regarding errors on the parallel operations), so I made
it a tiny bit stricter, in the hope that the test fails early and with a
clearer message.
I am looking for some thoughts and feedback on the PR itself but also on
understanding the impact. This smells like a data corruption situation, but
being reduced to the split shard operation, so maybe the window is not very
big. please correct me if I am overstating the problem.

Second. Another race condition happens on the deletes as well. but due to
the first exception it is invisible. this one is milder, in the sense where
deletes will be rejected, so less data corruption opportunity. here as well
there is some opportunity to close this gap. I tried to do it as a part of
this PR but I hit some weirdness, so instead of sitting on this more, I
would prefer to have it reviewed and approach deletes in a subsequent PR.

Third, I keep seeing yet another exception "Request says it is coming from
parent shard leader but we are in active state", which seems to be more
buffered requests on the old leader being sent over. I did not get any
chance to look at this, so cannot comment on the impact yet.


If anyone has interest (or some knowledge) in this area, please take a look
at the PR and let me know your thoughts.
PR https://github.com/apache/solr/pull/1504

best
alex





On Thu, Mar 30, 2023 at 8:06 AM Kevin Risden <kr...@apache.org> wrote:

> I don't have any helpful pointers as to why this might be happening - but I
> do want to say thanks for digging in and finding out the cause as well as
> finding some related old jiras. Its helpful to even just make small steps
> forward.
>
> Kevin Risden
>
>
> On Wed, Mar 29, 2023 at 1:21 PM Alex Deparvu <st...@apache.org> wrote:
>
> > Hi,
> >
> > Following David's recent email about the failing tests (and obviously not
> > knowing any better) I started to look at the ShardSplitTest failures.
> >
> > The easy part was turning the current NPEs into an actual assertion. The
> > hard part is that the documents seem to sometimes be missing the
> _version_
> > field which causes the test to fail.
> > If anyone has more pointers into where the cause might be, please let me
> > know I would be happy to dig further down this rabbit hole.
> >
> > This is what I have so far https://github.com/apache/solr/pull/1504
> >
> > best,
> > alex
> >
>

Re: ShardSplitTest flakiness investigation

Posted by Kevin Risden <kr...@apache.org>.
I don't have any helpful pointers as to why this might be happening - but I
do want to say thanks for digging in and finding out the cause as well as
finding some related old jiras. Its helpful to even just make small steps
forward.

Kevin Risden


On Wed, Mar 29, 2023 at 1:21 PM Alex Deparvu <st...@apache.org> wrote:

> Hi,
>
> Following David's recent email about the failing tests (and obviously not
> knowing any better) I started to look at the ShardSplitTest failures.
>
> The easy part was turning the current NPEs into an actual assertion. The
> hard part is that the documents seem to sometimes be missing the _version_
> field which causes the test to fail.
> If anyone has more pointers into where the cause might be, please let me
> know I would be happy to dig further down this rabbit hole.
>
> This is what I have so far https://github.com/apache/solr/pull/1504
>
> best,
> alex
>