Posted to dev@solr.apache.org by Alex Deparvu <st...@apache.org> on 2023/06/15 21:49:08 UTC

Flaky DeleteReplicaTest

Hi,

I wanted to take a look at the flaky DeleteReplicaTest.

Some background first:
- Past 7 days trend:
Class: org.apache.solr.cloud.DeleteReplicaTest
Method: raceConditionOnDeleteAndRegisterReplica
Failures: 15.56% (63 / 405)

- Test failure is caused by a NullPointerException:
ERROR (coreZkRegister-772-thread-1-processing-127.0.0.1:40471_solr)
[n:127.0.0.1:40471_solr c:raceDeleteReplicaCollection s:shard1 r:core_node4
x:raceDeleteReplicaCollection_shard1_replica_n2] o.a.s.c.DeleteReplicaTest
Failed to delete replica
 => java.lang.NullPointerException: Cannot invoke
"org.apache.solr.core.CoreContainer.getZkController()" because the return
value of "org.apache.solr.embedded.JettySolrRunner.getCoreContainer()" is
null

I am having some trouble reproducing this locally, and I don't see how the
getCoreContainer() method might return null. Could this be a timing issue
somehow?
If anyone has an idea on how to approach this, I would be happy to hear it.

best,
alex

Re: Flaky DeleteReplicaTest

Posted by Alex Deparvu <st...@apache.org>.
Thank you Mark for your thoughts.

The Docker idea was very interesting, and it greatly simplified the 'beast'
testing, but unfortunately I was still not able to reproduce the failure.

I do, however, have an idea based on code review, outlined in
https://issues.apache.org/jira/browse/SOLR-16848
I would appreciate your thoughts on the analysis.

best,
alex




Re: Flaky DeleteReplicaTest

Posted by Mark Miller <ma...@gmail.com>.
Oh, one more good option for duplicating that type of failure: run it in
Docker, or a VM, or maybe Multipass, and give it anemic resources (though
enough that the test doesn't OOM or something).



-- 
- MRM

Re: Flaky DeleteReplicaTest

Posted by Mark Miller <ma...@gmail.com>.
Why don't you see how it can return null?

I'm looking at an older checkout, but I see JettySolrRunner checking for
null core containers all over, and I see it passing back null explicitly in
at least one case.

When I peek at where that core container might be coming from, I see a
provider and a field that looks like its home (which, I note, doesn't look
protected by any memory barrier, e.g., volatile, lock, sync). And I see
that it could start as null. Does it get set to null on close as well?

So I wonder about that lack of a memory barrier. Another thought I have:
there are probably plenty of cases where some random jobs/threads are still
running past that close. I bet one of them comes in late and looks for that
core container after it has already clocked out.
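
To make the "late thread" idea concrete, here is a minimal sketch. Runner and
Container are hypothetical stand-ins, not the real JettySolrRunner or
CoreContainer, and the latch only exists to force the ordering that would
happen by chance in the flaky run. It assumes the field is cleared on stop;
the field is marked volatile so the write is visible to other threads, and
the background job checks for null instead of dereferencing blindly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-ins for illustration; these are NOT the real Solr classes.
class LateLookupSketch {

  /** Stand-in for a core container. */
  static final class Container {
    String describe() {
      return "container";
    }
  }

  /** Stand-in for a runner that hands out its container and drops it on stop. */
  static final class Runner {
    // volatile: the write in stop() is guaranteed visible to other threads.
    private volatile Container container = new Container();

    Container getContainer() {
      return container;
    }

    void stop() {
      container = null; // assumption: the field is cleared on close
    }
  }

  /** A background job that looks up the container only after the runner stopped. */
  static String lateLookup() {
    Runner runner = new Runner();
    CountDownLatch stopped = new CountDownLatch(1);
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      Future<String> result = pool.submit(() -> {
        stopped.await(); // force the "arrives late" ordering
        Container c = runner.getContainer();
        // An unguarded c.describe() here would throw a NullPointerException.
        return c == null ? "container already gone" : c.describe();
      });
      runner.stop();
      stopped.countDown();
      return result.get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(lateLookup());
  }
}
```

Whether the real fix is a null check, a memory barrier, or waiting for
background jobs before close is a separate question; the sketch only shows
why a late lookup can see null.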

Older checkout, so I don't know what you are looking at, but if it hasn't
changed drastically recently, it seems easy to return a null.

If you want to duplicate a situation that might hit this, try running the
test with 10-20 instances simultaneously, looped.

Or loop one, and hammer your system with some unrelated load for a while.
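
As a rough illustration of the looping idea, here is a small in-process
sketch. BeastSketch, beast() and demo() are hypothetical names, not part of
Solr or its build, and running threads inside one JVM is only an
approximation of running 10-20 separate test instances. The task in demo()
is a stand-in that fails on every fourth call, so the failure count is
predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness for illustration; not part of Solr or its build tooling.
class BeastSketch {

  /** Runs task `iterations` times on each of `workers` threads; returns the failed-run count. */
  static int beast(Callable<Void> task, int workers, int iterations) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    AtomicInteger failures = new AtomicInteger();
    List<Future<?>> pending = new ArrayList<>();
    try {
      for (int w = 0; w < workers; w++) {
        pending.add(pool.submit(() -> {
          for (int i = 0; i < iterations; i++) {
            try {
              task.call();
            } catch (Exception | AssertionError e) {
              failures.incrementAndGet(); // count the failure and keep looping
            }
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // wait for every worker to finish its loop
      }
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
    return failures.get();
  }

  /** Stand-in flaky task: every fourth call overall fails. */
  static int demo() {
    AtomicInteger calls = new AtomicInteger();
    Callable<Void> sometimesFails = () -> {
      if (calls.incrementAndGet() % 4 == 0) {
        throw new IllegalStateException("simulated flaky failure");
      }
      return null;
    };
    return beast(sometimesFails, 10, 20); // 10 workers x 20 iterations = 200 runs
  }

  public static void main(String[] args) {
    System.out.println(demo() + " of 200 runs failed");
  }
}
```

In practice the task would invoke the real test, and unrelated system load
or constrained resources would be layered on top.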



-- 
- MRM