
Posted to dev@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2020/09/16 12:15:20 UTC

I'd like some discussion about the problem outlined in SOLR-14861

All:

The test framework, and perhaps all of Solr, has a disorderly shutdown process. I’ve seen at least one case where this is responsible for “bogus” test failures, bogus in the sense that, due to race conditions, the test failed with unreleased objects. The short form is that our test harness can call CoreContainer.shutdown() directly, and it did so while reload() operations were in flight and had already gotten past the check of CoreContainer.isShutdown(). The reload() thread is then time-sliced out, the shutdown() thread gets partway through, and the reload() thread picks up again, but by then CoreContainer is partly shut down and things go wonky.
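To make the interleaving concrete, here is a minimal sketch of that check-then-act race. It is not Solr code: the class, its fields, and the two latches are hypothetical stand-ins, and the latches exist only to force the unlucky thread scheduling every time.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the race described above; none of this is Solr's actual code.
class CheckThenActRace {
    private volatile boolean shutdownFlag = false;
    private volatile boolean resourcesReleased = false;

    boolean isShutdown() { return shutdownFlag; }

    void shutdown() {
        shutdownFlag = true;
        resourcesReleased = true; // stands in for closing cores, executors, and so on
    }

    // Returns true if reload() ended up running against released resources.
    boolean reload(CountDownLatch passedCheck, CountDownLatch resume) {
        if (isShutdown()) {
            return false;             // rejected cleanly, no harm done
        }
        passedCheck.countDown();      // the check has passed...
        awaitQuietly(resume);         // ...and the thread is "time-sliced out" here
        return resourcesReleased;     // acting on a stale answer from isShutdown()
    }

    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Forces the interleaving: check passes, shutdown runs, reload resumes.
    static boolean demonstrate() {
        CheckThenActRace container = new CheckThenActRace();
        CountDownLatch passedCheck = new CountDownLatch(1);
        CountDownLatch resume = new CountDownLatch(1);
        boolean[] sawReleased = new boolean[1];

        Thread reloader = new Thread(
            () -> sawReleased[0] = container.reload(passedCheck, resume));
        reloader.start();

        awaitQuietly(passedCheck);    // reload() is now past the isShutdown() check
        container.shutdown();         // shutdown runs while reload() is paused
        resume.countDown();           // reload() picks up again
        try {
            reloader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sawReleased[0];
    }

    public static void main(String[] args) {
        System.out.println("reload() ran against a shut-down container: " + demonstrate());
    }
}
```

The point of the sketch is that checking a flag gives no protection once the check has returned; whatever guards the operation has to stay held for the operation's whole duration.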

The focus on CoreContainer.isShutdown() is just for illustration and is somewhat of a legacy problem, since the test harness manipulates things at this level.

Looking through the code, I also see a number of places outside CoreContainer that check the isShutdown flag in CoreContainer, so the problem is more widespread than just CoreContainer.

Don’t look at the patch on that JIRA; the more I think about it, the more it looks like a totally bad approach.

Generically, we need a mechanism that, when we shut Solr down, will:

1> stop any new requests from being processed. IMO they should be rejected immediately.
2> wait for all in-flight operations to complete. This could get tricky if one of the operations is, say, optimize.
3> then shut down.

Then perhaps rework the test harness to use that mechanism rather than call CoreContainer.shutdown() directly.

That said, I don’t have a clue how to make that happen.

Erick
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: I'd like some discussion about the problem outlined in SOLR-14861

Posted by Erick Erickson <er...@gmail.com>.

I put these two e-mails on the JIRA for posterity...

> On Oct 3, 2020, at 2:59 PM, David Smiley <ds...@apache.org> wrote:
> 
> Definitely makes sense to me -- those 3 phases of shutdown.  That is what I think CoreContainer.shutdown() should itself do.  Flag that shutdown has been requested, then wait sufficiently aided by a Semaphore perhaps (note all requests would need to take/release this), then actually do shutdown.



Re: I'd like some discussion about the problem outlined in SOLR-14861

Posted by David Smiley <ds...@apache.org>.

Definitely makes sense to me -- those 3 phases of shutdown.  That is what I
think CoreContainer.shutdown() should itself do.  Flag that shutdown has
been requested, then wait sufficiently aided by a Semaphore perhaps (note
all requests would need to take/release this), then actually do shutdown.
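
As a rough illustration of the Semaphore idea, the three phases might look something like the sketch below. Everything in it is an assumption made for illustration: ShutdownGate, tryEnter(), exit(), and the MAX_PERMITS bound are hypothetical names, not anything that exists in CoreContainer. Each request takes one permit and releases it when done; shutdown sets the flag, then tries to take every permit, which can only succeed once nothing is in flight.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a three-phase shutdown gate; not Solr's actual API.
class ShutdownGate {
    // Assumed upper bound on concurrent requests; shutdown must drain all of these.
    private static final int MAX_PERMITS = 10_000;

    private final Semaphore permits = new Semaphore(MAX_PERMITS);
    private volatile boolean shutdownRequested = false;

    // Called at the start of every request. Returns false if the request must be rejected.
    boolean tryEnter() {
        if (shutdownRequested || !permits.tryAcquire()) {
            return false;
        }
        // Re-check after acquiring, in case shutdown was requested in between.
        if (shutdownRequested) {
            permits.release();
            return false;
        }
        return true;
    }

    // Called when a request that was admitted by tryEnter() finishes.
    void exit() {
        permits.release();
    }

    // Returns true if all in-flight work drained in time and actualShutdown ran.
    // On timeout the flag stays set, so new requests keep being rejected.
    boolean shutdown(long timeout, TimeUnit unit, Runnable actualShutdown) {
        shutdownRequested = true;                                   // phase 1: reject new requests
        try {
            if (!permits.tryAcquire(MAX_PERMITS, timeout, unit)) {  // phase 2: wait for in-flight work
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        actualShutdown.run();                                       // phase 3: actually shut down
        return true;
    }

    public static void main(String[] args) {
        ShutdownGate gate = new ShutdownGate();
        boolean admitted = gate.tryEnter();        // one request is in flight
        boolean drainedEarly = gate.shutdown(50, TimeUnit.MILLISECONDS, () -> {});
        boolean lateAdmitted = gate.tryEnter();    // arrives after shutdown was requested
        gate.exit();                               // the in-flight request finishes
        boolean drained = gate.shutdown(1, TimeUnit.SECONDS,
            () -> System.out.println("shutting down"));
        System.out.println(admitted + " " + drainedEarly + " " + lateAdmitted + " " + drained);
    }
}
```

The timeout is there because of the long-running case Erick mentions (optimize): the caller gets a false back and has to decide whether to keep waiting or force the issue. The wrapping of every request in tryEnter()/exit() is the expensive part of this approach, since all entry points would need it.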

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

