Posted to java-user@lucene.apache.org by Steven Schlansker <st...@gmail.com> on 2022/07/01 20:11:35 UTC
Re: Replicator PrimaryNode waits forever for remotes to close


> On Jun 30, 2022, at 10:40 AM, Michael McCandless <lu...@mikemccandless.com> wrote:
> 
> +1 to provide a timeout, or, to simply fix close to aggressively close regardless of what the replicas are doing?

Yes, aggressively closing would be great for us - we already expect the primary can and will crash, so an aggressive close is no worse than that.
I proposed the timeout on the theory that There Must Be A Reason It Is This Way :) but if the simpler solution is acceptable that's great for us!
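
For illustration only, a timeout-bounded wait could look roughly like the sketch below. This is not Lucene's PrimaryNode code: the class BoundedCloseWait and its methods are hypothetical stand-ins, and the single counter is just an assumption about how "waiting for remotes" might be modeled. The caller decides what to do when the wait times out, for example closing anyway.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a primary that counts outstanding references; not a Lucene class. */
public class BoundedCloseWait {
    private int outstanding; // guarded by "this"

    /** Record that one more remote holds a reference. */
    public synchronized void acquire() {
        outstanding++;
    }

    /** Record that a remote released its reference and wake any waiter. */
    public synchronized void release() {
        if (outstanding <= 0) {
            throw new IllegalStateException("release without matching acquire");
        }
        outstanding--;
        notifyAll();
    }

    /**
     * Wait until every reference is released or the timeout elapses.
     * Returns true if the count reached zero, false if the wait timed out.
     */
    public synchronized boolean awaitAllReleased(long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (outstanding > 0) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false; // give up; the caller may close aggressively
            }
            TimeUnit.NANOSECONDS.timedWait(this, remaining);
        }
        return true;
    }
}
```

With something like this, a close method could log a warning when awaitAllReleased returns false and then proceed, which is no worse for the replicas than the primary crashing.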

> It's not a great design for primary to be so dependent on the replicas (but vice versa makes sense?).

In our case, we use stateless HTTP to do the replication instead of the stateful sockets that the reference implementation uses.
This makes the reference counting for CopyState a little messy but has other benefits that for us outweigh the costs.
So for us, I think this might be the only place the primary depends on the replicas at all, and it'd be wonderful to break that dependency.
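
As a rough sketch of one way reference tracking over stateless requests might be made less fragile (purely an assumption, not the setup described above and not a lucene-replicator API): each reference is recorded under a token with an expiry instead of as a bare count, so an abandoned reference can be identified and reclaimed. The class LeaseTable, its token strings, and the caller-supplied clock value are all hypothetical.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical lease table keyed by token; not part of lucene-replicator. */
public class LeaseTable {
    private final Map<String, Long> expiryNanosByToken = new ConcurrentHashMap<>();
    private final long ttlNanos;

    public LeaseTable(long ttlNanos) {
        this.ttlNanos = ttlNanos;
    }

    /** Record (or renew) a lease for the given token, valid for ttlNanos from nowNanos. */
    public void acquire(String token, long nowNanos) {
        expiryNanosByToken.put(token, nowNanos + ttlNanos);
    }

    /** Release a lease explicitly; returns false if the token was unknown or already expired. */
    public boolean release(String token) {
        return expiryNanosByToken.remove(token) != null;
    }

    /** Drop every lease whose expiry has passed; returns how many were dropped. */
    public int expire(long nowNanos) {
        int dropped = 0;
        Iterator<Map.Entry<String, Long>> it = expiryNanosByToken.entrySet().iterator();
        while (it.hasNext()) {
            // Subtraction keeps the comparison safe if the nanosecond clock wraps.
            if (it.next().getValue() - nowNanos <= 0) {
                it.remove();
                dropped++;
            }
        }
        return dropped;
    }

    /** Number of leases currently held. */
    public int outstanding() {
        return expiryNanosByToken.size();
    }
}
```

Because each lease carries a token, a leaked reference shows up as a named entry rather than as an unexplained nonzero count.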

> Maybe open a Jira issue or start a PR so we can discuss?

I filed https://issues.apache.org/jira/browse/LUCENE-10638 for further discussion. Thanks!

> Thanks for uncovering this and proposing a fix!
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Wed, Jun 29, 2022 at 7:36 PM Steven Schlansker <st...@gmail.com> wrote:
> Hi Lucene fans,
> 
> We use lucene-replicator to copy our indexes from a primary to replica nodes.
> Usually, startup and shutdown are fine. In particular we call PrimaryNode.close.
> 
> But, in some edge cases - dropped connection? IOException? some process crashed? -
> we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never returns.
> 
> I suspect we have a reference counting bug: in some exceptional case, we forget to release our CopyState.
> This definitely should be fixed, but in the meantime, it's very unhelpful for the primary node to never come down.
> 
> I was considering submitting a PR to add a configurable timeout for the shutdown wait - and after the timeout expires,
> continue with closing even though some replicas did not terminate.
> They will possibly crash with an "IOException: directory closed" later, or maybe never come back at all.
> 
> Does this sound like a welcome change? Is there a better way to avoid hanging here, other than to be bug-free?
> It's quite challenging to figure out where the CopyState wasn't released, as only a count is kept.
> 
> Thanks!
> 
> Steven Schlansker
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 

