
Posted to dev@lucene.apache.org by Scott Blum <dr...@gmail.com> on 2016/02/15 22:59:07 UTC

ZK related startup fixes -- pre-review requested?

Hi folks (particularly Erick and Shalin),

Before I go through the cycle of creating JIRAs and requesting formal
review, I wondered if I could get some feedback on some work I've been
doing to allow SolrCloud to start up faster and more reliably.

Problems:

1) Quickly restarting a node makes leader election unreliable; the existing
ZK node hasn't yet disappeared and confuses the current logic.  I believe I
have fixed this and simplified the logic.  This affects overseer election.

2) ZkController.publishAndWaitForDownStates() occurs before overseer
election.  That means if there is currently no overseer, there is
ironically no one to actually service the down state changes it's waiting
on.  This particularly affects a single-node cluster such as you might run
locally for development.

3) Audited our current implementations of process(WatchedEvent) for
consistency and handling edge cases.

4) Simplified DistributedMap; there's a whole lot more API surface area and
implementation machinery than we're using.
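
For illustration only, problem 1 can be modeled with a toy election in
plain Java. This is a sketch under assumptions, not code from the pull
request and not the ZooKeeper API: the class StaleElectionNodeSketch, its
methods, and the session ids are all hypothetical. It assumes that the
lowest sequence number leads, and that a restarted node can recognize its
own leftover entry because it was created under a different session.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

class StaleElectionNodeSketch {
    /** One election entry: who created it and under which (hypothetical) session id. */
    static final class ElectionEntry {
        final String nodeName;
        final long sessionId;

        ElectionEntry(String nodeName, long sessionId) {
            this.nodeName = nodeName;
            this.sessionId = sessionId;
        }
    }

    // In-memory stand-in for the election directory, ordered by sequence number.
    private final TreeMap<Integer, ElectionEntry> entries = new TreeMap<>();
    private int nextSeq = 0;

    /** Adds an entry for this node and returns its sequence number. */
    int join(String nodeName, long sessionId) {
        int seq = nextSeq++;
        entries.put(seq, new ElectionEntry(nodeName, sessionId));
        return seq;
    }

    /** Lowest sequence number wins; returns -1 if nobody has joined. */
    int leaderSeq() {
        return entries.isEmpty() ? -1 : entries.firstKey();
    }

    /** Drops entries left behind by earlier sessions of the same node. */
    int removeStaleEntries(String nodeName, long currentSessionId) {
        int removed = 0;
        Iterator<Map.Entry<Integer, ElectionEntry>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            ElectionEntry e = it.next().getValue();
            if (e.nodeName.equals(nodeName) && e.sessionId != currentSessionId) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        StaleElectionNodeSketch election = new StaleElectionNodeSketch();
        election.join("nodeA", 1L);              // entry from before the restart
        int mySeq = election.join("nodeA", 2L);  // quick restart, new session

        // The old entry has not expired yet, so the restarted node is not first.
        System.out.println("leader before cleanup: " + (election.leaderSeq() == mySeq));
        election.removeStaleEntries("nodeA", 2L);
        System.out.println("leader after cleanup: " + (election.leaderSeq() == mySeq));
    }
}
```

In the sketch, the restarted node is the only live participant yet does not
lead until the leftover entry is gone, which is the kind of confusion
described in 1). Whether a real fix should delete, wait for, or simply
ignore the old entry is a separate design question.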

Code is here: https://github.com/fullstorydev/lucene-solr/pull/1
The individual commits might be informative.
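
As a rough illustration of the ordering issue in problem 2, here is a
single-threaded toy model. It is a sketch under assumptions, not Solr code:
StartupOrderSketch, overseerTick, joinOverseerElection and the maxPolls
parameter are hypothetical, and the method named
publishAndWaitForDownStates only borrows the name of the real ZkController
method. It assumes a single-node cluster, so the only possible overseer is
the node itself.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

class StartupOrderSketch {
    private final Queue<String> stateQueue = new ArrayDeque<>(); // pending "down" requests
    private final Set<String> downCores = new HashSet<>();       // state the overseer has applied
    private boolean overseerElected = false;

    /** On a single-node cluster, joining the election means winning it. */
    void joinOverseerElection() {
        overseerElected = true;
    }

    /** One pass of the overseer loop; does nothing while there is no overseer. */
    private void overseerTick() {
        if (!overseerElected) {
            return;
        }
        String core;
        while ((core = stateQueue.poll()) != null) {
            downCores.add(core);
        }
    }

    /** Queues the down states, then polls up to maxPolls times for them to be applied. */
    boolean publishAndWaitForDownStates(Set<String> cores, int maxPolls) {
        stateQueue.addAll(cores);
        for (int i = 0; i < maxPolls; i++) {
            overseerTick();
            if (downCores.containsAll(cores)) {
                return true;
            }
        }
        return false; // gave up: nobody serviced the queue
    }

    public static void main(String[] args) {
        Set<String> cores = new HashSet<>(Arrays.asList("core1", "core2"));

        StartupOrderSketch publishFirst = new StartupOrderSketch();
        boolean a = publishFirst.publishAndWaitForDownStates(cores, 5);
        publishFirst.joinOverseerElection(); // too late, the wait already gave up
        System.out.println("publish, then elect: " + a); // prints false

        StartupOrderSketch electFirst = new StartupOrderSketch();
        electFirst.joinOverseerElection();
        boolean b = electFirst.publishAndWaitForDownStates(cores, 5);
        System.out.println("elect, then publish: " + b); // prints true
    }
}
```

With the election moved ahead of the wait, the same node that queued the
down states is available to service them, so the wait completes instead of
running out its polling budget.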

I would love some feedback, and if these seem reasonable I'll open one or
more JIRAs and rebase the changes to trunk.

Thanks!
Scott

RE: ZK related startup fixes -- pre-review requested?

Posted by St...@ext.cdiscount.com.

Hi Scott,

Do you think your fix could improve the problems we have seen on Solr 5.4, described in this old issue?
https://issues.apache.org/jira/browse/SOLR-3274?focusedCommentId=15123736&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15123736

Thanks,
Stephan

From: Scott Blum [mailto:dragonsinth@gmail.com]
Sent: Wednesday, February 17, 2016 23:02
To: dev@lucene.apache.org
Cc: Erick Erickson <er...@gmail.com>; Shalin Shekhar Mangar <sh...@lucidworks.com>
Subject: Re: ZK related startup fixes -- pre-review requested?

Awesome, thanks Shalin!

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: ZK related startup fixes -- pre-review requested?

Posted by Scott Blum <dr...@gmail.com>.

https://issues.apache.org/jira/browse/SOLR-8693
https://issues.apache.org/jira/browse/SOLR-8694
https://issues.apache.org/jira/browse/SOLR-8695
https://issues.apache.org/jira/browse/SOLR-8696
https://issues.apache.org/jira/browse/SOLR-8697

The first four should be super easy; the last one is the tougher one.

On Wed, Feb 17, 2016 at 5:01 PM, Scott Blum <dr...@gmail.com> wrote:

> Awesome, thanks Shalin!

Re: ZK related startup fixes -- pre-review requested?

Posted by Scott Blum <dr...@gmail.com>.

Awesome, thanks Shalin!

On Wed, Feb 17, 2016 at 3:21 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> Hi Scott,
>
> Those all sound very important fixes. I skimmed the changes and they
> all look good to me. I think the ZkController changes are
> straightforward. The leader election changes should get some more eyes
> (maybe Mark Miller can chime in) but please do open the jira issues
> (preferably separate ones for easier review+commit).
>
> Thanks!

Re: ZK related startup fixes -- pre-review requested?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Scott,

Those all sound very important fixes. I skimmed the changes and they
all look good to me. I think the ZkController changes are
straightforward. The leader election changes should get some more eyes
(maybe Mark Miller can chime in) but please do open the jira issues
(preferably separate ones for easier review+commit).

Thanks!


-- 
Regards,
Shalin Shekhar Mangar.
