Posted to dev@lucene.apache.org by Scott Blum <dr...@gmail.com> on 2016/11/23 22:05:32 UTC

Massive state-update bottleneck at scale

I've been fighting fires the last day where certain of our Solr nodes will
have long GC pauses that cause them to lose their ZK connection and have
to reconnect.  That would be annoying, but survivable, although obviously
it's something I want to fix.

But what makes it fatal is the current design of the state update queue.

Every time one of our nodes flaps, it ends up shoving thousands of state
updates and leader requests onto the queue, most of them ultimately
futile.  By the time the state is actually published, it's already stale.
At one point we had 400,000 items in the queue and I just had to declare
bankruptcy, delete the entire queue, and elect a new overseer.  Later, we
had 70,000 items from several flaps that took an hour to churn through,
even after I'd shut down the problematic nodes.  Again, almost entirely
useless, repetitive work.

Digging through ZKController and related code, the current model just seems
terribly outdated and non-scalable now.  If a node flaps for just a moment,
do we really need to laboriously mark every core's state down, just to
mark it up again?  What purpose does this serve that isn't already served
by the global live_nodes presence indication and/or leader election nodes?

Rebooting a node creates a similar set of problems: a couple hundred cores
end up generating thousands of ZK operations just to get back to normal
state.

We're at enough of a breaking point that I *have* to do something here for
our own cluster.  I would love to put my head together with some of the
more knowledgeable Solr operations folks to help redesign something that
could land in master and improve scalability for everyone.  I'd also love
to hear about any prior art or experiments folks have done.  And if there
are already efforts in progress to address this very issue, apologies for
being out of the loop.

Thanks!
Scott

Re: Massive state-update bottleneck at scale

Posted by Scott Blum <dr...@gmail.com>.
The second thing I want to look at doing is replacing queued state update
operations with local CAS loops for state format v2 collections, with an
in-process collection-level mutex to ensure that a node isn't contending
with itself.  This would only be for state updates; anything more complex
would still go to the overseer.

Then at least if a Solr node gets kill -9'd it immediately stops hitting ZK
instead of leaving a bunch of garbage in the queue.  This would require
some changes in ZKStateWriter's assumptions.
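
Roughly, the CAS loop I have in mind would read the collection's state.json
znode along with its version, apply the change locally, and write it back
conditionally; a version conflict just means another writer got there first,
so re-read and retry.  A minimal sketch against the raw ZooKeeper client
(LocalStateUpdater and the path handling are hypothetical, not the real
ZkStateWriter code):

import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/** Hypothetical sketch: a node CASes its collection's state.json directly. */
public class LocalStateUpdater {
  private final ZooKeeper zk;
  // In-process per-collection locks so a node never races against itself.
  private final ConcurrentHashMap<String, Object> collectionLocks = new ConcurrentHashMap<>();

  public LocalStateUpdater(ZooKeeper zk) {
    this.zk = zk;
  }

  /**
   * Reads the collection's state znode with its version, applies the mutation
   * locally, and writes back only if the version is unchanged; on a version
   * conflict another writer won the race, so re-read and retry.
   */
  public void updateState(String collection, UnaryOperator<byte[]> mutation)
      throws KeeperException, InterruptedException {
    String path = "/collections/" + collection + "/state.json";
    Object lock = collectionLocks.computeIfAbsent(collection, c -> new Object());
    synchronized (lock) {
      while (true) {
        Stat stat = new Stat();
        byte[] current = zk.getData(path, false, stat);
        byte[] proposed = mutation.apply(current);
        try {
          zk.setData(path, proposed, stat.getVersion()); // the CAS step
          return;
        } catch (KeeperException.BadVersionException lostRace) {
          // Someone else wrote first; loop and re-apply against the fresh state.
        }
      }
    }
  }
}

The nice property is that if the process dies partway through, nothing is left
behind: there's no queue entry, just a conditional write that either happened
or didn't.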


Re: Massive state-update bottleneck at scale

Posted by Scott Blum <dr...@gmail.com>.
On Wed, Nov 23, 2016 at 5:45 PM, Mark Miller <ma...@gmail.com> wrote:

> One thing is, when you reconnect to ZK, it should now efficiently set
> every core as down in a single command, not each core.


Yeah, I backported downnode, but it still actually takes a long time for
the overseer to execute, and there can be a bunch of these in the queue for
the same node.

On Wed, Nov 23, 2016 at 5:53 PM, Mark Miller <ma...@gmail.com> wrote:

> In many cases other nodes need to see a progression of state changes. You
> really have to clear the deck and try to start from 0.


This is exactly the kind of detail I'm looking for.  Can you elaborate?

Unless we can come up with a better idea, my first experiment will be to
try to eliminate the "DOWN" replica state in all practical cases, relying
only on careful management of live_nodes presence.  For example, the
startup sequence (or reconnect sequence) would skip marking replicas down
and just ensure they're ACTIVE or else put them into RECOVERING, join shard
leader elections, and finally join live_nodes when that's done.

What land mines am I likely to run into or existing assumptions am I likely
to violate if I do that?
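
As a straw man, the reconnect sequence I'm describing would be ordered roughly
like the sketch below.  ClusterOps and its methods are placeholders for the
real ZkController plumbing; the only point is the ordering: no DOWN publishes,
recovery and leader election first, live_nodes last.

/**
 * Hypothetical ordering sketch: replicas go straight to RECOVERING (or stay
 * ACTIVE), leader elections are joined, and only then does the node reappear
 * in live_nodes.  None of these methods are existing Solr APIs.
 */
public class ReconnectSequence {

  public interface ClusterOps {
    Iterable<String> localCores();
    boolean isConfirmedActive(String core);  // e.g. leader agrees we're current
    void markRecovering(String core);        // single write, no queued DOWN first
    void joinLeaderElection(String core);
    void createLiveNode();                   // ephemeral /live_nodes entry, last
  }

  private final ClusterOps ops;

  public ReconnectSequence(ClusterOps ops) {
    this.ops = ops;
  }

  public void run() {
    for (String core : ops.localCores()) {
      if (!ops.isConfirmedActive(core)) {
        ops.markRecovering(core);  // skip the DOWN -> RECOVERING -> ACTIVE churn
      }
      ops.joinLeaderElection(core);
    }
    // Absence from live_nodes already tells other nodes we're down, so presence
    // only flips back on once the node is actually ready to serve.
    ops.createLiveNode();
  }
}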

Re: Massive state-update bottleneck at scale

Posted by Walter Underwood <wu...@wunderwood.org>.
Yes, publishing state for other nodes seems like a problem waiting to happen.

Let Zookeeper be the single source of truth and live with the extra traffic.

How about a first change to ignore any second-hand state? Then later we can stop sending it and collapse the queues. Maybe receiving queues could be collapsed as soon as they ignore the second-hand state.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 23, 2016, at 3:08 PM, Mark Miller <ma...@gmail.com> wrote:
> 
> Although, if we fixed the fact that the leader sometimes publishes state for replicas (which I think is a mistake; I worked hard initially to avoid a node ever publishing state for another node), you could at least track the last state published and avoid repeating it over and over pretty easily.

Re: Massive state-update bottleneck at scale

Posted by Mark Miller <ma...@gmail.com>.
Although, if we fixed the fact that the leader sometimes publishes state for
replicas (which I think is a mistake; I worked hard initially to avoid a
node ever publishing state for another node), you could at least track the
last state published and avoid repeating it over and over pretty easily.
On Wed, Nov 23, 2016 at 6:03 PM Mark Miller <ma...@gmail.com> wrote:

> I didn't say stale state though actually. I said state progressions.
-- 
- Mark
about.me/markrmiller

Re: Massive state-update bottleneck at scale

Posted by Mark Miller <ma...@gmail.com>.
I didn't say stale state though actually. I said state progressions.
On Wed, Nov 23, 2016 at 6:03 PM Mark Miller <ma...@gmail.com> wrote:

> Because that's how it works.
-- 
- Mark
about.me/markrmiller

Re: Massive state-update bottleneck at scale

Posted by Mark Miller <ma...@gmail.com>.
Because that's how it works.
On Wed, Nov 23, 2016 at 5:57 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> Why would other nodes need to see stale state?
>
> If they really need intermediate state changes, that sounds like a problem.
- Mark
about.me/markrmiller

Re: Massive state-update bottleneck at scale

Posted by Walter Underwood <wu...@wunderwood.org>.
Why would other nodes need to see stale state?

If they really need intermediate state changes, that sounds like a problem.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 23, 2016, at 2:53 PM, Mark Miller <ma...@gmail.com> wrote:
> 
> In many cases other nodes need to see a progression of state changes. You really have to clear the deck and try to start from 0. 

Re: Massive state-update bottleneck at scale

Posted by Mark Miller <ma...@gmail.com>.
In many cases other nodes need to see a progression of state changes. You
really have to clear the deck and try to start from 0.
On Wed, Nov 23, 2016 at 5:50 PM Walter Underwood <wu...@wunderwood.org>
wrote:

> If the queue is local and the state messages are complete, the local queue
> should only send the latest, most accurate update. The rest can be skipped.
>
> The same could be done on the receiving end. Suck the queue dry, then
> choose the most recent.
>
> If the updates depend on previous updates, it would be a lot more work to
> compile the latest delta.
>
> wunder
>
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
- Mark
about.me/markrmiller

Re: Massive state-update bottleneck at scale

Posted by Walter Underwood <wu...@wunderwood.org>.
If the queue is local and the state messages are complete, the local queue should only send the latest, most accurate update. The rest can be skipped.

The same could be done on the receiving end. Suck the queue dry, then choose the most recent.

If the updates depend on previous updates, it would be a lot more work to compile the latest delta.
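
On the sending side that collapse could be as simple as a map keyed by core,
where a newer message for the same core overwrites the older one.  A minimal
sketch (CoalescingStateBuffer is made up; a real version would key on
collection/shard/replica and hold proper message objects rather than strings):

import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of "only send the latest": repeated state messages for
 * the same core overwrite each other, and only the newest survives until the
 * next flush to the overseer queue.
 */
public class CoalescingStateBuffer {
  // Key: core identifier; value: the most recent state message for it.
  private final Map<String, String> pending = new LinkedHashMap<>();

  /** Queue a state message, replacing any older message for the same core. */
  public synchronized void publish(String coreKey, String stateMessage) {
    pending.put(coreKey, stateMessage);
  }

  /** Drain everything currently pending; the caller sends these to ZK. */
  public synchronized Map<String, String> drain() {
    Map<String, String> batch = new LinkedHashMap<>(pending);
    pending.clear();
    return batch;
  }
}

A flush thread would drain() this every so often and send only the survivors,
and the overseer could apply the same keep-only-the-latest rule per key when
it drains its own queue.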

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 23, 2016, at 2:45 PM, Mark Miller <ma...@gmail.com> wrote:
> 
> I talked about this type of thing with Jessica at Lucene/Solr Revolution. One thing is, when you reconnect to ZK, it should now efficiently set every core as down in a single command, not each core. Beyond that, any single node knows how fast it's sending overseer updates. Each should have a governor. If the rate is too high, a node should know it's best to just give up and assume things are screwed. It could try and reset from ground zero.
> 
> There are other things that can be done, but given the current design, the simplest win is that a replica can easily prevent itself from spamming the overseer queue.
> 
> Mark

Re: Massive state-update bottleneck at scale

Posted by Mark Miller <ma...@gmail.com>.
I talked about this type of thing with Jessica at Lucene/Solr Revolution.
One thing is, when you reconnect to ZK, it should now efficiently set every
core as down in a single command, not each core. Beyond that, any single
node knows how fast it's sending overseer updates. Each should have a
governor. If the rate is too high, a node should know it's best to just give
up and assume things are screwed. It could try and reset from ground zero.

There are other things that can be done, but given the current design, the
simplest win is that a replica can easily prevent itself from spamming the
overseer queue.
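
The governor could be little more than a sliding-window counter per node.  A
rough sketch, where the class name, window, and threshold are all invented for
illustration:

import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical per-node governor: count state publishes in a sliding time
 * window, and once the rate is clearly runaway, stop queueing and signal that
 * the node should reset from scratch instead.
 */
public class OverseerPublishGovernor {
  private final long windowMillis;
  private final int maxPublishesPerWindow;
  private final Deque<Long> recentPublishes = new ArrayDeque<>();

  public OverseerPublishGovernor(long windowMillis, int maxPublishesPerWindow) {
    this.windowMillis = windowMillis;
    this.maxPublishesPerWindow = maxPublishesPerWindow;
  }

  /** Returns true if this publish may be queued; false means give up and reset. */
  public synchronized boolean tryAcquire() {
    long now = System.currentTimeMillis();
    while (!recentPublishes.isEmpty() && now - recentPublishes.peekFirst() > windowMillis) {
      recentPublishes.pollFirst();  // drop timestamps that fell out of the window
    }
    if (recentPublishes.size() >= maxPublishesPerWindow) {
      return false;  // runaway rate: stop spamming and recover from zero instead
    }
    recentPublishes.addLast(now);
    return true;
  }
}

When tryAcquire() returns false, the node would stop queueing per-core updates
and fall back to the reset-from-zero path instead of continuing to spam the
overseer.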

Mark
- Mark
about.me/markrmiller