Posted to dev@solr.apache.org by Mark Miller <ma...@gmail.com> on 2021/09/23 01:05:45 UTC

ZkCmdExecutor

I’m checking that I’m not in some old branch somehow … I’d have sworn
someone got rid of ZkCmdExecutor.

I can’t touch this overseer - I’m dying to see it go - so setting aside the
fact that it’s insane that it goes to zk like this to deal with leadership,
or that it’s half impervious to interrupts or any reasonable shutdown
behavior…

If someone gets an itch towards more proper zk behavior, a decent start is
to kill these fall-off retries.

Zk alerts us via callback when it loses a connection. When the connection
is back, another callback. An unlimited number of locations trying to work
this out on their own is terrible zk usage. In an ideal world, everything
enters a zk quiet mode and re-engages when zk says hello again. A simpler,
shorter-term improvement is to simply sink all the zk calls when they hit
the zk connection manager and don’t let them go until the connection is
restored.
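
To make that concrete, here’s a rough sketch of that shorter-term version.
It is illustrative only - these class and method names are made up, not the
actual ConnectionManager API:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical gate: the ZK Watcher flips the connected flag, and every ZK
// call parks here while we are disconnected instead of retrying on its own.
class ZkConnectionGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition connectedCondition = lock.newCondition();
  private volatile boolean connected = false;

  // Called from the watcher on SyncConnected.
  void onConnected() {
    lock.lock();
    try {
      connected = true;
      connectedCondition.signalAll();
    } finally {
      lock.unlock();
    }
  }

  // Called from the watcher on Disconnected/Expired.
  void onDisconnected() {
    connected = false;
  }

  // Every ZK operation calls this first and blocks quietly until zk says
  // hello again, rather than running its own retry loop.
  void awaitConnected(long timeoutMs) throws InterruptedException, TimeoutException {
    lock.lock();
    try {
      long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      while (!connected) {
        if (nanos <= 0) {
          throw new TimeoutException("Still disconnected from ZooKeeper");
        }
        nanos = connectedCondition.awaitNanos(nanos);
      }
    } finally {
      lock.unlock();
    }
  }
}

Callers would do gate.awaitConnected(timeout) before touching zk instead of
running their own fall-off retry loops.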


1 thread leaked from SUITE scope at org.apache.solr.handler.TestHdfsBackupRestoreCore:
   1) Thread[id=1131, name=OverseerExitThread, state=TIMED_WAITING, group=Overseer state updater.]
        at java.base@11.0.12/java.lang.Thread.sleep(Native Method)
        at app//org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:156)
        at app//org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:89)
        at app//org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:343)
        at app//org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:412)
        at app//org.apache.solr.cloud.Overseer$ClusterStateUpdater$$Lambda$835/0x0000000100902440.run(Unknown Source)
        at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)

com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.handler.TestHdfsBackupRestoreCore:
   1) Thread[id=1131, name=OverseerExitThread, state=TIMED_WAITING, group=Overseer state updater.]
        at java.base@11.0.12/java.lang.Thread.sleep(Native Method)
        at app//org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:156)
        at app//org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:89)
        at app//org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:343)
        at app//org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:412)
        at app//org.apache.solr.cloud.Overseer$ClusterStateUpdater$$Lambda$835/0x0000000100902440.run(Unknown Source)
        at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)
        at __randomizedtesting.SeedInfo.seed([8F6FE499FACF34E4]:0)
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
Okay, I added some basic suggestions to that leader election Jira.

Between everything I’ve dropped in this thread, I don’t see why anyone
could not fix leader election and leader sync up or come up with good
replacements or make good improvements, so I’ll just leave it at that.

Finally, if anyone is interested in why/how I think the current design can
scale very well and with great performance - and why its scalability can't
remotely be judged from the current implementation's performance or by
extrapolating current implementation choices - I've outlined some key points
on implementation changes to the current design that would allow it to scale
and perform. It's not a prescription; it's an example implementation that
can be super scalable, super fast, super stable. Other implementation
choices could also do this. Still others could be even less scalable and
performant than the current one.

In the current system, implementation often gets confused for design.
Sometimes the edges of such things can be fuzzy regardless.

Some people have specific needs they would create a design to favor. Some
have specific features they would design to favor.

I would do a variety of things differently in a new design. The NoSQL-type
features would likely go. Real-time search would not be done a document at
a time. Some of these personal choices would ripple around in a system
design.

This is not the system I would design. It’s an implementation that meets
the current design that scales and performs at a high level.

https://gist.github.com/markrmiller/5cf1dc414b626583ffb25ee1aee914f7

-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
Ilan:

So I have never had any disagreements with your analysis of what does not
work. I have never had any competing designs or approaches. I am not
against different designs.

When I assert this design works and scales, it's mainly to point out that
design is never the problem I've seen here. I've gone through a couple
times to discover all the problems - make everything work and work stable
and fast.

When you say, who cares about all these other things, about scale and
performance, what about fundamental issues that prevent basic good behavior
like leadership and leadership sync up, I have no disagreement there. If
you are using the system today, those are the things I would want addressed.

The problems that I see that have me caring little about other designs are
that those fundamental blocks have been in about the same state since the
system went out. In 2013 and 2014, when I had moved to Cloudera and like
two companies used cloud, I understood why things were in that state.

My charge was HDFS, and when I could, I would scour the email lists for
anyone even trying the system. Almost no one else worked on it. I made
quick hack improvements or additions when I could.

From then on, use took off. The number of developers doing things on cloud
took off. A thousand features and code and improvements and changes came
in. Many companies started making and saving millions of dollars on the
system.

And those fundamental issues remained the same. Most of the core cloud code
itself remained the same. Things got improved here and there. LIR came in,
and then was redone to be even better. But then look at the issues you are
facing that it solves easily with tiny effort but still has not solved.

Look at the code that fuels the basic issues you are concerned with and how
much of it is almost exactly what it always was. Look at the mountain of
development and effort that has been put on top of it.

That is why new designs did not excite me in 2018, 2019 or today. How do
they solve the fact that the current designs could have worked 6 years ago?
4 years ago? 3 years ago?

There are some great memes out there lining me up with Trump. It's all
broken and dilapidated. And only I can fix it. Mexico will pay for it. I'm
a playful person with a child's mind. I'll lean right into that kind of fun.

But partially why I enjoy that humor so much is because I enjoy irony,
absurdity, and sarcasm. That's my language. The opposite of reality is a
fun place.

But the basic, fundamental problem I have seen - and don't see how new
designs or anything that's been suggested to me will fix - is that I'm not
the only one that can fix it. Fixing the most basic problems that everyone
that has used cloud has struggled with, invested huge outside effort or
accepted restrictions working around, gone out of their way to build around
- is not genius work. In many cases it's not even major work. And yet those
fundamental problems and that fundamental code remain little changed. For
ages. And even still. As arguments and bike shedding and tower building have
soared along on top. And so what is wrong? And even if I fixed it all. Fixed
everything else too. What would change about the fundamental issue that has
caused this situation? Am I the only one that can fix it? Two, three years
ago it's a funny dig. Today, things are in the same state.

And that is why I will offer help in whatever you or anyone else someday
attempts to do, better or otherwise. But I don't see things just waiting for
some fix to turn these basic issues around, and so I'm not looking to lead
that effort or design new approaches.

I've got some implementation effort to add for anyone that does want to do
that though.

- Mark

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
I think the other thing is that many devs like to understand what they are
doing and why at a high level rather than reach into the mud much and feel
around.

You will find a lot of devs that will spend a tremendous amount of time
working to solve problems with what they have learned twiddling gc
parameters. You find far fewer that will do the unknown work necessary to
look into GC problems caused by the system and address terrible behavior or
leaks - unless some work incident presents some obvious whale. If it's bad
enough, a customer and employment will set up the mission and params - if
it's bad but not quite that bad, it can live almost forever. I pointed out a
pretty nasty long survivor to Noble, one of the survivors that also had
teeth, so it was easy to remember and relate. The nice thing about Noble, of
course, is no fear, and sometimes we miss the communication link, but if he
catches on to what I'm trying to say, he's like oh man, and he addresses it.
I checked on that one just today; he got it. I've found a variety of things
I had related to him where I don't know if they really registered or not -
for most others they don't - and he went and got them. Nice but rare.

So all that Cloud low level stuff - data loss, naive startup, unnecessary
waits, multiple waits, unnecessary syncs, broken, silly return syncs,
unreliable leadership, unreliable zk ...

It's not something you sit there and reason all out at a high level. Maybe
a work incident occasionally points you to a whale but even that is a high
bar to doing more than very targeted soldering - maybe you do more harm
than good in the small window of your strike mission - so conservative is
your friend.

So the system can be fairly silly at the core and, as long as it's not just
flatly dead in your face, much like most any given test, there is gc that
could be tuned. Review nits that could be found. High level feature
improvements that could be made. Which is why 'you' even running through
and taking notes and spotting issues at a more fundamental level, while it
sounds silly to say, is pretty unusual and a bit of fresh air, and that's
why I often point it out. There is plenty you can spot. There is plenty
that is beyond complexity's mercy that you have to pull out. I found a
great hammer for that, but it requires something. In any case, if you look,
there are enough items, enough complexity, enough unknown requirements to
getting things improved, that most don't even want to risk diving in for an
item like improving how we timeout on startup and ditch data. It went in on
day one. I'm sure it's gotten its dusting once since or something - new
timeout param name, new default timeout, something - but touching too hard
is not safe. It's not warm. Like gc tuning and high level api refining.
There is major fear, because in isolated endeavors there is little time for
understanding and exploration. Sometimes there are more fearless moments;
that is what it took for LIR. But that brings no guarantees either. The
first LIR caused some of the most in-the-user's-face problems - competing
with the problems it addressed, which would be rarely noticed - for some
time, due to the time it took for upgrades beyond that and the time it took
to repeatedly attempt to mitigate it and release those mitigations. The
second LIR was a fantastic improvement, but many years on, it sits well
below even its basic finished state, promise and potential. Isolated
mission. Conservative. Forgotten. You don't set up a new strike team to go
after the terrorist you mostly got, I guess. Those remaining don't have the
same potential; there is comfortable, understandable, customer driven stuff
to do that is much more amenable. Anyway, it's a culture thing, a more
common thing, an employer thing; the rare outliers are just that, and their
value tends to keep them from setting up camp to dig on a system that is
hard to pull satisfaction from.

So anyway, another thing you could look at is ConnectionManager.java.

File:
/mnt/s1/solr3/solr/solrj/src/java/org/apache/solr/common/cloud/ConnectionManager.java
  // Track the likely expired state
  private static class LikelyExpiredState {
    private static LikelyExpiredState NOT_EXPIRED = new LikelyExpiredState(StateType.NOT_EXPIRED, 0);
    private static LikelyExpiredState EXPIRED = new LikelyExpiredState(StateType.EXPIRED, 0);

    public enum StateType {
      NOT_EXPIRED,    // definitely not expired
      EXPIRED,        // definitely expired
      TRACKING_TIME   // not sure, tracking time of last disconnect
    }

    private StateType stateType;
    private long lastDisconnectTime;

    public LikelyExpiredState(StateType stateType, long lastDisconnectTime) {
      this.stateType = stateType;
      this.lastDisconnectTime = lastDisconnectTime;
    }

    public boolean isLikelyExpired(long timeToExpire) {
      return stateType == StateType.EXPIRED
          || (stateType == StateType.TRACKING_TIME
              && (System.nanoTime() - lastDisconnectTime > TimeUnit.NANOSECONDS.convert(timeToExpire, TimeUnit.MILLISECONDS)));
    }
  }

This is how we track 'likelyExpired'. Usually the issue faced is that the
machine is a bit overloaded: dealing with gc pauses that are too long, too
many threads and updates. It's not that a meteor hit Zk server 1 and so 2 is
taking over - that is pretty rare in comparison. But even that probably does
not favor this behavior. The system is having Zk connection problems and our
strategy is to basically say, how long do you think we can ignore it? And
the thing is, ignoring it is not often going to end up so great even in the
best of cases - we need to call out to zk in some surprising places. There
is even a spot in the update chain, with a great comment of shame somewhere,
that calls to ZK directly. But it's also the opposite of what ZK tells you
is the right idea, and they are correct. Back off on connection issues -
chill out - let it come back - then continue. That is, among other reasons,
why retrying the way we do with ZkCmdExecutor is also not a good idea. If
you back off instead, intermittent problems tend to resolve much faster -
not spiral down - and because you just wait a bit rather than kicking
exceptions and failures back to the user right away, the system is much,
much more stable and reliable.
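
As a sketch of the difference (not actual Solr code - this helper is made
up, though ConnectionManager.waitForConnected is real and shows up in the
ZkShardTerms snippet further down): instead of retrying on a delay, the call
parks until the connection manager says we are connected again.

import org.apache.solr.common.cloud.ConnectionManager;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class BackOffInsteadOfRetry {
  // Sketch only: on connection loss, wait for the reconnect callback via the
  // ConnectionManager, then continue - no fall-off retry timer involved.
  static byte[] getDataWhenConnected(ZooKeeper zk, ConnectionManager cm, String path,
      int zkTimeoutMs) throws Exception {
    while (true) {
      try {
        return zk.getData(path, false, null);
      } catch (KeeperException.ConnectionLossException e) {
        cm.waitForConnected(zkTimeoutMs); // chill out - let it come back - then continue
      }
    }
  }
}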

Next, if you dig through the process method of the watcher:

File:
/mnt/s1/solr3/solr/solrj/src/java/org/apache/solr/common/cloud/ConnectionManager.java
  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == AuthFailed || event.getState() == Disconnected || event.getState() == Expired) {
      log.warn("Watcher {} name: {} got event {} path: {} type: {}", this, name, event, event.getPath(), event.getType());
    } else {
      if (log.isDebugEnabled()) {
        log.debug("Watcher {} name: {} got event {} path: {} type: {}", this, name, event, event.getPath(), event.getType());
      }
    }

    if (isClosed()) {
      log.debug("Client->ZooKeeper status change trigger but we are already closed");
      return;
    }

    KeeperState state = event.getState();

    if (state == KeeperState.SyncConnected) {
      log.info("zkClient has connected");
      connected();
      connectionStrategy.connected();
    } else if (state == Expired) {
      if (isClosed()) {
        return;
      }
      // we don't call disconnected here, because we know we are expired
      connected = false;
      likelyExpiredState = LikelyExpiredState.EXPIRED;

      log.warn("Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...");

      if (beforeReconnect != null) {
        try {
          beforeReconnect.command();
        } catch (Exception e) {
          log.warn("Exception running beforeReconnect command", e);
        }
      }

Look at everything we do inline in that process method. Here and there we
have some very small window synchronization or whatever.

Now, you normally don't have to worry about what you do in a watcher
process method. We can say that because every watcher has that notification
fired on a thread from a big fat executor, not the zk event thread. This is
not really typical ZK, but it kind of lets you mitigate and not have to
worry in the future about what you do in that process loop. It also has
plenty of downsides in terms of resource management. The result is a bit of
tough-to-manage chaos. Those watcher events now can come out of order. Or
you limit the executor to one thread and they come in order, but serially.

Anyway, the ConnectionManager, this class, uses a separate executor with 1
thread. So basically, through everything we do in that process method, we
have locked out further ZK connection event notifications. All I can say is
that may not be the best situation for producing the ideal behaviors.
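
A toy illustration of the tradeoff (made-up names, not the real dispatch
code): with one thread the events stay ordered but serial, so whatever runs
in process() delays the next connection-state notification; with a pool the
blocking goes away but events can arrive out of order.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class WatcherDispatchSketch implements Watcher {
  // One thread keeps events in order; swap in a larger pool and you trade
  // the serial bottleneck for out-of-order delivery of connection events.
  private final ExecutorService watcherExecutor = Executors.newSingleThreadExecutor();

  @Override
  public void process(WatchedEvent event) {
    watcherExecutor.submit(() -> handle(event));
  }

  private void handle(WatchedEvent event) {
    // imagine the reconnect/beforeReconnect work from ConnectionManager here;
    // while it runs, later Disconnected/SyncConnected events queue up behind it
  }
}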

This stuff is tricky in the best of cases - on very old code stood up on
very old ideas of what was reasonable, tricky would be an enjoyment.
Especially since, as much complexity and poor behavior as you can find in
each of these classes and distinct functions and implementations, they all
tie together in a complexity multiplication party. Which is why I try to
balance saying that I know it can be addressed with knowing that in many
cases that is likely poor information if taken the wrong way. Curator ;)

You will also see that waitForConnected method I mentioned - I can't
remember if it wants improvement or is fine close to as-is, but if you look,
two random places already use it this way. One is in ZkShardTerms:

File:
/mnt/s1/solr3/solr/core/src/java/org/apache/solr/cloud/ZkShardTerms.java
  private void retryRegisterWatcher() {
    while (!isClosed.get()) {
      try {
        registerWatcher();
        return;
      } catch (KeeperException.SessionExpiredException | KeeperException.AuthFailedException e) {
        isClosed.set(true);
        log.error("Failed watching shard term for collection: {} due to unrecoverable exception", collection, e);
        return;
      } catch (KeeperException e) {
        log.warn("Failed watching shard term for collection: {}, retrying!", collection, e);
        try {
          zkClient.getConnectionManager().waitForConnected(zkClient.getZkClientTimeout());
        } catch (TimeoutException te) {
          if (Thread.interrupted()) {
            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                "Error watching shard term for collection: " + collection, te);
          }
        }
      }
    }
  }

That Dat is sharp as hell, he was on the right path of things left and
right with surely not enough deference or time to devote to it. I tried to
get him back in the game, but I'd already given up my clout and he was
location tricky.

But that is the move.

This ConnectionManager, the Leader Election, the Leader Sync, the
SolrZkClient, the Overseer's distaste and disregard for the rest of the
system - these are the core, long lived, fundamental issues. You can find a
lot of issues in a code base of this size and complexity that has battled
off so many for so long, but a lot of that is much more livable. The fact
that the heart of the system is these heavily flawed and neglected pieces is
where the real meat is. And there is more than I can just pull off the top
of my head taking a quick gander. I didn't find great items by just staring
at code for 30 seconds and using some pre-gained knowledge to turbocharge a
quick glance analysis. I spent a tremendous amount of time with a system
that exposed issues to me, issues I'd never have reasoned out or guessed or
turbo analyzed.

And I won't pretend that I have the best answers for your specific needs
and situation and desires and coworkers and community future. When I say
these things can work, and to the degree they can work, and when I say
Curator can work, or something else I tried on another go can work, that's
meant to be additional information, helpful information, my experience that
I bring back for ill or good. In many cases, I personally would not travel
many of the roads I've said can be made to work. Replacement, design
around, simplify, scale down - I'd look at the whole toolset depending on
all the constraints. To know that many of these things can largely work as
is, or with some specified different component or direction, is from my end
simply more data points in the cap. One of the frequent, bat-to-the-head
takeaways I repeatedly got when working through making something work was
how terrible and trappy a situation it was to begin with even in the best
case. The first time I really started getting to the bottom of stuff, I
stopped thinking so hard about what could work and how well, and I started
really focusing on what the problems are and why, and what to do to combat
those issues in a group of disparate developers of different levels and
code familiarity - much more so than what you needed to do to combat the
system.

But the group of developers around the surface and edges of the cloud core
were even more indignant and outraged with that angle than with the 'it's
pretty damn poor and can be pretty damn good' angle. Those closer to the
core were easy, but already few, and already filtering off and out or locked
up in various ways.

But that is of course the same calculus today as then. If you can develop
and plan defensively, with all of the current mishaps and silliness that
goes on, and looking through it seeds a bunch of better ideas, you can set
up a lot better than by simply fixing and improving the clearly poor bits
into faster, prettier bits. Much more impactful than their current quality
is the story of them and how a new story might end up with different
results.

Which is to say, break things :) Change things. Do things differently. If I
tell you this design can fly and be solid, it's because that is what I have
to tell you. Flying is like one quadrant of 4, flying solid a bit more.
Just parts of the puzzle - the ones I settled into and could enjoy after a
certain point, when the other pieces became too unenjoyable.

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
I think the problem has been a mix of cases. For many, search is not the
system of record, so data loss is not great, but not the end of the world.
For others, the number of issues you could run into caused them to take
different approaches. If you can keep the ZooKeeper connection stable
enough, you can avoid a good chunk of issues, so you might limit core
sizes, throttle updates, try and contain everything relative to the
hardware. Some likely do not notice data loss that happens. Some will have
extensive monitoring and paging and intervene early, with access to either
backups or, in rarer cases, a cross data center pair cluster.

Due to the difficulty of upgrades, perhaps more difficult the further back
you go, I don't know, dev tends to be way ahead of what people are on. Users
were on 4x for an extremely long time when dev had moved versions past the
4x branch. A large number are still on 5x now. Large companies put a lot of
internal effort into different mitigations or workarounds in these cases.
Even if they are more up to date, they are building a project now; doing dev
and waiting for a release is usually not really considered an option.

Other companies that might have had more alignment with that type of work
have their own unique histories and compounding reasons why that was never
really invested in.

In the early days, there were so many more issues you faced; many are just
willing to stomach a lot of pain for the search side, which is actually
pretty decent. It's free, it's Apache, it's in a niche without a lot of
competition. The license and open source model trump a technical decision
many times, and then it's up to the developers to figure it out.

LIR (I know it's a weird name for it now, it made a little more sense with
the impl it replaced) is pretty solid though. There are a few issues, I'll
see if I can dig a couple up.

Yeah, there is probably no reason transient cores could not work with
cloud. I'm sure they would be a trip today; I have little trust in them in
general today. That's part of why I pushed to make SolrCores so efficient
as well - I didn't even want to touch transient cores given the state they
looked to be in. Compounding reasons too: SolrCores are just heavy to load,
heavy to reload. Reload a collection with a lot of SolrCores on a machine
and take a look at those system stats. But making them really cheap is
beyond what anyone would invest in such a project, and it doesn't deal with
winding down large data structures like caches when not being used anyway.
It just aligned with my want to finally see thousands of cores per
container work, all live, and I've always had a distaste for transient
cores given I think they likely have a list of problems and have never had
any first class consideration, dev, or tests. That they would also be nicer
with super fast SolrCore loads is another reason I guess I wasn't too
thrilled with using them as a solution without making SolrCores cheap.

Anyway, I don't see why transient cores couldn't work with Cloud, and
depending on how many you might have to spin up at once, probably just fine
for most use cases given you will have cold caches anyway.

- Mark

On Tue, Oct 5, 2021 at 2:56 AM Ilan Ginzburg <il...@gmail.com> wrote:

> Do typical setups (users) of SolrCloud not care about no data loss in the
> cluster, or are clusters maintained stable enough that these issues do not
> happen? I feel moving to a cloud infra drives more instability and suspect
> these issues becoming more prevalent/problematic then.
>
> Indeed Terms (LIR) and shard leader election serve the same purpose. Maybe
> one could go... I'd like to have a cluster where some replicas are not
> loaded in memory (transient cores) and do not even participate in all the
> ZK activity until they're needed. This would open the door to a
> ridiculously high number of replicas (limited by disk size, not by memory,
> CPU, connections, threads or anything else).
> This would serve well a search hosting use case where most tenants are not
> active at any given time, but there are many of them.
>
> I don't know if it's a realistic evolution of SolrCloud or should be
> considered science fiction at this stage.
>
> Ilan
>
> On Tue, Oct 5, 2021 at 7:33 AM Mark Miller <ma...@gmail.com> wrote:
>
>> Well, the replicas are still waiting for the leader, so not no wait, you
>> just don’t have leaders waiting for full shards that lessens the problem. A
>> leader should have no wait and release any waiting replicas.
>>
>> That core sorter should be looking at LIR to start the leader capable
>> replicas first.
>>
>> Mark
>>
>> On Tue, Oct 5, 2021 at 12:10 AM Mark Miller <ma...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Mon, Oct 4, 2021 at 5:24 AM Ilan Ginzburg <il...@gmail.com> wrote:
>>>
>>>> Thanks Mark for your write ups! This is an area of SolrCloud I'm
>>>> currently actively exploring at work (might publish my notes as well at
>>>> some point).
>>>>
>>>> I think terms value (fitness to become leader) should participate in
>>>> the election node ordering, as well as a terms goal (based on highest known
>>>> term for the shard), and clear options of stalled election vs. data loss
>>>> (and if data loss is not acceptable, an external process automated or human
>>>> likely has to intervene to unblock the stalled election).
>>>>
>>>
>>> There should only be a stalled election if no one can become leader for
>>> some odd reason - agreed it would be good to be able to detect that vs
>>> cycling forever.
>>>
>>> I’ve also always thought that should probably be a configuration. If you
>>> know a replica is absent with more data you just bail (currently it will
>>> wait and then continue) or an operator could configure to continue with the
>>> data you have regardless.
>>>
>>> I used to think about those things, but it’s a boring area to care about
>>> or work on given no one else does.
>>>
>>>
>>>> Even if all updates hit multiple replicas, nothing guarantees that any
>>>> of these copies is present when another replica (without the update)
>>>> starts. If we don't want to wait at startup for other replicas to join an
>>>> election (this can't scale even though CoreSorter does its best... but
>>>> is the most convoluted Comparator I've ever seen) we might need the
>>>> notion of "incomplete leader", i.e. a replica that is the current elected
>>>> leader but that does not have all data (at some later point we might decide
>>>> to accept the loss and consider it's the leader, or when a better
>>>> positioned replica joins, have it become leader). This will require quite
>>>> some assumptions revisiting, so likely should be associated with a
>>>> thorough clean up (and a move to Curator election?).
>>>>
>>>
>>> A new replica that comes up should fit into the above logic - it will
>>> have a term of 0 and other replicas will have higher terms and you will
>>> know from zk - so either you fail shard startup or like today, the other
>>> replicas are not coming and you continue on.
>>>
>>> I don’t think that core sorter does the best sort based on the last time
>>> I looked at it.
>>>
>>> It used to actually matter much more though - because the shard could
>>> not start until all the replicas were up, so with many replicas and shards,
>>> and collections, depending on order you could easily have to start a huge
>>> number of cores to complete a shard before it got moving.
>>>
>>> But that issue should not be nearly the issue that it was. That’s again
>>> from a world with no LIR info. There should be no reason to wait for all
>>> replicas today, no reason to have them all involved in a sync. Any replica
>>> with the highest LIR term can become leader, so sorting to complete shards
>>> might be nice, but it should not be needed like it was. (It still is
>>> needed, but again because we still wait when you don’t need to).
>>>
>>> Technically, you could do a much simpler leader election. With an
>>> overseer, it could just pick the leader. With or without, a replica that
>>> see it has the highest term could just try and create the leader node -
>>> first one wins.
>>>
>>> The current leader election is a recipe zk promotes to avoid a
>>> thundering herd affect - you can have tons of participants and it’s an
>>> efficient flow vs 100 participants fighting to see who creates a zk node
>>> every new election.
>>>
>>> But generally we have 3 replicas. Some outlier users might use more, but
>>> even still it’s not going to be that many.
>>>
>>> Mark
>>>
>>>
>>>> Ilan
>>>>
>>>>
>>>>
>>>> On Sun, Oct 3, 2021 at 4:27 AM Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> I filed
>>>>> https://issues.apache.org/jira/browse/SOLR-15672 Leader Election is
>>>>> flawed - for future reference if anyone looks at tackling leader election
>>>>> issues. I’ll drop a couple notes and random suggestions there
>>>>>
>>>>> Mark
>>>>>
>>>>> On Sat, Oct 2, 2021 at 12:47 PM Mark Miller <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> At some point digging through some of this stuff, I often start to
>>>>>> think, I wonder how good our tests are at catching certain categories of
>>>>>> problems. Various groups of code branches and behaviors.
>>>>>>
>>>>>> I do notice that as I get the test flying, they do start to pick up a
>>>>>> lot more issues. A lot more bugs and bad behavior. And as they start to
>>>>>> near max out, I start feeling a little better about a lot of it. But then
>>>>>> I’m looking at things outside of tests still as well. Using my own tools
>>>>>> and setups, using stuff from others. Being cruel in my expectations. And by
>>>>>> then I’ve come a long way, but I can still find problems. Run into bad
>>>>>> situations. If I push, and when I make it so can push harder, i push even
>>>>>> harder. And I want the damn thing solid. Why come all this way if I can’t
>>>>>> have really and truly solid. And that’s when I reach for collection
>>>>>> creation a mixed with cluster restarts.  How about I shove 100000 SolrCores
>>>>>> 10000 collections right down its mouth on a handful of instances on a
>>>>>> single machine in like a minute timeframe. How about 30 seconds. How about
>>>>>> more collections. How about lower time frames. Vary things around. Let’s
>>>>>> just swamp it and demand the setup eats it in silly time frames and stands
>>>>>> up at the end correct and happy.  And then I start to get to the bottom of
>>>>>> the barrel on what’s subverting my solidness. But as I’ve always said, more
>>>>>> and more targeted for tests along with simpler and more understandable
>>>>>> implementations will also cover a lot more ground. I certainly have pushed
>>>>>> on simpler implementations. I’ve never gotten to the point where I have the
>>>>>> energy and time to just push on more, better and more targeted tests, more
>>>>>> unit tests, more mockito, more awaitability as Tims suggested, etc.
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller
>>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>>
>>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Ilan Ginzburg <il...@gmail.com>.
Do typical setups (users) of SolrCloud not care about no data loss in the
cluster, or are clusters maintained stable enough that these issues do not
happen? I feel moving to a cloud infra drives more instability and suspect
these issues becoming more prevalent/problematic then.

Indeed Terms (LIR) and shard leader election serve the same purpose. Maybe
one could go... I'd like to have a cluster where some replicas are not
loaded in memory (transient cores) and do not even participate in all the
ZK activity until they're needed. This would open the door to a
ridiculously high number of replicas (limited by disk size, not by memory,
CPU, connections, threads or anything else).
This would serve well a search hosting use case where most tenants are not
active at any given time, but there are many of them.

I don't know if it's a realistic evolution of SolrCloud or should be
considered science fiction at this stage.

Ilan

On Tue, Oct 5, 2021 at 7:33 AM Mark Miller <ma...@gmail.com> wrote:

> Well, the replicas are still waiting for the leader, so not no wait, you
> just don’t have leaders waiting for full shards that lessens the problem. A
> leader should have no wait and release any waiting replicas.
>
> That core sorter should be looking at LIR to start the leader capable
> replicas first.
>
> Mark
>
> On Tue, Oct 5, 2021 at 12:10 AM Mark Miller <ma...@gmail.com> wrote:
>
>>
>>
>> On Mon, Oct 4, 2021 at 5:24 AM Ilan Ginzburg <il...@gmail.com> wrote:
>>
>>> Thanks Mark for your write ups! This is an area of SolrCloud I'm
>>> currently actively exploring at work (might publish my notes as well at
>>> some point).
>>>
>>> I think terms value (fitness to become leader) should participate in the
>>> election node ordering, as well as a terms goal (based on highest known
>>> term for the shard), and clear options of stalled election vs. data loss
>>> (and if data loss is not acceptable, an external process automated or human
>>> likely has to intervene to unblock the stalled election).
>>>
>>
>> There should only be a stalled election if no one can become leader for
>> some odd reason - agreed it would be good to be able to detect that vs
>> cycling forever.
>>
>> I’ve also always thought that should probably be a configuration. If you
>> know a replica is absent with more data you just bail (currently it will
>> wait and then continue) or an operator could configure to continue with the
>> data you have regardless.
>>
>> I used to think about those things, but it’s a boring area to care about
>> or work on given no one else does.
>>
>>
>>> Even if all updates hit multiple replicas, nothing guarantees that any
>>> of these copies is present when another replica (without the update)
>>> starts. If we don't want to wait at startup for other replicas to join an
>>> election (this can't scale even though CoreSorter does its best... but
>>> is the most convoluted Comparator I've ever seen) we might need the
>>> notion of "incomplete leader", i.e. a replica that is the current elected
>>> leader but that does not have all data (at some later point we might decide
>>> to accept the loss and consider it's the leader, or when a better
>>> positioned replica joins, have it become leader). This will require quite
>>> some assumptions revisiting, so likely should be associated with a
>>> thorough clean up (and a move to Curator election?).
>>>
>>
>> A new replica that comes up should fit into the above logic - it will
>> have a term of 0 and other replicas will have higher terms and you will
>> know from zk - so either you fail shard startup or like today, the other
>> replicas are not coming and you continue on.
>>
>> I don’t think that core sorter does the best sort based on the last time
>> I looked at it.
>>
>> It used to actually matter much more though - because the shard could not
>> start until all the replicas were up, so with many replicas and shards, and
>> collections, depending on order you could easily have to start a huge
>> number of cores to complete a shard before it got moving.
>>
>> But that issue should not be nearly the issue that it was. That’s again
>> from a world with no LIR info. There should be no reason to wait for all
>> replicas today, no reason to have them all involved in a sync. Any replica
>> with the highest LIR term can become leader, so sorting to complete shards
>> might be nice, but it should not be needed like it was. (It still is
>> needed, but again because we still wait when you don’t need to).
>>
>> Technically, you could do a much simpler leader election. With an
>> overseer, it could just pick the leader. With or without, a replica that
>> see it has the highest term could just try and create the leader node -
>> first one wins.
>>
>> The current leader election is a recipe zk promotes to avoid a thundering
>> herd affect - you can have tons of participants and it’s an efficient flow
>> vs 100 participants fighting to see who creates a zk node every new
>> election.
>>
>> But generally we have 3 replicas. Some outlier users might use more, but
>> even still it’s not going to be that many.
>>
>> Mark
>>
>>
>>> Ilan
>>>
>>>
>>>
>>> On Sun, Oct 3, 2021 at 4:27 AM Mark Miller <ma...@gmail.com>
>>> wrote:
>>>
>>>> I filed
>>>> https://issues.apache.org/jira/browse/SOLR-15672 Leader Election is
>>>> flawed - for future reference if anyone looks at tackling leader election
>>>> issues. I’ll drop a couple notes and random suggestions there
>>>>
>>>> Mark
>>>>
>>>> On Sat, Oct 2, 2021 at 12:47 PM Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> At some point digging through some of this stuff, I often start to
>>>>> think, I wonder how good our tests are at catching certain categories of
>>>>> problems. Various groups of code branches and behaviors.
>>>>>
>>>>> I do notice that as I get the test flying, they do start to pick up a
>>>>> lot more issues. A lot more bugs and bad behavior. And as they start to
>>>>> near max out, I start feeling a little better about a lot of it. But then
>>>>> I’m looking at things outside of tests still as well. Using my own tools
>>>>> and setups, using stuff from others. Being cruel in my expectations. And by
>>>>> then I’ve come a long way, but I can still find problems. Run into bad
>>>>> situations. If I push, and when I make it so can push harder, i push even
>>>>> harder. And I want the damn thing solid. Why come all this way if I can’t
>>>>> have really and truly solid. And that’s when I reach for collection
>>>>> creation a mixed with cluster restarts.  How about I shove 100000 SolrCores
>>>>> 10000 collections right down its mouth on a handful of instances on a
>>>>> single machine in like a minute timeframe. How about 30 seconds. How about
>>>>> more collections. How about lower time frames. Vary things around. Let’s
>>>>> just swamp it and demand the setup eats it in silly time frames and stands
>>>>> up at the end correct and happy.  And then I start to get to the bottom of
>>>>> the barrel on what’s subverting my solidness. But as I’ve always said, more
>>>>> and more targeted for tests along with simpler and more understandable
>>>>> implementations will also cover a lot more ground. I certainly have pushed
>>>>> on simpler implementations. I’ve never gotten to the point where I have the
>>>>> energy and time to just push on more, better and more targeted tests, more
>>>>> unit tests, more mockito, more awaitability as Tims suggested, etc.
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
> - Mark
>
> http://about.me/markrmiller
>

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
Well, the replicas are still waiting for the leader, so it’s not no wait -
you just don’t have leaders waiting for full shards, which lessens the
problem. A leader should have no wait and should release any waiting
replicas.

That core sorter should be looking at LIR to start the leader-capable
replicas first.

Mark

On Tue, Oct 5, 2021 at 12:10 AM Mark Miller <ma...@gmail.com> wrote:

>
>
> On Mon, Oct 4, 2021 at 5:24 AM Ilan Ginzburg <il...@gmail.com> wrote:
>
>> Thanks Mark for your write ups! This is an area of SolrCloud I'm
>> currently actively exploring at work (might publish my notes as well at
>> some point).
>>
>> I think terms value (fitness to become leader) should participate in the
>> election node ordering, as well as a terms goal (based on highest known
>> term for the shard), and clear options of stalled election vs. data loss
>> (and if data loss is not acceptable, an external process automated or human
>> likely has to intervene to unblock the stalled election).
>>
>
> There should only be a stalled election if no one can become leader for
> some odd reason - agreed it would be good to be able to detect that vs
> cycling forever.
>
> I’ve also always thought that should probably be a configuration. If you
> know a replica is absent with more data you just bail (currently it will
> wait and then continue) or an operator could configure to continue with the
> data you have regardless.
>
> I used to think about those things, but it’s a boring area to care about
> or work on given no one else does.
>
>
>> Even if all updates hit multiple replicas, nothing guarantees that any of
>> these copies is present when another replica (without the update) starts.
>> If we don't want to wait at startup for other replicas to join an election
>> (this can't scale even though CoreSorter does its best... but is the
>> most convoluted Comparator I've ever seen) we might need the notion of
>> "incomplete leader", i.e. a replica that is the current elected leader but
>> that does not have all data (at some later point we might decide to accept
>> the loss and consider it's the leader, or when a better positioned replica
>> joins, have it become leader). This will require quite some assumptions
>> revisiting, so likely should be associated with a thorough clean up (and a
>> move to Curator election?).
>>
>
> A new replica that comes up should fit into the above logic - it will have
> a term of 0 and other replicas will have higher terms and you will know
> from zk - so either you fail shard startup or like today, the other
> replicas are not coming and you continue on.
>
> I don’t think that core sorter does the best sort based on the last time I
> looked at it.
>
> It used to actually matter much more though - because the shard could not
> start until all the replicas were up, so with many replicas and shards, and
> collections, depending on order you could easily have to start a huge
> number of cores to complete a shard before it got moving.
>
> But that issue should not be nearly the issue that it was. That’s again
> from a world with no LIR info. There should be no reason to wait for all
> replicas today, no reason to have them all involved in a sync. Any replica
> with the highest LIR term can become leader, so sorting to complete shards
> might be nice, but it should not be needed like it was. (It still is
> needed, but again because we still wait when you don’t need to).
>
> Technically, you could do a much simpler leader election. With an
> overseer, it could just pick the leader. With or without, a replica that
> see it has the highest term could just try and create the leader node -
> first one wins.
>
> The current leader election is a recipe zk promotes to avoid a thundering
> herd affect - you can have tons of participants and it’s an efficient flow
> vs 100 participants fighting to see who creates a zk node every new
> election.
>
> But generally we have 3 replicas. Some outlier users might use more, but
> even still it’s not going to be that many.
>
> Mark
>
>
>> Ilan
>>
>>
>>
>> On Sun, Oct 3, 2021 at 4:27 AM Mark Miller <ma...@gmail.com> wrote:
>>
>>> I filed
>>> https://issues.apache.org/jira/browse/SOLR-15672 Leader Election is
>>> flawed - for future reference if anyone looks at tackling leader election
>>> issues. I’ll drop a couple notes and random suggestions there
>>>
>>> Mark
>>>
>>> On Sat, Oct 2, 2021 at 12:47 PM Mark Miller <ma...@gmail.com>
>>> wrote:
>>>
>>>> At some point digging through some of this stuff, I often start to
>>>> think, I wonder how good our tests are at catching certain categories of
>>>> problems. Various groups of code branches and behaviors.
>>>>
>>>> I do notice that as I get the test flying, they do start to pick up a
>>>> lot more issues. A lot more bugs and bad behavior. And as they start to
>>>> near max out, I start feeling a little better about a lot of it. But then
>>>> I’m looking at things outside of tests still as well. Using my own tools
>>>> and setups, using stuff from others. Being cruel in my expectations. And by
>>>> then I’ve come a long way, but I can still find problems. Run into bad
>>>> situations. If I push, and when I make it so can push harder, i push even
>>>> harder. And I want the damn thing solid. Why come all this way if I can’t
>>>> have really and truly solid. And that’s when I reach for collection
>>>> creation a mixed with cluster restarts.  How about I shove 100000 SolrCores
>>>> 10000 collections right down its mouth on a handful of instances on a
>>>> single machine in like a minute timeframe. How about 30 seconds. How about
>>>> more collections. How about lower time frames. Vary things around. Let’s
>>>> just swamp it and demand the setup eats it in silly time frames and stands
>>>> up at the end correct and happy.  And then I start to get to the bottom of
>>>> the barrel on what’s subverting my solidness. But as I’ve always said, more
>>>> and more targeted for tests along with simpler and more understandable
>>>> implementations will also cover a lot more ground. I certainly have pushed
>>>> on simpler implementations. I’ve never gotten to the point where I have the
>>>> energy and time to just push on more, better and more targeted tests, more
>>>> unit tests, more mockito, more awaitability as Tims suggested, etc.
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
On Mon, Oct 4, 2021 at 5:24 AM Ilan Ginzburg <il...@gmail.com> wrote:

> Thanks Mark for your write ups! This is an area of SolrCloud I'm currently
> actively exploring at work (might publish my notes as well at some point).
>
> I think terms value (fitness to become leader) should participate in the
> election node ordering, as well as a terms goal (based on highest known
> term for the shard), and clear options of stalled election vs. data loss
> (and if data loss is not acceptable, an external process automated or human
> likely has to intervene to unblock the stalled election).
>

There should only be a stalled election if no one can become leader for
some odd reason - agreed it would be good to be able to detect that vs
cycling forever.

I’ve also always thought that should probably be a configuration. If you
know a replica is absent with more data you just bail (currently it will
wait and then continue) or an operator could configure to continue with the
data you have regardless.

I used to think about those things, but it’s a boring area to care about or
work on given no one else does.


> Even if all updates hit multiple replicas, nothing guarantees that any of
> these copies is present when another replica (without the update) starts.
> If we don't want to wait at startup for other replicas to join an election
> (this can't scale even though CoreSorter does its best... but is the most
> convoluted Comparator I've ever seen) we might need the notion of
> "incomplete leader", i.e. a replica that is the current elected leader but
> that does not have all data (at some later point we might decide to accept
> the loss and consider it's the leader, or when a better positioned replica
> joins, have it become leader). This will require quite some assumptions
> revisiting, so likely should be associated with a thorough clean up (and a
> move to Curator election?).
>

A new replica that comes up should fit into the above logic - it will have
a term of 0 and other replicas will have higher terms, and you will know
from zk - so either you fail shard startup or, like today, the other
replicas are not coming and you continue on.

I don’t think that core sorter does the best sort based on the last time I
looked at it.

It used to actually matter much more though - because the shard could not
start until all the replicas were up, so with many replicas and shards, and
collections, depending on order you could easily have to start a huge
number of cores to complete a shard before it got moving.

But that issue should not be nearly the issue that it was. That’s again
from a world with no LIR info. There should be no reason to wait for all
replicas today, no reason to have them all involved in a sync. Any replica
with the highest LIR term can become leader, so sorting to complete shards
might be nice, but it should not be needed like it was. (It still is
needed, but again because we still wait when you don’t need to).

Technically, you could do a much simpler leader election. With an overseer,
it could just pick the leader. With or without, a replica that sees it has
the highest term could just try to create the leader node - first one wins.
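
A rough sketch of that idea (illustrative only - these are not the actual
election or ZkShardTerms APIs): check that we hold the highest known term,
then race to create the leader node.

import java.util.Map;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

class FirstOneWinsSketch {
  // Only a replica holding the highest known term should even try.
  static boolean hasHighestTerm(Map<String, Long> termsByReplica, String myName) {
    long highest = termsByReplica.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    return termsByReplica.getOrDefault(myName, 0L) >= highest;
  }

  // First replica to create the ephemeral leader node wins; the others see
  // NodeExistsException and can simply watch the node instead.
  static boolean tryClaimLeader(ZooKeeper zk, String leaderPath, byte[] replicaInfo)
      throws KeeperException, InterruptedException {
    try {
      zk.create(leaderPath, replicaInfo, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      return true;
    } catch (KeeperException.NodeExistsException e) {
      return false;
    }
  }
}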

The current leader election is a recipe zk promotes to avoid a thundering
herd effect - you can have tons of participants and it’s an efficient flow
vs 100 participants fighting to see who creates a zk node every new
election.

But generally we have 3 replicas. Some outlier users might use more, but
even still it’s not going to be that many.

Mark


> Ilan
>
>
>
> On Sun, Oct 3, 2021 at 4:27 AM Mark Miller <ma...@gmail.com> wrote:
>
>> I filed
>> https://issues.apache.org/jira/browse/SOLR-15672 Leader Election is
>> flawed - for future reference if anyone looks at tackling leader election
>> issues. I’ll drop a couple notes and random suggestions there
>>
>> Mark
>>
>> On Sat, Oct 2, 2021 at 12:47 PM Mark Miller <ma...@gmail.com>
>> wrote:
>>
>>> At some point digging through some of this stuff, I often start to
>>> think, I wonder how good our tests are at catching certain categories of
>>> problems. Various groups of code branches and behaviors.
>>>
>>> I do notice that as I get the test flying, they do start to pick up a
>>> lot more issues. A lot more bugs and bad behavior. And as they start to
>>> near max out, I start feeling a little better about a lot of it. But then
>>> I’m looking at things outside of tests still as well. Using my own tools
>>> and setups, using stuff from others. Being cruel in my expectations. And by
>>> then I’ve come a long way, but I can still find problems. Run into bad
>>> situations. If I push, and when I make it so can push harder, i push even
>>> harder. And I want the damn thing solid. Why come all this way if I can’t
>>> have really and truly solid. And that’s when I reach for collection
>>> creation a mixed with cluster restarts.  How about I shove 100000 SolrCores
>>> 10000 collections right down its mouth on a handful of instances on a
>>> single machine in like a minute timeframe. How about 30 seconds. How about
>>> more collections. How about lower time frames. Vary things around. Let’s
>>> just swamp it and demand the setup eats it in silly time frames and stands
>>> up at the end correct and happy.  And then I start to get to the bottom of
>>> the barrel on what’s subverting my solidness. But as I’ve always said, more
>>> and more targeted for tests along with simpler and more understandable
>>> implementations will also cover a lot more ground. I certainly have pushed
>>> on simpler implementations. I’ve never gotten to the point where I have the
>>> energy and time to just push on more, better and more targeted tests, more
>>> unit tests, more mockito, more awaitability as Tims suggested, etc.
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Ilan Ginzburg <il...@gmail.com>.
Thanks Mark for your write ups! This is an area of SolrCloud I'm currently
actively exploring at work (might publish my notes as well at some point).

I think terms value (fitness to become leader) should participate in the
election node ordering, as well as a terms goal (based on highest known
term for the shard), and clear options of stalled election vs. data loss
(and if data loss is not acceptable, an external process automated or human
likely has to intervene to unblock the stalled election).

Even if all updates hit multiple replicas, nothing guarantees that any of
these copies is present when another replica (without the update) starts.
If we don't want to wait at startup for other replicas to join an election
(this can't scale even though CoreSorter does its best... but is the most
convoluted Comparator I've ever seen) we might need the notion of
"incomplete leader", i.e. a replica that is the current elected leader but
that does not have all data (at some later point we might decide to accept
the loss and consider it's the leader, or when a better positioned replica
joins, have it become leader). This will require quite some assumptions
revisiting, so likely should be associated with a thorough clean up (and a
move to Curator election?).

Ilan



On Sun, Oct 3, 2021 at 4:27 AM Mark Miller <ma...@gmail.com> wrote:

> I filed
> https://issues.apache.org/jira/browse/SOLR-15672 Leader Election is
> flawed - for future reference if anyone looks at tackling leader election
> issues. I’ll drop a couple notes and random suggestions there
>
> Mark
>
> On Sat, Oct 2, 2021 at 12:47 PM Mark Miller <ma...@gmail.com> wrote:
>
>> At some point digging through some of this stuff, I often start to think,
>> I wonder how good our tests are at catching certain categories of problems.
>> Various groups of code branches and behaviors.
>>
>> I do notice that as I get the test flying, they do start to pick up a lot
>> more issues. A lot more bugs and bad behavior. And as they start to near
>> max out, I start feeling a little better about a lot of it. But then I’m
>> looking at things outside of tests still as well. Using my own tools and
>> setups, using stuff from others. Being cruel in my expectations. And by
>> then I’ve come a long way, but I can still find problems. Run into bad
>> situations. If I push, and when I make it so can push harder, i push even
>> harder. And I want the damn thing solid. Why come all this way if I can’t
>> have really and truly solid. And that’s when I reach for collection
>> creation a mixed with cluster restarts.  How about I shove 100000 SolrCores
>> 10000 collections right down its mouth on a handful of instances on a
>> single machine in like a minute timeframe. How about 30 seconds. How about
>> more collections. How about lower time frames. Vary things around. Let’s
>> just swamp it and demand the setup eats it in silly time frames and stands
>> up at the end correct and happy.  And then I start to get to the bottom of
>> the barrel on what’s subverting my solidness. But as I’ve always said, more
>> and more targeted for tests along with simpler and more understandable
>> implementations will also cover a lot more ground. I certainly have pushed
>> on simpler implementations. I’ve never gotten to the point where I have the
>> energy and time to just push on more, better and more targeted tests, more
>> unit tests, more mockito, more awaitability as Tims suggested, etc.
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
> - Mark
>
> http://about.me/markrmiller
>

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
I filed
https://issues.apache.org/jira/browse/SOLR-15672 Leader Election is flawed
- for future reference, if anyone looks at tackling leader election issues.
I'll drop a couple of notes and random suggestions there.

Mark

On Sat, Oct 2, 2021 at 12:47 PM Mark Miller <ma...@gmail.com> wrote:

> At some point digging through some of this stuff, I often start to think,
> I wonder how good our tests are at catching certain categories of problems.
> Various groups of code branches and behaviors.
>
> I do notice that as I get the test flying, they do start to pick up a lot
> more issues. A lot more bugs and bad behavior. And as they start to near
> max out, I start feeling a little better about a lot of it. But then I’m
> looking at things outside of tests still as well. Using my own tools and
> setups, using stuff from others. Being cruel in my expectations. And by
> then I’ve come a long way, but I can still find problems. Run into bad
> situations. If I push, and when I make it so can push harder, i push even
> harder. And I want the damn thing solid. Why come all this way if I can’t
> have really and truly solid. And that’s when I reach for collection
> creation a mixed with cluster restarts.  How about I shove 100000 SolrCores
> 10000 collections right down its mouth on a handful of instances on a
> single machine in like a minute timeframe. How about 30 seconds. How about
> more collections. How about lower time frames. Vary things around. Let’s
> just swamp it and demand the setup eats it in silly time frames and stands
> up at the end correct and happy.  And then I start to get to the bottom of
> the barrel on what’s subverting my solidness. But as I’ve always said, more
> and more targeted for tests along with simpler and more understandable
> implementations will also cover a lot more ground. I certainly have pushed
> on simpler implementations. I’ve never gotten to the point where I have the
> energy and time to just push on more, better and more targeted tests, more
> unit tests, more mockito, more awaitability as Tims suggested, etc.
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
At some point digging through some of this stuff, I often start to think: I
wonder how good our tests are at catching certain categories of problems,
various groups of code branches and behaviors.

I do notice that as I get the tests flying, they do start to pick up a lot
more issues. A lot more bugs and bad behavior. And as they start to near
max out, I start feeling a little better about a lot of it. But then I’m
looking at things outside of tests still as well. Using my own tools and
setups, using stuff from others. Being cruel in my expectations. And by
then I’ve come a long way, but I can still find problems. Run into bad
situations. If I push, and when I make it so I can push harder, I push even
harder. And I want the damn thing solid. Why come all this way if I can’t
have really and truly solid? And that’s when I reach for collection
creation mixed with cluster restarts. How about I shove 100,000 SolrCores in
10,000 collections right down its mouth on a handful of instances on a
single machine in something like a one-minute timeframe. How about 30
seconds. How about more collections. How about lower time frames. Vary things
around. Let’s just swamp it and demand the setup eats it in silly time frames
and stands up at the end correct and happy. And then I start to get to the
bottom of the barrel on what’s subverting my solidness. But as I’ve always
said, more and better-targeted tests, along with simpler and more
understandable implementations, will also cover a lot more ground. I
certainly have pushed on simpler implementations. I’ve never gotten to the
point where I have the energy and time to just push on more, better and more
targeted tests, more unit tests, more Mockito, more Awaitility as Tim
suggested, etc.
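
(As a small sketch of the Awaitility style being referred to; the static
"cluster" handle and the collection name are illustrative assumptions in the
spirit of a SolrCloudTestCase-style test, not a specific existing test:)

import static org.awaitility.Awaitility.await;
import java.util.concurrent.TimeUnit;

// Replaces a raw Thread.sleep/poll loop: poll a condition with a hard timeout.
// 'cluster' is assumed to be the MiniSolrCloudCluster from the test base class.
static void waitForCollectionVisible() {
  await().atMost(30, TimeUnit.SECONDS)
         .pollInterval(250, TimeUnit.MILLISECONDS)
         .until(() -> cluster.getSolrClient().getClusterStateProvider()
                             .getClusterState().hasCollection("test_collection"));
}
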
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
Tim was referring to code that addresses those issues from the ref branch.

I’ve been trying to remember the items that Ilan has brought up; for some
reason I thought this was the third one, but I can only come up with shard
leader loss and overseer loss, other than leader sync.

I also recalled I have slide tools for this type of thing, so here is a quick
browse through leader election. It’s a bit more irreverent because I can only
skim the surface of the complexity involved and because there are no real
small-effort, impactful improvements.

https://www.solrdev.io/leader-election-adventure.html


On Sat, Oct 2, 2021 at 7:51 AM David Smiley <ds...@apache.org> wrote:

> I just want to say that I appreciate the insights you shared over the last
> couple days.  By "copy paste", was Tim referring to copying your insights
> and pasting them into the code?  This is what I was thinking.  Or at least
> some way to make these insights more durable / findable.  Could be a link
> from the code into maybe a wiki page or something.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Fri, Oct 1, 2021 at 11:02 PM Mark Miller <ma...@gmail.com> wrote:
>
>> Tim hit me with the obvious question here.
>>
>> “I’m assuming there are reasons, but what about a little copy past on
>> some of these issues that you mentioned.”
>>
>> I say the obvious question because I kind of flippantly jump through some
>> lines of code and then say and then you just do a, b and c and that’s the
>> ballgame.
>>
>> There are a lot of reasons I can’t cut and paste though. And I can open
>> almost any class and annotate a similar set of issues. So without diving
>> into all the reasons, I would have already if it was so simple. I can
>> certainly help address some things, lean on existing code and efforts, but
>> at the moment I’m in a position where the best I have is to work on things
>> as needed by outside pressures,  items or demands.
>>
>> If I see others improving or redoing any of this core cloud code though,
>> I’d certainly lend a hand on those efforts. Outside of making changes based
>> on external needs, I just got out from under the solo kamakize, and i cant
>> dive back in without it being on contained items and goals that satisfies
>> someone’s needs or joining an existing multi crew effort or goal.
>>
>> If I had to randomly pull threads, repeat efforts yet one more time, and
>> funnel that work through a gauntlet of uninvolved, good intentioned
>> developers, neither me nor anyone else would be pleased.
>>
>> Mark
>>
>> On Fri, Oct 1, 2021 at 2:17 PM Mark Miller <ma...@gmail.com> wrote:
>>
>>> That covers a lot of current silliness you will see, pretty simply as
>>> most of it comes down remove silly stuff, but you can find some related
>>> wildness in ZkController#register.
>>>
>>> // check replica's existence in clusterstate first
>>>
>>> zkStateReader.waitForState(collection, 100, TimeUnit.MILLISECONDS,
>>>     (collectionState) -> getReplicaOrNull(collectionState, shardId, coreZkNodeName) != null);
>>>
>>> 100ms wait, no biggie, and at least it uses waitForState, but we should not need to get our own clusterstate from zk so here care about waiting for this here - if there is an item of data we need, it should have been passed into the core create call.
>>>
>>> Next we get the shard terms object so we can later create our shard terms entry (LIR).
>>>
>>> Slow and bug inducing complicated to have each replica do this here, fighting each other to add an initial entry. You can create the initial shard terms for a replica when you create or update the clusterstate (term {replicaname=0}), and you can do it in
>>>
>>> a single zk call.
>>>
>>>
>>> // in this case, we want to wait for the leader as long as the leader might
>>> // wait for a vote, at least - but also long enough that a large cluster has
>>> // time to get its act together
>>> String leaderUrl = getLeader(cloudDesc, leaderVoteWait + 600000);
>>>
>>> Now we do getLeader, a polling operation that should not be, and wait possibly forever for it. As I mention there should be little wait at most in the notes on leader sync, there should be little wait here. It's also
>>>
>>> one of a variety of places that even if you remove the polling, sucks to wait on. I'm a fan of thousands of cores per machine not being an issue. In many of these cases, you can't achieve that and have 1000 threads hanging out
>>>
>>> all over even if they are not blind polling. This is one of the simpler cases where that can be addressed. I break this method into two and I enhance zkstatereader waitforstate functionality. I allow you to pass a runnable to execute
>>>
>>> when zkstatereader is notified and the given predicate matches. So no need for 1000's or hundreds or dozens of slackers here. Do a couple base register items, call wait for state with a runnable that calls the second part of the logic
>>>
>>> when a leader comes into zkstatereader and go away. We can't eat up threads like this in all these cases.
>>>
>>> Now you can also easily shutdown and reload cores and do various things that are currently harassed by various waits like this slacking off in these wait loops.
>>>
>>>
>>>
>>> The rest is just continuation of this game when it comes to leader selection and finalization and collection creation and replica spin up. You make zkstatereader actually efficient. You make multiple and lazy collections work appropriately,
>>>
>>> and not super inefficient.
>>>
>>> You make leader election a sensible bit of code. As part of zkstatereader sensibility you remove the need for a billion client based watches in zk and in many cases the need for a thousand watcher implementations and instances.
>>>
>>> You let the components dictate how often requests go to services and coalesce dependent code requests instead of letting the dependents dictate service request cadence and size, and you do a lot less sillines like serialize large json
>>>
>>> structures for bit size data updates, and scaling to 10's of k and even 100's of k replicas and collections is doable even
>>>
>>> on single machines and a handful of Solr instances, say nothing about pulling in more hardware. Everything required is cheap cheap cheap. It's the mountain of unrequired that is expensive expensive expensive.
>>>
>>>
>>> On Fri, Oct 1, 2021 at 12:47 PM Mark Miller <ma...@gmail.com>
>>> wrote:
>>>
>>>> Ignoring lots of polling, inefficiencies, early defensive raw sleeps,
>>>> various races and bugs and a laundry list of items involved in making
>>>> leader processes good enough to enter a collection creation contest, here
>>>> is a more practical small set of notes off the top of my head on a quick
>>>> inspection around what is currently just in your face non sensible.
>>>>
>>>> https://gist.github.com/markrmiller/233119ba84ce39d39960de0f35e79fc9
>>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by David Smiley <ds...@apache.org>.
I just want to say that I appreciate the insights you shared over the last
couple days.  By "copy paste", was Tim referring to copying your insights
and pasting them into the code?  This is what I was thinking.  Or at least
some way to make these insights more durable / findable.  Could be a link
from the code into maybe a wiki page or something.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Oct 1, 2021 at 11:02 PM Mark Miller <ma...@gmail.com> wrote:

> Tim hit me with the obvious question here.
>
> “I’m assuming there are reasons, but what about a little copy past on some
> of these issues that you mentioned.”
>
> I say the obvious question because I kind of flippantly jump through some
> lines of code and then say and then you just do a, b and c and that’s the
> ballgame.
>
> There are a lot of reasons I can’t cut and paste though. And I can open
> almost any class and annotate a similar set of issues. So without diving
> into all the reasons, I would have already if it was so simple. I can
> certainly help address some things, lean on existing code and efforts, but
> at the moment I’m in a position where the best I have is to work on things
> as needed by outside pressures,  items or demands.
>
> If I see others improving or redoing any of this core cloud code though,
> I’d certainly lend a hand on those efforts. Outside of making changes based
> on external needs, I just got out from under the solo kamakize, and i cant
> dive back in without it being on contained items and goals that satisfies
> someone’s needs or joining an existing multi crew effort or goal.
>
> If I had to randomly pull threads, repeat efforts yet one more time, and
> funnel that work through a gauntlet of uninvolved, good intentioned
> developers, neither me nor anyone else would be pleased.
>
> Mark
>
> On Fri, Oct 1, 2021 at 2:17 PM Mark Miller <ma...@gmail.com> wrote:
>
>> That covers a lot of current silliness you will see, pretty simply as
>> most of it comes down remove silly stuff, but you can find some related
>> wildness in ZkController#register.
>>
>> // check replica's existence in clusterstate first
>>
>> zkStateReader.waitForState(collection, 100, TimeUnit.MILLISECONDS,
>>     (collectionState) -> getReplicaOrNull(collectionState, shardId, coreZkNodeName) != null);
>>
>> 100ms wait, no biggie, and at least it uses waitForState, but we should not need to get our own clusterstate from zk so here care about waiting for this here - if there is an item of data we need, it should have been passed into the core create call.
>>
>> Next we get the shard terms object so we can later create our shard terms entry (LIR).
>>
>> Slow and bug inducing complicated to have each replica do this here, fighting each other to add an initial entry. You can create the initial shard terms for a replica when you create or update the clusterstate (term {replicaname=0}), and you can do it in
>>
>> a single zk call.
>>
>>
>> // in this case, we want to wait for the leader as long as the leader might
>> // wait for a vote, at least - but also long enough that a large cluster has
>> // time to get its act together
>> String leaderUrl = getLeader(cloudDesc, leaderVoteWait + 600000);
>>
>> Now we do getLeader, a polling operation that should not be, and wait possibly forever for it. As I mention there should be little wait at most in the notes on leader sync, there should be little wait here. It's also
>>
>> one of a variety of places that even if you remove the polling, sucks to wait on. I'm a fan of thousands of cores per machine not being an issue. In many of these cases, you can't achieve that and have 1000 threads hanging out
>>
>> all over even if they are not blind polling. This is one of the simpler cases where that can be addressed. I break this method into two and I enhance zkstatereader waitforstate functionality. I allow you to pass a runnable to execute
>>
>> when zkstatereader is notified and the given predicate matches. So no need for 1000's or hundreds or dozens of slackers here. Do a couple base register items, call wait for state with a runnable that calls the second part of the logic
>>
>> when a leader comes into zkstatereader and go away. We can't eat up threads like this in all these cases.
>>
>> Now you can also easily shutdown and reload cores and do various things that are currently harassed by various waits like this slacking off in these wait loops.
>>
>>
>>
>> The rest is just continuation of this game when it comes to leader selection and finalization and collection creation and replica spin up. You make zkstatereader actually efficient. You make multiple and lazy collections work appropriately,
>>
>> and not super inefficient.
>>
>> You make leader election a sensible bit of code. As part of zkstatereader sensibility you remove the need for a billion client based watches in zk and in many cases the need for a thousand watcher implementations and instances.
>>
>> You let the components dictate how often requests go to services and coalesce dependent code requests instead of letting the dependents dictate service request cadence and size, and you do a lot less sillines like serialize large json
>>
>> structures for bit size data updates, and scaling to 10's of k and even 100's of k replicas and collections is doable even
>>
>> on single machines and a handful of Solr instances, say nothing about pulling in more hardware. Everything required is cheap cheap cheap. It's the mountain of unrequired that is expensive expensive expensive.
>>
>>
>> On Fri, Oct 1, 2021 at 12:47 PM Mark Miller <ma...@gmail.com>
>> wrote:
>>
>>> Ignoring lots of polling, inefficiencies, early defensive raw sleeps,
>>> various races and bugs and a laundry list of items involved in making
>>> leader processes good enough to enter a collection creation contest, here
>>> is a more practical small set of notes off the top of my head on a quick
>>> inspection around what is currently just in your face non sensible.
>>>
>>> https://gist.github.com/markrmiller/233119ba84ce39d39960de0f35e79fc9
>>>
>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
> - Mark
>
> http://about.me/markrmiller
>

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
Tim hit me with the obvious question here.

“I’m assuming there are reasons, but what about a little copy paste on some
of these issues that you mentioned.”

I say the obvious question because I kind of flippantly jump through some
lines of code and then say you just do a, b and c and that’s the ballgame.

There are a lot of reasons I can’t cut and paste though. And I can open
almost any class and annotate a similar set of issues. So without diving
into all the reasons, I would have already if it was so simple. I can
certainly help address some things, lean on existing code and efforts, but
at the moment I’m in a position where the best I have is to work on things
as needed by outside pressures, items, or demands.

If I see others improving or redoing any of this core cloud code though,
I’d certainly lend a hand on those efforts. Outside of making changes based
on external needs, I just got out from under the solo kamikaze, and I can’t
dive back in without it being on contained items and goals that satisfy
someone’s needs, or joining an existing multi-crew effort or goal.

If I had to randomly pull threads, repeat efforts yet one more time, and
funnel that work through a gauntlet of uninvolved, well-intentioned
developers, neither I nor anyone else would be pleased.

Mark

On Fri, Oct 1, 2021 at 2:17 PM Mark Miller <ma...@gmail.com> wrote:

> That covers a lot of current silliness you will see, pretty simply as most
> of it comes down remove silly stuff, but you can find some related wildness
> in ZkController#register.
>
> // check replica's existence in clusterstate first
>
> zkStateReader.waitForState(collection, 100, TimeUnit.MILLISECONDS,
>     (collectionState) -> getReplicaOrNull(collectionState, shardId, coreZkNodeName) != null);
>
> 100ms wait, no biggie, and at least it uses waitForState, but we should not need to get our own clusterstate from zk so here care about waiting for this here - if there is an item of data we need, it should have been passed into the core create call.
>
> Next we get the shard terms object so we can later create our shard terms entry (LIR).
>
> Slow and bug inducing complicated to have each replica do this here, fighting each other to add an initial entry. You can create the initial shard terms for a replica when you create or update the clusterstate (term {replicaname=0}), and you can do it in
>
> a single zk call.
>
>
> // in this case, we want to wait for the leader as long as the leader might
> // wait for a vote, at least - but also long enough that a large cluster has
> // time to get its act together
> String leaderUrl = getLeader(cloudDesc, leaderVoteWait + 600000);
>
> Now we do getLeader, a polling operation that should not be, and wait possibly forever for it. As I mention there should be little wait at most in the notes on leader sync, there should be little wait here. It's also
>
> one of a variety of places that even if you remove the polling, sucks to wait on. I'm a fan of thousands of cores per machine not being an issue. In many of these cases, you can't achieve that and have 1000 threads hanging out
>
> all over even if they are not blind polling. This is one of the simpler cases where that can be addressed. I break this method into two and I enhance zkstatereader waitforstate functionality. I allow you to pass a runnable to execute
>
> when zkstatereader is notified and the given predicate matches. So no need for 1000's or hundreds or dozens of slackers here. Do a couple base register items, call wait for state with a runnable that calls the second part of the logic
>
> when a leader comes into zkstatereader and go away. We can't eat up threads like this in all these cases.
>
> Now you can also easily shutdown and reload cores and do various things that are currently harassed by various waits like this slacking off in these wait loops.
>
>
>
> The rest is just continuation of this game when it comes to leader selection and finalization and collection creation and replica spin up. You make zkstatereader actually efficient. You make multiple and lazy collections work appropriately,
>
> and not super inefficient.
>
> You make leader election a sensible bit of code. As part of zkstatereader sensibility you remove the need for a billion client based watches in zk and in many cases the need for a thousand watcher implementations and instances.
>
> You let the components dictate how often requests go to services and coalesce dependent code requests instead of letting the dependents dictate service request cadence and size, and you do a lot less sillines like serialize large json
>
> structures for bit size data updates, and scaling to 10's of k and even 100's of k replicas and collections is doable even
>
> on single machines and a handful of Solr instances, say nothing about pulling in more hardware. Everything required is cheap cheap cheap. It's the mountain of unrequired that is expensive expensive expensive.
>
>
> On Fri, Oct 1, 2021 at 12:47 PM Mark Miller <ma...@gmail.com> wrote:
>
>> Ignoring lots of polling, inefficiencies, early defensive raw sleeps,
>> various races and bugs and a laundry list of items involved in making
>> leader processes good enough to enter a collection creation contest, here
>> is a more practical small set of notes off the top of my head on a quick
>> inspection around what is currently just in your face non sensible.
>>
>> https://gist.github.com/markrmiller/233119ba84ce39d39960de0f35e79fc9
>>
>
>
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
That covers a lot of the current silliness you will see, and it's pretty
simple, as most of it comes down to removing silly stuff, but you can find
some related wildness in ZkController#register.

// check replica's existence in clusterstate first

zkStateReader.waitForState(collection, 100, TimeUnit.MILLISECONDS,
    (collectionState) -> getReplicaOrNull(collectionState, shardId, coreZkNodeName) != null);

A 100ms wait, no biggie, and at least it uses waitForState, but we
should not need to get our own clusterstate from zk here, or care about
waiting for it here - if there is an item of data we need, it should
have been passed into the core create call.

Next we get the shard terms object so we can later create our shard
terms entry (LIR).

It is slow and bug-inducingly complicated to have each replica do this here,
with replicas fighting each other to add an initial entry. You can create the
initial shard terms for a replica when you create or update the
clusterstate (term {replicaname=0}), and you can do it in a single zk call.
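
As a rough sketch of what "a single zk call" could look like with ZooKeeper's
multi API; the paths, payloads, and version handling here are illustrative
assumptions, not the exact Solr layout (an existing terms node would be a
setData rather than a create):

import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// One multi-op: publish the updated state.json and seed the replica's term in
// the same transaction, so replicas never race to add their own initial entry.
void publishReplicaAndTerm(ZooKeeper zk, byte[] updatedStateJson, int stateVersion)
    throws Exception {
  List<Op> ops = List.of(
      Op.setData("/collections/col1/state.json", updatedStateJson, stateVersion),
      Op.create("/collections/col1/terms/shard1",
                "{\"core_node1\":0}".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
  zk.multi(ops);  // both changes apply atomically, or neither does
}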


// in this case, we want to wait for the leader as long as the leader might
// wait for a vote, at least - but also long enough that a large cluster has
// time to get its act together
String leaderUrl = getLeader(cloudDesc, leaderVoteWait + 600000);

Now we do getLeader, a polling operation that should not be one, and we wait
possibly forever for it. As I mention in the notes on leader sync, there
should be little wait at most, and there should be little wait here. It's
also one of a variety of places that, even if you remove the polling, sucks
to wait on. I'm a fan of thousands of cores per machine not being an issue.
In many of these cases, you can't achieve that and have 1000 threads hanging
out all over, even if they are not blind polling.

This is one of the simpler cases where that can be addressed. I break this
method into two and I enhance the zkstatereader waitForState functionality.
I allow you to pass a runnable to execute when zkstatereader is notified and
the given predicate matches. So there is no need for thousands or hundreds or
dozens of slackers here. Do a couple of base register items, call waitForState
with a runnable that calls the second part of the logic when a leader comes
into zkstatereader, and go away. We can't eat up threads like this in all
these cases.

Now you can also easily shut down and reload cores and do various things that
are currently harassed by various waits like this slacking off in these wait
loops. A sketch of the idea is below.
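
A minimal sketch of that approach, assuming the DocCollectionWatcher hook that
ZkStateReader exposes (returning true removes the watcher); the helper name and
wiring are illustrative, not the existing ZkController code:

import java.util.function.Predicate;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.ZkStateReader;

static void onState(ZkStateReader reader, String collection,
                    Predicate<DocCollection> predicate, Runnable continuation) {
  reader.registerDocCollectionWatcher(collection, state -> {
    if (state != null && predicate.test(state)) {
      continuation.run();  // e.g. the second half of ZkController#register
      return true;         // true removes the watcher; no thread was parked waiting
    }
    return false;          // keep watching; the caller returned long ago
  });
}

// Usage sketch: instead of blocking in getLeader, register and return immediately.
// 'continueRegister' is a hypothetical second half of the register logic.
// onState(zkStateReader, collection,
//     c -> c.getLeader(shardId) != null,
//     () -> continueRegister(core, shardId));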



The rest is just a continuation of this game when it comes to leader
selection and finalization and collection creation and replica spin-up. You
make zkstatereader actually efficient. You make multiple and lazy collections
work appropriately, and not be super inefficient.

You make leader election a sensible bit of code. As part of zkstatereader
sensibility you remove the need for a billion client-based watches in zk,
and in many cases the need for a thousand watcher implementations and
instances.

You let the components dictate how often requests go to services and coalesce
dependent code requests, instead of letting the dependents dictate service
request cadence and size (a sketch of this coalescing idea follows below),
and you do a lot less silliness, like serializing large json structures for
bit-sized data updates. Then scaling to tens of thousands and even hundreds
of thousands of replicas and collections is doable even on single machines
and a handful of Solr instances, to say nothing about pulling in more
hardware. Everything required is cheap cheap cheap. It's the mountain of
unrequired that is expensive expensive expensive.


On Fri, Oct 1, 2021 at 12:47 PM Mark Miller <ma...@gmail.com> wrote:

> Ignoring lots of polling, inefficiencies, early defensive raw sleeps,
> various races and bugs and a laundry list of items involved in making
> leader processes good enough to enter a collection creation contest, here
> is a more practical small set of notes off the top of my head on a quick
> inspection around what is currently just in your face non sensible.
>
> https://gist.github.com/markrmiller/233119ba84ce39d39960de0f35e79fc9
>


-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
Ignoring lots of polling, inefficiencies, early defensive raw sleeps,
various races and bugs, and a laundry list of items involved in making
leader processes good enough to enter a collection creation contest, here
is a more practical small set of notes, off the top of my head, from a quick
inspection of what is currently just-in-your-face nonsensical.

https://gist.github.com/markrmiller/233119ba84ce39d39960de0f35e79fc9

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
On Thu, Sep 30, 2021 at 3:31 AM Ilan Ginzburg <il...@gmail.com> wrote:

> Independent of how interactions with ZK are implemented (direct or via
> Curator), we should first clean up what these interactions do or expect.
>
> Take shard leader elector. First a replica is elected, then we check if it
> is fit for the job, run another election if not, look at other replicas
> (hopefully) participating in the election, wait a bit more (total wait can
> be 6 minutes), then might decide that an unfit leader is still fit…
>

Personally, I looked at that and saw a whole different set of problems that
had to be solved. No one around was on the same page as me there though, and
with everyone else interested in sitting down and coming up with new
designs, I tend to cut out fast when I don’t feel something is going
somewhere. That was a different day though; different people, different
agenda.

I will say, I have code (nothing even close to a patch, or practically
relevant here), code that follows the whiteboard of that design and that is
sub-second if there is no data to be synced, and very damn fast even if
there is.

That’s the kind of talk that tends to be taken as me promoting or defending
some design, but I’m pretty design agnostic, unless something somehow makes
things impossible.

When I say the same design, that doesn’t mean it does all the same steps.
Just that it follows the same design that Yonik drove the whiteboard of. I
drove the broad impl while Yonik hit critical blocks, and then as my rubber
hit the road, I’d hammer him as needed and lots of back and forth would
fill in the details. For a host of reasons, the impl would be a very rough
and broad sketch of the actual whiteboard design.

Some of the least dev time was spent on that leader sync process. Just as
one example, the leader syncs to replicas and then asks replicas to sync to
the leader. That second phase is, I believe, kind of silly, messed up, and
also unnecessary. Which is a common theme.

I’m surprised to hear it can take 6 minutes. It’s hard to remember where
every random thing is in main. At the start, as kind of a prop, we would do
some ridiculous waiting, being very conservative about preventing super easy
large data loss, with no code implemented to do anything sensible.

These days, leader-initiated recovery is there to fill that gap. As you can
say about everything, it has some issues, but fundamentally filling that
gap is not one of them.

Then peersync can be much faster, with some details tweaked - Yonik code, so
it always ends up being more about adjusting the block and positioning it
than its fundamental structure. Replication: plenty of ugly, slow,
inefficient. RecoveryStrategy: a mess of a class, but you’d still recognize
it in my code. Leader election: again, same fundamental design,
recognizable, but fast, stable, efficient. Plenty of that kind of silly,
messed up, unnecessary, and you name it.

So, same intended design, separated by a whole hell of a lot of changes. If
there was a yearly search engine derby that pitted such processes against
each other, I’d march over with glee. I would probably be riddled with
excitement at the prospect.

So my feeling: pick whatever design fixes or changes you think will
produce a working system. Unless it’s unworkable craziness, the impl of
everything will matter 50x, so just nail that and the rest will be fine.

>
> Before moving this to curator, we should likely simplify the approach or
> it might not look good on curator.
>

When I did curator, I changed plenty. Still the same fundamental design, but
it was impossible not to look at the possibilities and its algorithms and
kind of go to town.

That was a bit of a luxury though. The mechanics of community, resources,
collaboration, bike shed painting, existing framework forward momentum...
anyone that navigates through with such ambitious plans at this point in
time will have a huge pile of my admiration.


> I’m not that worried about Autoscaling (removed in main) or Overseer
> (removed in main if you set the right config).
>

Oh, I had no interest in autoscaling relative to many, many, many things.
That’s really just a stand-in for a variety of ambitious higher layers that
AB has a talent for, and that the system had a distaste for. It just pains
my sensibilities. A business will have needs and customers and the things a
business will have. And a developer will be assigned to go turn those needs
into code - and it’s quite frustrating when those forces create situations
where a good design, a solid honest effort, someone with a knack for such
implementations, is not going to put very good utility or efficiency out
into the world, given those systems often really, really need a solid
foundation. Not that it’s some huge injustice, but I am very prejudiced
against such waste. I like to see good work by good people harnessed into
good things. This is why I ended up running from private development and
into Lucene.

>
> Many other things to worry about though (for example cluster state cache
> maintained async on all nodes at cost of heavy ZK usage on every change).
>

That was honestly one of the easier items, not that it took 5 minutes. I
keep trying to get people to sit down with a pen and paper and sketch out
what actually has to be communicated. How often. What data structures
actually have to move. It’s about 100x less than what goes on in almost
every dimension. Zk and that design are so damn fast and scalable, oh man. To
me it’s the same as the other stuff. Pick a design; they are all the same
to me unless something is fundamentally ridiculous. As long as that design
does not do 100x more than makes sense, and inefficiently even at that bar,
it will be fine.


IMO, the problem is trying to come up with a design that fits the rest of
the system and its expectations and connections, and often its problems and
inefficiencies. I feel like, as often seems to be the case, designs are
likely going to be guided by trying to come up with something that kind of
attempts to mitigate, perhaps at grander and grander scales, but always
with such potential to be compromised by the structure it wants to join and
strengthen.

There are a surprising number of behaviors and features and sql engines and
... well, let’s just say, I think the best hope on such an endeavor would be
to get wide permission for an axe, and a lot of sad people with various
attachments and dependencies on all the things that get disregarded.

That’s why I just went through everything. Fix it all. Make it all work.
Make it all efficient and fast. Leave no man but the ridiculous behind. Now,
that process is not easy. But it puts you in a situation to really do some
interesting things that are not compromised or heavily reduced and scaled
down, or... anyway, it’s not practical information, but if you make it all
good you are in a position to do something great. I never saw any other path
that wasn’t likely to be heavily compromised and unsatisfying, or
essentially a no-holds-barred reboot. I was never into a reboot without
first getting to the bottom of the boot. I’ve seen that
let’s-just-do-version-2 game played before.

Unfortunately, the world is set up such that I can’t reasonably make the
trade-offs to even really do anything with the work I’ve done at a scale
that would make sense. I think for similar reasons, large scale work on Solr
proper has probably seen its most active days.

So yeah, everyone has always brought up that we need some designs, that we
need to get everyone together and start planning it out. I say go to it.
That type of collaboration has not gone on for a while, but I don’t think
you will find anyone who would object to it.

Personally, I’d let others lead hashing out any designs. It’s easier to get
more people, of all kinds, in on that.

I think the implementation ends up being way more important and ends up
with far fewer resources, so I’d sign up for some contribution there. The
impl will float any design but the silly or unworkable very nicely if given
the fuel.

Mark



> Ilan
>
> On Thu 30 Sep 2021 at 01:02, Mark Miller <ma...@gmail.com> wrote:
>
>> You actually capture most of the history of cloud there AB.
>>
>> ZK is the heart of the system. It’s a rare chance you get the time or
>> financing to lay that out on something that will be used.
>>
>> I didn’t get it done, changed jobs, and that mostly closed the window on
>> that.
>>
>> Then you have a poor heart that would take a god amount of time and
>> experience for anyone to really fully understand all the nuts and bolts of,
>> even if you stood it up.  And it’s about the equivalent of a poorly written
>> concurrent program.
>>
>> So when you come along and try to put something like autoscaling on it,
>> it’s going to subvert you the whole way. And unless you are going to change
>> auto scaling to discover and rework all the problems in the heart of the
>> system, not a lot you can do about it. And that completely ignores the
>> overseer end of it.
>>
>> It’s a shame, I could setup a great heart to put something like auto
>> scaling on for you now. But the ship has sailed. Very hard to claw that
>> back and the world has adjusted to to getting what they can from what is.
>>
>> But yeah, curator is a huge improvement on a variety of those issues. And
>> I invested enough into to know it’s good. It’s fast. It’s better and more
>> apis and algorithms - documented. Maintained and pushed forward by a
>> separate group dedicated to the task.
>>
>> But I can tell you, it’s by no means some kind of Rubik’s cube, but it is
>> no small lift.
>>
>> Mark
>>
>> On Wed, Sep 29, 2021 at 9:13 AM Mark Miller <ma...@gmail.com>
>> wrote:
>>
>>> I very much agree. That code is the root of a very surprising amount of
>>> evil and has been for a surprisingly long time.
>>>
>>> There is a long list of reasons that I won’t iterate of why I don’t see
>>> that as likely happening though - just starting with Ive brought it up to
>>> various people over a couple years and gotten pushback just at the top.
>>> Roughly, it’s on the scale of work and invasiveness, even with some
>>> incremental paths, that I don’t see the path or resources to seriously
>>> consider it myself. You can go back through jira history for quite a while
>>> before you find that kind of item not looking out of place.
>>>
>>> Mark
>>>
>>> On Wed, Sep 29, 2021 at 2:05 AM Andrzej Białecki <ab...@getopt.org> wrote:
>>>
>>>> +1 to start working towards using Curator, this is long overdue and
>>>> sooner or later we need to eat this frog - as you dig deeper and deeper it
>>>> turns out that many issues in Solr can be attributed to our home-grown ZK
>>>> code, there are maybe 2 people on the Solr team who understand what’s going
>>>> on there (and I’m certainly not one of them!). And the maintenance cost is
>>>> just too high over time.
>>>>
>>>> —
>>>>
>>>> Andrzej Białecki
>>>>
>>>> On 28 Sep 2021, at 21:31, Mark Miller <ma...@gmail.com> wrote:
>>>>
>>>> P.S. this is not actually the zookeeper design I would submit to any
>>>> competition :)
>>>>
>>>> I’ve gone different routes in addressing the zookeeper short fall. This
>>>> one is relatively easy, impactful and isolated for the right developer.
>>>>
>>>> Personally, with fewer scale and isolation limits, by the far the best
>>>> thing I’ve done is remove almost all of our zk recipes and custom stuff and
>>>> use Apache curator and replace our stuff as well as improve and expand on
>>>> things using their large stable of well behaving recipes. I don’t think raw
>>>> zookeeper is good for a project of more than a few people at most. But I
>>>> wouldn’t toss that out there, it’s a much larger undertaking, no one is
>>>> going to bite on that in passing.
>>>>
>>>> Mark
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>>>
>>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Ilan Ginzburg <il...@gmail.com>.
Independent of how interactions with ZK are implemented (direct or via
Curator), we should first clean up what these interactions do or expect.

Take the shard leader elector. First a replica is elected, then we check if
it is fit for the job, run another election if not, look at other replicas
(hopefully) participating in the election, wait a bit more (the total wait
can be 6 minutes), and then we might decide that an unfit leader is still fit…

Before moving this to curator, we should likely simplify the approach or it
might not look good on curator.

I’m not that worried about Autoscaling (removed in main) or Overseer
(removed in main if you set the right config).

Many other things to worry about though (for example cluster state cache
maintained async on all nodes at cost of heavy ZK usage on every change).

Ilan

On Thu 30 Sep 2021 at 01:02, Mark Miller <ma...@gmail.com> wrote:

> You actually capture most of the history of cloud there AB.
>
> ZK is the heart of the system. It’s a rare chance you get the time or
> financing to lay that out on something that will be used.
>
> I didn’t get it done, changed jobs, and that mostly closed the window on
> that.
>
> Then you have a poor heart that would take a god amount of time and
> experience for anyone to really fully understand all the nuts and bolts of,
> even if you stood it up.  And it’s about the equivalent of a poorly written
> concurrent program.
>
> So when you come along and try to put something like autoscaling on it,
> it’s going to subvert you the whole way. And unless you are going to change
> auto scaling to discover and rework all the problems in the heart of the
> system, not a lot you can do about it. And that completely ignores the
> overseer end of it.
>
> It’s a shame, I could setup a great heart to put something like auto
> scaling on for you now. But the ship has sailed. Very hard to claw that
> back and the world has adjusted to to getting what they can from what is.
>
> But yeah, curator is a huge improvement on a variety of those issues. And
> I invested enough into to know it’s good. It’s fast. It’s better and more
> apis and algorithms - documented. Maintained and pushed forward by a
> separate group dedicated to the task.
>
> But I can tell you, it’s by no means some kind of Rubik’s cube, but it is
> no small lift.
>
> Mark
>
> On Wed, Sep 29, 2021 at 9:13 AM Mark Miller <ma...@gmail.com> wrote:
>
>> I very much agree. That code is the root of a very surprising amount of
>> evil and has been for a surprisingly long time.
>>
>> There is a long list of reasons that I won’t iterate of why I don’t see
>> that as likely happening though - just starting with Ive brought it up to
>> various people over a couple years and gotten pushback just at the top.
>> Roughly, it’s on the scale of work and invasiveness, even with some
>> incremental paths, that I don’t see the path or resources to seriously
>> consider it myself. You can go back through jira history for quite a while
>> before you find that kind of item not looking out of place.
>>
>> Mark
>>
>> On Wed, Sep 29, 2021 at 2:05 AM Andrzej Białecki <ab...@getopt.org> wrote:
>>
>>> +1 to start working towards using Curator, this is long overdue and
>>> sooner or later we need to eat this frog - as you dig deeper and deeper it
>>> turns out that many issues in Solr can be attributed to our home-grown ZK
>>> code, there are maybe 2 people on the Solr team who understand what’s going
>>> on there (and I’m certainly not one of them!). And the maintenance cost is
>>> just too high over time.
>>>
>>> —
>>>
>>> Andrzej Białecki
>>>
>>> On 28 Sep 2021, at 21:31, Mark Miller <ma...@gmail.com> wrote:
>>>
>>> P.S. this is not actually the zookeeper design I would submit to any
>>> competition :)
>>>
>>> I’ve gone different routes in addressing the zookeeper short fall. This
>>> one is relatively easy, impactful and isolated for the right developer.
>>>
>>> Personally, with fewer scale and isolation limits, by the far the best
>>> thing I’ve done is remove almost all of our zk recipes and custom stuff and
>>> use Apache curator and replace our stuff as well as improve and expand on
>>> things using their large stable of well behaving recipes. I don’t think raw
>>> zookeeper is good for a project of more than a few people at most. But I
>>> wouldn’t toss that out there, it’s a much larger undertaking, no one is
>>> going to bite on that in passing.
>>>
>>> Mark
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>>
>>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
> - Mark
>
> http://about.me/markrmiller
>

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
You actually capture most of the history of cloud there AB.

ZK is the heart of the system. It’s a rare chance you get the time or
financing to lay that out on something that will be used.

I didn’t get it done, changed jobs, and that mostly closed the window on
that.

Then you have a poor heart that would take a good amount of time and
experience for anyone to really fully understand all the nuts and bolts of,
even if you stood it up. And it’s about the equivalent of a poorly written
concurrent program.

So when you come along and try to put something like autoscaling on it,
it’s going to subvert you the whole way. And unless you are going to change
auto scaling to discover and rework all the problems in the heart of the
system, not a lot you can do about it. And that completely ignores the
overseer end of it.

It’s a shame; I could set up a great heart to put something like auto
scaling on for you now. But the ship has sailed. It’s very hard to claw that
back, and the world has adjusted to getting what they can from what is.

But yeah, curator is a huge improvement on a variety of those issues. And I
invested enough into it to know it’s good. It’s fast. It has better and more
apis and algorithms - documented. Maintained and pushed forward by a
separate group dedicated to the task.

But I can tell you, it’s by no means some kind of Rubik’s cube, but it is
no small lift.

Mark

On Wed, Sep 29, 2021 at 9:13 AM Mark Miller <ma...@gmail.com> wrote:

> I very much agree. That code is the root of a very surprising amount of
> evil and has been for a surprisingly long time.
>
> There is a long list of reasons that I won’t iterate of why I don’t see
> that as likely happening though - just starting with Ive brought it up to
> various people over a couple years and gotten pushback just at the top.
> Roughly, it’s on the scale of work and invasiveness, even with some
> incremental paths, that I don’t see the path or resources to seriously
> consider it myself. You can go back through jira history for quite a while
> before you find that kind of item not looking out of place.
>
> Mark
>
> On Wed, Sep 29, 2021 at 2:05 AM Andrzej Białecki <ab...@getopt.org> wrote:
>
>> +1 to start working towards using Curator, this is long overdue and
>> sooner or later we need to eat this frog - as you dig deeper and deeper it
>> turns out that many issues in Solr can be attributed to our home-grown ZK
>> code, there are maybe 2 people on the Solr team who understand what’s going
>> on there (and I’m certainly not one of them!). And the maintenance cost is
>> just too high over time.
>>
>> —
>>
>> Andrzej Białecki
>>
>> On 28 Sep 2021, at 21:31, Mark Miller <ma...@gmail.com> wrote:
>>
>> P.S. this is not actually the zookeeper design I would submit to any
>> competition :)
>>
>> I’ve gone different routes in addressing the zookeeper short fall. This
>> one is relatively easy, impactful and isolated for the right developer.
>>
>> Personally, with fewer scale and isolation limits, by the far the best
>> thing I’ve done is remove almost all of our zk recipes and custom stuff and
>> use Apache curator and replace our stuff as well as improve and expand on
>> things using their large stable of well behaving recipes. I don’t think raw
>> zookeeper is good for a project of more than a few people at most. But I
>> wouldn’t toss that out there, it’s a much larger undertaking, no one is
>> going to bite on that in passing.
>>
>> Mark
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
>>
>> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
I very much agree. That code is the root of a very surprising amount of
evil and has been for a surprisingly long time.

There is a long list of reasons, that I won’t iterate, of why I don’t see
that as likely happening though - just starting with: I’ve brought it up to
various people over a couple of years and gotten pushback just at the top.
Roughly, it’s on the scale of work and invasiveness, even with some
incremental paths, where I don’t see the path or resources to seriously
consider it myself. You can go back through jira history for quite a while
before you find that kind of item not looking out of place.

Mark

On Wed, Sep 29, 2021 at 2:05 AM Andrzej Białecki <ab...@getopt.org> wrote:

> +1 to start working towards using Curator, this is long overdue and sooner
> or later we need to eat this frog - as you dig deeper and deeper it turns
> out that many issues in Solr can be attributed to our home-grown ZK code,
> there are maybe 2 people on the Solr team who understand what’s going on
> there (and I’m certainly not one of them!). And the maintenance cost is
> just too high over time.
>
> —
>
> Andrzej Białecki
>
> On 28 Sep 2021, at 21:31, Mark Miller <ma...@gmail.com> wrote:
>
> P.S. this is not actually the zookeeper design I would submit to any
> competition :)
>
> I’ve gone different routes in addressing the zookeeper short fall. This
> one is relatively easy, impactful and isolated for the right developer.
>
> Personally, with fewer scale and isolation limits, by the far the best
> thing I’ve done is remove almost all of our zk recipes and custom stuff and
> use Apache curator and replace our stuff as well as improve and expand on
> things using their large stable of well behaving recipes. I don’t think raw
> zookeeper is good for a project of more than a few people at most. But I
> wouldn’t toss that out there, it’s a much larger undertaking, no one is
> going to bite on that in passing.
>
> Mark
> --
> - Mark
>
> http://about.me/markrmiller
>
>
> --
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Andrzej Białecki <ab...@getopt.org>.
+1 to start working towards using Curator, this is long overdue and sooner or later we need to eat this frog - as you dig deeper and deeper it turns out that many issues in Solr can be attributed to our home-grown ZK code, there are maybe 2 people on the Solr team who understand what’s going on there (and I’m certainly not one of them!). And the maintenance cost is just too high over time.

—

Andrzej Białecki

> On 28 Sep 2021, at 21:31, Mark Miller <ma...@gmail.com> wrote:
> 
> P.S. this is not actually the zookeeper design I would submit to any competition :)
> 
> I’ve gone different routes in addressing the zookeeper short fall. This one is relatively easy, impactful and isolated for the right developer. 
> 
> Personally, with fewer scale and isolation limits, by the far the best thing I’ve done is remove almost all of our zk recipes and custom stuff and use Apache curator and replace our stuff as well as improve and expand on things using their large stable of well behaving recipes. I don’t think raw zookeeper is good for a project of more than a few people at most. But I wouldn’t toss that out there, it’s a much larger undertaking, no one is going to bite on that in passing. 
> 
> Mark
> -- 
> - Mark
> 
> http://about.me/markrmiller <http://about.me/markrmiller>


Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
P.S. this is not actually the zookeeper design I would submit to any
competition :)

I’ve gone different routes in addressing the zookeeper shortfall. This one
is relatively easy, impactful, and isolated for the right developer.

Personally, with fewer scale and isolation limits, by far the best
thing I’ve done is remove almost all of our zk recipes and custom stuff and
use Apache Curator, replacing our stuff as well as improving and expanding on
things using its large stable of well-behaving recipes. I don’t think raw
zookeeper is good for a project of more than a few people at most. But I
wouldn’t toss that out there; it’s a much larger undertaking, and no one is
going to bite on that in passing. A taste of those recipes is sketched below.
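
A minimal sketch of the kind of Curator recipe being referred to (LeaderLatch),
assuming a started CuratorFramework client; the election path and participant
id are illustrative, not Solr's actual layout:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;

// Join an election at a path and block until elected; closing the latch
// relinquishes leadership and removes this participant's node.
void runElection(CuratorFramework client) throws Exception {
  LeaderLatch latch =
      new LeaderLatch(client, "/solr_example/col1/shard1/leader_elect", "core_node1");
  latch.start();                 // joins the election
  latch.await();                 // returns once this participant is the leader
  boolean isLeader = latch.hasLeadership();
  // ... do leader work, then latch.close() when stepping down or shutting down
}
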

Mark
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
I’m not taking it as a challenge, I’m just throwing a relatively tractable
change out there if someone is interested in better zk behavior. I wouldn’t
likely tackle it in any near term unless it was put into my queue for some
reason - I’ve got mostly unrelated stuff lined up.

Yeah, connection loss is almost always dealt with via retries, other than
some rarer cases such as around leader election. We don’t retry across
session expiration; we throw an exception. The overseer, and what it does
with zookeeper, is a major exception (or several).

The issue is just how those things are done. E.g., all the calls to zk
retry with no regard for zk connection state updates. So they are both
not timely and usually over-aggressive in a way that causes problems with
the zk client. We also make guesses about whether we have been expired or
not - because you don’t hit session expiration when it happens, you hit it
once you have reconnected again - and we don’t want to keep retrying for an
hour while disconnected. It all adds up to: if you index at a high rate or
a variety of complicated actions are going on, you are likely to get
exceptions around not being connected to zk, or other various bugs, when the
system should actually just be trucking through that with no problem. It can
also mess with various close and shutdown behaviors, as the outstanding
calls hang out retrying, waiting until some sort of expiration is
determined. A sketch of the event-driven alternative is below.
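
A minimal sketch of that event-driven alternative (parking retries on the
reconnect notification instead of sleeping in a loop); the class and method
names are illustrative, not the existing ZkCmdExecutor/ConnectionManager API:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ReconnectGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition reconnected = lock.newCondition();
  private volatile boolean connected = true;

  // Called from the ZK Watcher on a Disconnected event.
  void onDisconnect() { connected = false; }

  // Called from the ZK Watcher on a SyncConnected event.
  void onReconnect() {
    lock.lock();
    try {
      connected = true;
      reconnected.signalAll();   // release every call parked in awaitReconnect
    } finally {
      lock.unlock();
    }
  }

  // Replaces the sleep-and-retry loop: park until ZK says the connection is back.
  boolean awaitReconnect(long timeout, TimeUnit unit) throws InterruptedException {
    lock.lock();
    try {
      long nanos = unit.toNanos(timeout);
      while (!connected && nanos > 0) {
        nanos = reconnected.awaitNanos(nanos);
      }
      return connected;
    } finally {
      lock.unlock();
    }
  }
}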

There are a bunch of bugs in all of this. I would have trouble even trying
to isolate this change given aggressive testing would force me into dealing
with a lot more. But improving the behavior makes other issues more obvious
and behavior improvements in general start making further work easier.

Mark

On Tue, Sep 28, 2021 at 7:12 AM Ilan Ginzburg <il...@gmail.com> wrote:

> I did not mean to challenge the way you plan to refactor existing code to
> make it equivalent but simpler.
>
> My point was different, but I mixed up disconnections i.e. ZK connection
> loss (no big deal) and session expiration (big deal). So please ignore my
> previous message.
>
> The correct way to look at it as I understand it:
> Connection loss is a non event (sorry for the pun) and indeed can/should
> be retried. Doing this efficiently as you suggest is better.
> Session expiration is the real signal from ZK that the client
> "transaction" (sequence of actions that achieve some goal such as electing
> a leader among participants or grabbing a lock) got interrupted and must be
> completely restarted using a new session.
>
> On Tue, Sep 28, 2021 at 1:03 PM Mark Miller <ma...@gmail.com> wrote:
>
>> That’s why I say that ideally you should actually enter a quiet mode on
>> zk disconnect and come out of it on reconnect, this isn’t necessarily the
>> ideal.  I don’t think it assumes zk continuity - the retries are because
>> when you dc from a zk server, as long as all the servers are not dead, it
>> will automatically start trying to reconnect to another zk server - the
>> retries are just waiting for that to happen. They are just dumb retries,
>> which was silly, because zk tells you when it dcs and when it reconnects. I
>> don’t really see how it relates to a transaction - you just don’t want
>> things to fail when the zk client is failing over to a another server, in a
>> good impl, that should be mostly transparent to any user of the system. ZK
>> is designed to feed your app this info so that you can implement something
>> that makes these fail overs smooth. Sometimes it’s not even a failover it’s
>> just that overload or gc pauses have stalled zks heartbeat for too long.
>>
>> I did all that wrong though for sure - I did it before I really even
>> understood zookeeper that well. So the current stuff is just pretty poor.
>> What I describe it’s a relatively quick and simple way to make it much much
>> better. It would be much more invasive and a lot more work and effort to do
>> something too radically different. The end result would be similar, except
>> there really is no reason to sit on all the outstanding calls. It’s just
>> difficult to do anything that’s transparent to the user when zk is failing
>> over and that works with all the existing code.
>>
>> The end goal is not a transaction or anything though - it just to have
>> your app able to smoothly handle the zk client transitioning from one zk
>> server to another - or missing it’s heart beat due to load and then
>> connecting again. It’s certainly not currently that often a smooth event -
>> this just describes a way I have been able to make it smooth without having
>> to completely rewrite everything.
>>
>> Mark
>>
>> On Tue, Sep 28, 2021 at 2:06 AM Ilan Ginzburg <il...@gmail.com> wrote:
>>
>>> Should ZK disconnect be handled at the individual call level to begin
>>> with? Aren’t we implementing “recipes” (equivalent to “transactions” in a
>>> DB world) that combine multiple actions and that implicitly assume ZK
>>> continuity over the course of execution? It seems these should rather fail
>>> and retry as a whole rather than individual actions?
>>>
>>> I don’t have any existing examples in mind of where this is problematic
>>> in existing code (or it would already be a bug) but the existing single
>>> call level retry approach feels fragile.
>>>
>>> Ilan
>>>
>>> On Mon 27 Sep 2021 at 19:04, Mark Miller <ma...@gmail.com> wrote:
>>>
>>>> There are a variety of ways you could do it.
>>>>
>>>> The easiest short term change is to simply modify what handles most zk
>>>> retries - the ZkCmdExecutor - already plugged into SolrZkClient where it
>>>> retries. It tries to guess when a session times out and does fall back
>>>> retries up to that point.
>>>>
>>>> Because there can be any number of calls doing this, zk disconnects
>>>> tend to spiral the cluster down.
>>>>
>>>> It shouldn’t work like this. Everything in the system related to zk
>>>> should be event driven.
>>>>
>>>> So ZkCmdExecutor should not sleep and retry some number of times.
>>>>
>>>> It’s retry method should call something like
>>>> ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
>>>> notifies there is a reconnect, signallAll the lock. Or use a condition.
>>>> Same thing if the ConnectionManager is closed.
>>>>
>>>> It’s not as ideal as entering a quite mode, but it’s tremendously
>>>> simpler to do.
>>>>
>>>> Now when zk hits a dc, it doesn’t get repeatedly hit over and over up
>>>> until a expiration guess or past a ConnectionManager close.
>>>>
>>>> Pretty much everything gets held up, the system is forced into what is
>>>> essentially a quite state - though will all the outstanding calls hanging -
>>>> which gives zookeeper the ability to easily reconnect to a valid zk server
>>>> - in which case everything is released to retry and succeed.
>>>>
>>>> With this approach, (and removing the guess isExpired on
>>>> ConnectionManager and using its actual zk client state) you can actually
>>>> bombard and overload the system with updates - which currently will crush
>>>> the system - and instead you can survive the bombard without any updates
>>>> are disabled, zk is not connected fails. Unless your zk cluster is actually
>>>> catastrophically down.
>>>>
>>>> Mark
>>>>
>>>> On Sun, Sep 26, 2021 at 7:54 AM David Smiley <ds...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <ma...@gmail.com>
>>>>> wrote:
>>>>> ...
>>>>>
>>>>>> Zk alerts us when it losses a connection via callback. When the
>>>>>> connection is back, another callback. An unlimited number of locations
>>>>>> trying to work this out on there own is terrible zk. In an ideal world,
>>>>>> everything enters a zk quiete mode and re-engaged when zk says hello again.
>>>>>> A simpler shorter term improvement is to simply  sink all the zk calls when
>>>>>> they hit the zk connection manager and don’t let them go until the
>>>>>> connection is restored.
>>>>>>
>>>>>
>>>>> While I don't tend to work on this stuff, I want to understand the
>>>>> essence of your point.  Are you basically recommending that our ZK
>>>>> interactions should all go through one instance of a ZK connection manager
>>>>> class that can keep track of ZK's connection state?
>>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Ilan Ginzburg <il...@gmail.com>.
I did not mean to challenge the way you plan to refactor existing code to
make it equivalent but simpler.

My point was different, but I mixed up disconnections i.e. ZK connection
loss (no big deal) and session expiration (big deal). So please ignore my
previous message.

The correct way to look at it as I understand it:
Connection loss is a non-event (sorry for the pun) and indeed can/should be
retried. Doing this efficiently, as you suggest, is better.
Session expiration is the real signal from ZK that the client "transaction"
(sequence of actions that achieve some goal such as electing a leader among
participants or grabbing a lock) got interrupted and must be completely
restarted using a new session.
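
A small sketch of that distinction at the call site (illustrative only,
not existing Solr code): connection loss can be retried within the same
session, ideally by waiting for the reconnect callback, while session
expiration should propagate so the whole recipe is restarted on a fresh
session:

  import org.apache.zookeeper.KeeperException;

  final class ZkRetryPolicy {
    // ConnectionLoss is retriable within the same session; SessionExpired
    // means ephemeral nodes and watches are gone - start the recipe over
    // with a new ZooKeeper client.
    static boolean isRetriableWithinSession(KeeperException e) {
      if (e instanceof KeeperException.ConnectionLossException) {
        return true;
      }
      return false; // SessionExpired and anything else surfaces to the caller
    }
  }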

On Tue, Sep 28, 2021 at 1:03 PM Mark Miller <ma...@gmail.com> wrote:

> That’s why I say that ideally you should actually enter a quiet mode on zk
> disconnect and come out of it on reconnect, this isn’t necessarily the
> ideal.  I don’t think it assumes zk continuity - the retries are because
> when you dc from a zk server, as long as all the servers are not dead, it
> will automatically start trying to reconnect to another zk server - the
> retries are just waiting for that to happen. They are just dumb retries,
> which was silly, because zk tells you when it dcs and when it reconnects. I
> don’t really see how it relates to a transaction - you just don’t want
> things to fail when the zk client is failing over to a another server, in a
> good impl, that should be mostly transparent to any user of the system. ZK
> is designed to feed your app this info so that you can implement something
> that makes these fail overs smooth. Sometimes it’s not even a failover it’s
> just that overload or gc pauses have stalled zks heartbeat for too long.
>
> I did all that wrong though for sure - I did it before I really even
> understood zookeeper that well. So the current stuff is just pretty poor.
> What I describe it’s a relatively quick and simple way to make it much much
> better. It would be much more invasive and a lot more work and effort to do
> something too radically different. The end result would be similar, except
> there really is no reason to sit on all the outstanding calls. It’s just
> difficult to do anything that’s transparent to the user when zk is failing
> over and that works with all the existing code.
>
> The end goal is not a transaction or anything though - it just to have
> your app able to smoothly handle the zk client transitioning from one zk
> server to another - or missing it’s heart beat due to load and then
> connecting again. It’s certainly not currently that often a smooth event -
> this just describes a way I have been able to make it smooth without having
> to completely rewrite everything.
>
> Mark
>
> On Tue, Sep 28, 2021 at 2:06 AM Ilan Ginzburg <il...@gmail.com> wrote:
>
>> Should ZK disconnect be handled at the individual call level to begin
>> with? Aren’t we implementing “recipes” (equivalent to “transactions” in a
>> DB world) that combine multiple actions and that implicitly assume ZK
>> continuity over the course of execution? It seems these should rather fail
>> and retry as a whole rather than individual actions?
>>
>> I don’t have any existing examples in mind of where this is problematic
>> in existing code (or it would already be a bug) but the existing single
>> call level retry approach feels fragile.
>>
>> Ilan
>>
>> On Mon 27 Sep 2021 at 19:04, Mark Miller <ma...@gmail.com> wrote:
>>
>>> There are a variety of ways you could do it.
>>>
>>> The easiest short term change is to simply modify what handles most zk
>>> retries - the ZkCmdExecutor - already plugged into SolrZkClient where it
>>> retries. It tries to guess when a session times out and does fall back
>>> retries up to that point.
>>>
>>> Because there can be any number of calls doing this, zk disconnects tend
>>> to spiral the cluster down.
>>>
>>> It shouldn’t work like this. Everything in the system related to zk
>>> should be event driven.
>>>
>>> So ZkCmdExecutor should not sleep and retry some number of times.
>>>
>>> It’s retry method should call something like
>>> ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
>>> notifies there is a reconnect, signallAll the lock. Or use a condition.
>>> Same thing if the ConnectionManager is closed.
>>>
>>> It’s not as ideal as entering a quite mode, but it’s tremendously
>>> simpler to do.
>>>
>>> Now when zk hits a dc, it doesn’t get repeatedly hit over and over up
>>> until a expiration guess or past a ConnectionManager close.
>>>
>>> Pretty much everything gets held up, the system is forced into what is
>>> essentially a quite state - though will all the outstanding calls hanging -
>>> which gives zookeeper the ability to easily reconnect to a valid zk server
>>> - in which case everything is released to retry and succeed.
>>>
>>> With this approach, (and removing the guess isExpired on
>>> ConnectionManager and using its actual zk client state) you can actually
>>> bombard and overload the system with updates - which currently will crush
>>> the system - and instead you can survive the bombard without any updates
>>> are disabled, zk is not connected fails. Unless your zk cluster is actually
>>> catastrophically down.
>>>
>>> Mark
>>>
>>> On Sun, Sep 26, 2021 at 7:54 AM David Smiley <ds...@apache.org> wrote:
>>>
>>>>
>>>> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>> ...
>>>>
>>>>> Zk alerts us when it losses a connection via callback. When the
>>>>> connection is back, another callback. An unlimited number of locations
>>>>> trying to work this out on there own is terrible zk. In an ideal world,
>>>>> everything enters a zk quiete mode and re-engaged when zk says hello again.
>>>>> A simpler shorter term improvement is to simply  sink all the zk calls when
>>>>> they hit the zk connection manager and don’t let them go until the
>>>>> connection is restored.
>>>>>
>>>>
>>>> While I don't tend to work on this stuff, I want to understand the
>>>> essence of your point.  Are you basically recommending that our ZK
>>>> interactions should all go through one instance of a ZK connection manager
>>>> class that can keep track of ZK's connection state?
>>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
> - Mark
>
> http://about.me/markrmiller
>

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
That’s why I say that ideally you would actually enter a quiet mode on zk
disconnect and come out of it on reconnect - what I describe isn’t
necessarily the ideal.  I don’t think it assumes zk continuity - the
retries are there because when you disconnect from a zk server, as long as
all the servers are not dead, the client will automatically start trying
to reconnect to another zk server - the retries are just waiting for that
to happen. They are just dumb retries, which was silly, because zk tells
you when it disconnects and when it reconnects. I don’t really see how it
relates to a transaction - you just don’t want things to fail while the zk
client is failing over to another server; in a good impl, that should be
mostly transparent to any user of the system. ZK is designed to feed your
app this info so that you can implement something that makes these
failovers smooth. Sometimes it’s not even a failover - it’s just that
overload or gc pauses have stalled zk’s heartbeat for too long.
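
For reference, this is roughly what that info looks like on the client
side - a sketch (names are illustrative; this is not the actual Solr
ConnectionManager) of a Watcher that just tracks the KeeperState
transitions zk already reports:

  import java.util.concurrent.locks.Condition;
  import java.util.concurrent.locks.ReentrantLock;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;

  // Sketch of a connection-state tracker fed by zk's own callbacks.
  class ConnectionStateTracker implements Watcher {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition reconnected = lock.newCondition();
    private volatile boolean connected;

    @Override
    public void process(WatchedEvent event) {
      switch (event.getState()) {
        case SyncConnected:   // failover finished, or heartbeat recovered
          lock.lock();
          try {
            connected = true;
            reconnected.signalAll(); // release anyone waiting to retry
          } finally {
            lock.unlock();
          }
          break;
        case Disconnected:    // the zk client is failing over - go quiet
        case Expired:         // session gone - callers need a new session
          connected = false;
          break;
        default:
          break;
      }
    }

    boolean isConnected() {
      return connected;
    }
  }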

I did all that wrong though, for sure - I did it before I really even
understood zookeeper that well. So the current stuff is just pretty poor.
What I describe is a relatively quick and simple way to make it much, much
better. It would be much more invasive and a lot more work and effort to
do something too radically different. The end result would be similar,
except there really is no reason to sit on all the outstanding calls. It’s
just difficult to do anything that’s transparent to the user when zk is
failing over and that works with all the existing code.

The end goal is not a transaction or anything though - it’s just to have
your app able to smoothly handle the zk client transitioning from one zk
server to another - or missing its heartbeat due to load and then
connecting again. It’s certainly not often a smooth event currently - this
just describes a way I have been able to make it smooth without having to
completely rewrite everything.

Mark

On Tue, Sep 28, 2021 at 2:06 AM Ilan Ginzburg <il...@gmail.com> wrote:

> Should ZK disconnect be handled at the individual call level to begin
> with? Aren’t we implementing “recipes” (equivalent to “transactions” in a
> DB world) that combine multiple actions and that implicitly assume ZK
> continuity over the course of execution? It seems these should rather fail
> and retry as a whole rather than individual actions?
>
> I don’t have any existing examples in mind of where this is problematic in
> existing code (or it would already be a bug) but the existing single call
> level retry approach feels fragile.
>
> Ilan
>
> On Mon 27 Sep 2021 at 19:04, Mark Miller <ma...@gmail.com> wrote:
>
>> There are a variety of ways you could do it.
>>
>> The easiest short term change is to simply modify what handles most zk
>> retries - the ZkCmdExecutor - already plugged into SolrZkClient where it
>> retries. It tries to guess when a session times out and does fall back
>> retries up to that point.
>>
>> Because there can be any number of calls doing this, zk disconnects tend
>> to spiral the cluster down.
>>
>> It shouldn’t work like this. Everything in the system related to zk
>> should be event driven.
>>
>> So ZkCmdExecutor should not sleep and retry some number of times.
>>
>> It’s retry method should call something like
>> ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
>> notifies there is a reconnect, signallAll the lock. Or use a condition.
>> Same thing if the ConnectionManager is closed.
>>
>> It’s not as ideal as entering a quite mode, but it’s tremendously simpler
>> to do.
>>
>> Now when zk hits a dc, it doesn’t get repeatedly hit over and over up
>> until a expiration guess or past a ConnectionManager close.
>>
>> Pretty much everything gets held up, the system is forced into what is
>> essentially a quite state - though will all the outstanding calls hanging -
>> which gives zookeeper the ability to easily reconnect to a valid zk server
>> - in which case everything is released to retry and succeed.
>>
>> With this approach, (and removing the guess isExpired on
>> ConnectionManager and using its actual zk client state) you can actually
>> bombard and overload the system with updates - which currently will crush
>> the system - and instead you can survive the bombard without any updates
>> are disabled, zk is not connected fails. Unless your zk cluster is actually
>> catastrophically down.
>>
>> Mark
>>
>> On Sun, Sep 26, 2021 at 7:54 AM David Smiley <ds...@apache.org> wrote:
>>
>>>
>>> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <ma...@gmail.com>
>>> wrote:
>>> ...
>>>
>>>> Zk alerts us when it losses a connection via callback. When the
>>>> connection is back, another callback. An unlimited number of locations
>>>> trying to work this out on there own is terrible zk. In an ideal world,
>>>> everything enters a zk quiete mode and re-engaged when zk says hello again.
>>>> A simpler shorter term improvement is to simply  sink all the zk calls when
>>>> they hit the zk connection manager and don’t let them go until the
>>>> connection is restored.
>>>>
>>>
>>> While I don't tend to work on this stuff, I want to understand the
>>> essence of your point.  Are you basically recommending that our ZK
>>> interactions should all go through one instance of a ZK connection manager
>>> class that can keep track of ZK's connection state?
>>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by Ilan Ginzburg <il...@gmail.com>.
Should ZK disconnect be handled at the individual call level to begin with?
Aren’t we implementing “recipes” (equivalent to “transactions” in a DB
world) that combine multiple actions and that implicitly assume ZK
continuity over the course of execution? It seems these should fail and
retry as a whole rather than as individual actions?

I don’t have any existing examples in mind of where this is problematic in
existing code (or it would already be a bug) but the existing single call
level retry approach feels fragile.

Ilan

On Mon 27 Sep 2021 at 19:04, Mark Miller <ma...@gmail.com> wrote:

> There are a variety of ways you could do it.
>
> The easiest short term change is to simply modify what handles most zk
> retries - the ZkCmdExecutor - already plugged into SolrZkClient where it
> retries. It tries to guess when a session times out and does fall back
> retries up to that point.
>
> Because there can be any number of calls doing this, zk disconnects tend
> to spiral the cluster down.
>
> It shouldn’t work like this. Everything in the system related to zk should
> be event driven.
>
> So ZkCmdExecutor should not sleep and retry some number of times.
>
> It’s retry method should call something like
> ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
> notifies there is a reconnect, signallAll the lock. Or use a condition.
> Same thing if the ConnectionManager is closed.
>
> It’s not as ideal as entering a quite mode, but it’s tremendously simpler
> to do.
>
> Now when zk hits a dc, it doesn’t get repeatedly hit over and over up
> until a expiration guess or past a ConnectionManager close.
>
> Pretty much everything gets held up, the system is forced into what is
> essentially a quite state - though will all the outstanding calls hanging -
> which gives zookeeper the ability to easily reconnect to a valid zk server
> - in which case everything is released to retry and succeed.
>
> With this approach, (and removing the guess isExpired on ConnectionManager
> and using its actual zk client state) you can actually bombard and overload
> the system with updates - which currently will crush the system - and
> instead you can survive the bombard without any updates are disabled, zk is
> not connected fails. Unless your zk cluster is actually catastrophically
> down.
>
> Mark
>
> On Sun, Sep 26, 2021 at 7:54 AM David Smiley <ds...@apache.org> wrote:
>
>>
>> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <ma...@gmail.com>
>> wrote:
>> ...
>>
>>> Zk alerts us when it losses a connection via callback. When the
>>> connection is back, another callback. An unlimited number of locations
>>> trying to work this out on there own is terrible zk. In an ideal world,
>>> everything enters a zk quiete mode and re-engaged when zk says hello again.
>>> A simpler shorter term improvement is to simply  sink all the zk calls when
>>> they hit the zk connection manager and don’t let them go until the
>>> connection is restored.
>>>
>>
>> While I don't tend to work on this stuff, I want to understand the
>> essence of your point.  Are you basically recommending that our ZK
>> interactions should all go through one instance of a ZK connection manager
>> class that can keep track of ZK's connection state?
>>
> --
> - Mark
>
> http://about.me/markrmiller
>

Re: ZkCmdExecutor

Posted by Mark Miller <ma...@gmail.com>.
There are a variety of ways you could do it.

The easiest short-term change is to simply modify what handles most zk
retries - the ZkCmdExecutor - which is already plugged into SolrZkClient
where it retries. It tries to guess when the session will time out and
does fall-off retries up to that point.

Because there can be any number of calls doing this, zk disconnects tend to
spiral the cluster down.

It shouldn’t work like this. Everything in the system related to zk should
be event driven.

So ZkCmdExecutor should not sleep and retry some number of times.

Its retry method should call something like
ConnectionManager#waitForReconnect. Make that a wait on a lock. When zk
notifies that there is a reconnect, signalAll on the lock, or use a
Condition. Same thing if the ConnectionManager is closed.
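
A minimal sketch of that shape (waitForReconnect is hypothetical - it does
not exist today - and none of this is the actual ZkCmdExecutor or
ConnectionManager code):

  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.locks.Condition;
  import java.util.concurrent.locks.ReentrantLock;

  // The zk Watcher calls onConnected/onDisconnected; close() calls onClose.
  class ReconnectGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private volatile boolean connected = true;
    private volatile boolean closed;

    void onConnected()    { signal(true, closed); }
    void onDisconnected() { signal(false, closed); }
    void onClose()        { signal(connected, true); }

    private void signal(boolean nowConnected, boolean nowClosed) {
      lock.lock();
      try {
        connected = nowConnected;
        closed = nowClosed;
        stateChanged.signalAll();
      } finally {
        lock.unlock();
      }
    }

    // What ZkCmdExecutor's retry delay could become: block until zk
    // reconnects or the ConnectionManager is closed, instead of sleeping
    // for a guessed amount of time.
    void waitForReconnect(long maxWaitMs) throws InterruptedException {
      lock.lock();
      try {
        long nanos = TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
        while (!connected && !closed && nanos > 0) {
          nanos = stateChanged.awaitNanos(nanos);
        }
      } finally {
        lock.unlock();
      }
    }
  }

The retry loop would then call waitForReconnect in place of Thread.sleep
whenever it catches a connection loss.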

It’s not as ideal as entering a quiet mode, but it’s tremendously simpler
to do.

Now when zk hits a disconnect, it doesn’t get repeatedly hammered over and
over until an expiration guess, or past a ConnectionManager close.

Pretty much everything gets held up, and the system is forced into what is
essentially a quiet state - though with all the outstanding calls hanging -
which gives zookeeper the ability to easily reconnect to a valid zk server
- in which case everything is released to retry and succeed.

With this approach (and removing the isExpired guess on ConnectionManager
and using the actual zk client state), you can actually bombard and
overload the system with updates - which currently will crush the system -
and instead survive the bombardment without any “updates are disabled” or
“zk is not connected” failures. Unless your zk cluster is actually
catastrophically down.
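
For that last piece, the live client state is already queryable - e.g.
(illustrative only; the definitive expiration signal still arrives via the
Expired watcher event):

  import org.apache.zookeeper.ZooKeeper;

  final class ZkStateCheck {
    // Ask the zk client directly rather than maintaining an isExpired guess.
    static boolean isLikelyUsable(ZooKeeper zk) {
      ZooKeeper.States state = zk.getState();
      return state.isAlive() && state.isConnected();
    }
  }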

Mark

On Sun, Sep 26, 2021 at 7:54 AM David Smiley <ds...@apache.org> wrote:

>
> On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <ma...@gmail.com> wrote:
> ...
>
>> Zk alerts us when it losses a connection via callback. When the
>> connection is back, another callback. An unlimited number of locations
>> trying to work this out on there own is terrible zk. In an ideal world,
>> everything enters a zk quiete mode and re-engaged when zk says hello again.
>> A simpler shorter term improvement is to simply  sink all the zk calls when
>> they hit the zk connection manager and don’t let them go until the
>> connection is restored.
>>
>
> While I don't tend to work on this stuff, I want to understand the essence
> of your point.  Are you basically recommending that our ZK interactions
> should all go through one instance of a ZK connection manager class that
> can keep track of ZK's connection state?
>
-- 
- Mark

http://about.me/markrmiller

Re: ZkCmdExecutor

Posted by David Smiley <ds...@apache.org>.
On Wed, Sep 22, 2021 at 9:06 PM Mark Miller <ma...@gmail.com> wrote:
...

> Zk alerts us when it losses a connection via callback. When the connection
> is back, another callback. An unlimited number of locations trying to work
> this out on there own is terrible zk. In an ideal world, everything enters
> a zk quiete mode and re-engaged when zk says hello again. A simpler shorter
> term improvement is to simply  sink all the zk calls when they hit the zk
> connection manager and don’t let them go until the connection is restored.
>

While I don't tend to work on this stuff, I want to understand the essence
of your point.  Are you basically recommending that our ZK interactions
should all go through one instance of a ZK connection manager class that
can keep track of ZK's connection state?