Posted to dev@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2019/11/02 21:44:22 UTC

Re: SolrCloud is sick.

Give me a short bit to follow up and I will lay out my case and proposal.

Everyone is then free to decide that we need to do something drastic or
that I'm wrong and we should just continue down the same road. If that's
the case, a lot of your work will get a lot easier and less impeded by me
and we will still all be happier. Win win.

If we can just not make drastic changes for just a brief week or so,
I'll say what I have to say, and you guys can judge and do whatever you
please.

- mark

On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com> wrote:

> Hey all Solr devs,
>
> SolrCloud is sick right now. The way low-level ZooKeeper is handled, the
> Overseer, is a mix and mess of proper exception handling and super slow
> startup and shutdown, adding new things all the time with no concern for
> performance or proper ordering (which is harder to tell than you think).
>
> Our class dependency graph doesn't even work - we just force it. Sort of.
> If the whole system doesn't block and choke its way to a start slowly
> enough, lots of things fail.
>
> This thing coughs up, you toss stuff into the storm, and a good chunk of the
> time, what you want eventually comes back without causing too much damage.
>
> There are so many things that are off or just plain wrong and the list is
> growing and growing. No one is following this or if you are, please back me
> up. This thing will collapse under its own weight.
>
> So if you want to add yet another state format cluster state or some other
> optimization on this junk heap, you can expect me to push back.
>
> We should all be embarrassed by the state of things.
>
> I've got some ideas for addressing them that I'll share soon, but god,
> don't keep optimizing a turd in non-backcompat, Overseer-loving ways. That
> Overseer is an atrocity.
>
> --
> - Mark
>
> http://about.me/markrmiller
>


-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

I am not personally attacking anyone.

Everyone here does good work in one way or another.

However, if you cross a bar of prolific vs attention to detail, tests, doc,
and things beyond just you, I do hold you to a higher standard.

You are super prolific, and your care for others following in your footsteps,
or full consideration of things, is lacking. I see nothing personal about it,
it’s a pure code and work observation.

You have good ideas, you can spin code, that’s awesome, but you’re often
using that power in a way that clears a lot of ground without a lot of
replacement tree planting.

I’ve said this 50 times, I’ve said this to the PMC more than once, it’s
nothing personal. Personally you are a great human and enjoyable to
interact with. I always enjoy you, my wife still tells me she enjoys you.

I’ve got problems with others’ work as well, but no one else seems to have
the same disregard for the other devs while working.

I wish Yonik would document more for mere mortals. I wish AB would look
more at the costs he adds to startup and shutdown. I wish, I wish... I don’t
care as long as you have a cooperative attitude.

And I can help with the areas where people are lacking. It’s no free lunch,
it’s going to be a pain in the ass, and we can talk about that when I have my
proposal planned.

There is tons of stuff you can help the project with; prolific people are
valuable. But we need to figure out a better way to operate and we need to
start from more stable ground.

I’m not tossing some design on you. We have a design. It’s fine. Our impl
and foundation are bad, they have always been bad, we are going backward
faster than forward.

Most of the worst and basic code is mine.

We all write bad code. If you are open to owning that your efforts could be
improved in various areas - maybe even just to please the others we work
with, that’s enough for me. It’s fine then. But an obstinate disregard for
the fact that we all share this project ... I can’t work well with it.

Anyway, I’ve got stuff for us to build with. I’d rather you spend your time
making things fast that we care about and not ghosts of the past.

I know everyone has wanted that. I’m not here saying you guys fucked up.
I’m at the head of that list. It’s our project.

Mark

On Sat, Nov 2, 2019 at 6:32 PM Noble Paul <no...@gmail.com> wrote:

> Hi,
>
> I believe there is a consensus on what is wrong with the way we have built
> the cluster state and overseer. We need to focus a bit more on the design
> aspect. Design, according to me, has the following elements:
>
> * How does it work?
>
> * What are the performance characteristics? Can it be done more
> efficiently?
>
> * What are the public touch points?
>
> ** Which are the files we store in ZK? Are they expected to be watched
> always?
>
> ** Or are they read on demand?
>
> ** The public APIs. Does it make sense to the user? Can it be further
> simplified? How does it compare to the other APIs in the system?
>
>
> We, as a community, do a bad job in dealing with these. While we focus on
> internal things, these are not discussed before it is too late. We usually
> do coding, tests, code review (sometimes) and commit. This leads to huge
> technical debt.
>
>
> This is not to put blame on one person or a group of people. (I
> occasionally see people discussing design issues upfront, I just hope that
> is the norm.)
>
>
> Now, why am I discussing this in this thread?
>
>
> While we agree there are problems, we are trying to solve the problem
> using the same process we used to create these problems. Again, I'm not
> questioning the intent or competence of anyone. Unless we set the process
> right, we are doomed to make the same mistakes again.
>
>
> I wholeheartedly endorse any effort to improve SolrCloud/overseer. At the
> same time I fail to see us leveraging the collective experience of our
> community through meaningful discussion.
>
>
> I hope we don't resort to personal attacks and use this as an opportunity
> to improve our processes.
> Thanks
>
> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dr...@gmail.com> wrote:
>
>> Very much agreed.  I've been trying to figure out for a long time what is
>> the point in having a replica DOWN state that has to be toggled (DOWN and
>> then UP!) every time a node restarts.  Considering that we could just
>> combine ACTIVE and `live_nodes` to understand whether a replica is
>> available.  It's not even foolproof since kill -9 on a solr node won't mark
>> all the replicas DOWN-- that doesn't happen until the node comes back up
>> (perversely).
>>
>> What would it take to get to a state where restarting a node would
>> require a minimal amount of ZK work in most cases?
>>
-- 
- Mark

http://about.me/markrmiller

Re: Missing top level javadocs

Posted by David Smiley <da...@gmail.com>.

+1 fantastic!

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Nov 4, 2019 at 10:45 AM Andrzej Białecki <ab...@getopt.org> wrote:

> +1, I think it’s an excellent idea. The check should also verify that the
> comment not only exists but also that it’s not empty - eg. there’s an
> IntelliJ template that creates an empty top-level javadoc.

Re: Missing top level javadocs

Posted by Andrzej Białecki <ab...@getopt.org>.

+1, I think it’s an excellent idea. The check should also verify that the comment not only exists but also that it’s not empty - eg. there’s an IntelliJ template that creates an empty top-level javadoc.
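
To make the "exists but is not empty" distinction concrete, here is a rough sketch of such a check written as a standalone Python script rather than as a checkstyle rule. Everything in it is an assumption for illustration: the function name `top_level_javadoc_status` is made up, the regular expressions are a heuristic (a real check would use a Java parser), and a Javadoc that contains only tags such as `@author` is counted as non-empty.

```python
import re

# First top-level type declaration: optional modifiers followed by
# class/interface/enum, starting in column 0 (nested types are indented).
_TYPE_DECL = re.compile(
    r"^(?:(?:public|final|abstract)\s+)*(?:class|interface|enum)\s+\w+",
    re.MULTILINE,
)
# Annotations (e.g. @Deprecated) that may sit between the Javadoc and the type.
_TRAILING_ANNOTATIONS = re.compile(r"(?:\s*@\w+(?:\([^)]*\))?)*\s*\Z")


def top_level_javadoc_status(java_source: str) -> str:
    """Return 'missing', 'empty', or 'ok' for the first top-level type."""
    decl = _TYPE_DECL.search(java_source)
    if decl is None:
        return "missing"
    # Text before the declaration, with trailing annotations/whitespace removed.
    before = _TRAILING_ANNOTATIONS.sub("", java_source[: decl.start()])
    if not before.endswith("*/"):
        return "missing"
    comment = before[before.rfind("/*"):]
    if not comment.startswith("/**"):
        return "missing"  # a plain block comment is not Javadoc
    # Drop the delimiters and the leading '*' of each line, then see what is left.
    body = comment[3:-2]
    text = " ".join(line.strip().lstrip("*").strip() for line in body.splitlines())
    return "ok" if text.strip() else "empty"
```

With this heuristic, the empty comment produced by an IDE template (`/**`, ` *`, ` */`) is reported as `empty` rather than `ok`, which is the case the check would need to catch.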

> On 4 Nov 2019, at 16:40, Bram Van Dam <br...@intix.eu> wrote:
> 
> David Smiley mentioned this in the "SolrCloud is sick" thread. Instead
> of hijacking that, I figured I'd start another thread.
> 
> On 03/11/2019 05:32, David Smiley wrote:
>> <snip> requiring javadocs on all top level classes.  I think more javadocs and
>> code comments would be very helpful -- especially for the major
>> classes.
> 
> This sounds like something that's actionable.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Missing top level javadocs

Posted by Bram Van Dam <br...@intix.eu>.

David Smiley mentioned this in the "SolrCloud is sick" thread. Instead
of hijacking that, I figured I'd start another thread.

On 03/11/2019 05:32, David Smiley wrote:
> <snip> requiring javadocs on all top level classes.  I think more javadocs and
> code comments would be very helpful -- especially for the major
> classes.

This sounds like something that's actionable.

I'm not sure if there are any guidelines regarding documentation on the
Solr project, but on my team there's a rule that says all classes must
have a top-level javadoc that explains the "why" of the class. "Why does
it exist/what's it for?"

Excluding contrib, solrj and tests, there are some 400 source files with
classes with missing top level Javadoc. This includes some files with
undocumented nested "public static" classes -- couldn't find an obvious
way to exclude those using checkstyle.

Here's a "top ten most frequently modified files with missing Javadoc"
below. This is an arbitrary metric, the "most referenced classes" might
be more useful, but that was harder to hack together with shell foo.

solr/core/src/java/org/apache/solr/core/CoreContainer.java
solr/core/src/java/org/apache/solr/handler/admin/CollectionsHandler.java
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java
solr/core/src/java/org/apache/solr/handler/StreamHandler.java
solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
solr/core/src/java/org/apache/solr/handler/component/RealTimeGetComponent.java
solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java
solr/core/src/java/org/apache/solr/search/JoinQParserPlugin.java
solr/core/src/java/org/apache/solr/handler/component/HttpShardHandlerFactory.java
solr/core/src/java/org/apache/solr/handler/SolrConfigHandler.java

If there's any interest in this, I could write a patch to include
something like this in the build (ant or gradle, whatever).

 - Bram

Following checkstyle configuration detects classes with missing Javadoc:

check.xml:
==========

<!DOCTYPE module PUBLIC
  "-//Checkstyle//DTD Checkstyle Configuration 1.3//EN"
  "https://checkstyle.org/dtds/configuration_1_3.dtd">

<module name="Checker">
	<module name="TreeWalker">
		<module name="MissingJavadocType"/>
	</module>
</module>

Bit of shell foo to list offending files:
=========================================

java -jar checkstyle-8.26-all.jar -c check.xml solr/ | cut -d ' ' -f 2 |
  sed "s:.*/lucene-solr/::g" | cut -d ':' -f 1 | sort | uniq
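
For anyone who would rather not rely on cut/sed, the same post-processing can be sketched in Python. This is only an illustration under an assumption that is not stated in this thread: that each violation line in the report looks like `[SEVERITY] <path>:<line>:...`. The function name `offending_files` and the `checkout_marker` parameter are made up for the example.

```python
import re

# Assumed shape of a violation line: "[SEVERITY] <path>.java:<line>..."
_VIOLATION = re.compile(r"^\[\w+\]\s+(?P<path>.+?\.java):\d+")


def offending_files(report_lines, checkout_marker="/lucene-solr/"):
    """Return the sorted, de-duplicated source files named in a report."""
    files = set()
    for line in report_lines:
        match = _VIOLATION.match(line)
        if match is None:
            continue  # not a violation line (e.g. a banner line), skip it
        # Keep only the part after the checkout directory, like the sed step.
        # rpartition returns the whole path when the marker is absent.
        files.add(match.group("path").rpartition(checkout_marker)[2])
    return sorted(files)
```

Feeding it the report is then a matter of passing the lines of the checkstyle output, for example `offending_files(open("report.txt"))`, where `report.txt` is a hypothetical file holding the saved output.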


Re: SolrCloud is sick.

Posted by Jan Høydahl <ja...@cominvent.com>.

Martin, please don’t hijack a subject but start a new mail thread for a new topic.

Jan Høydahl

> On 4 Nov 2019, at 01:00, Martin Gainty <mg...@hotmail.com> wrote:
>
> Here is a bug I cannot shake when building lucene/site

Re: SolrCloud is sick.

Posted by Martin Gainty <mg...@hotmail.com>.

Here is a bug I cannot shake when building lucene/site.

Inside lucene/src/main/xml/ENTITY_TermQuery.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TermQuery [
<!ENTITY internalTerm "sumitomo">
<!ENTITY externalTerm SYSTEM "http://www.bar.xyz/external">
<!ENTITY % myParameterEntity "http://www.bar.xyz/param">
....

using ant build.xml:
    <!--
      The XSL input file is ignored completely, but XSL expects one to be given,
      so we pass ourself (${ant.file}) here. The list of module build.xmls is given
      via string parameter, that must be splitted by the XSL at '|'.
    -->
    <xslt in="${ant.file}" out="${javadoc.dir}/index.html" style="site/xsl/index.xsl" force="true">
      <outputproperty name="method" value="html"/>
      <outputproperty name="version" value="4.0"/>
      <outputproperty name="encoding" value="UTF-8"/>
      <outputproperty name="indent" value="yes"/>
      <param name="buildfiles" expression="${process-webpages.buildfiles}"/>
      <param name="version" expression="${version}"/>
      <param name="defaultCodec" expression="${defaultCodec}"/>
    </xslt>

OR maven pom.xml
  <plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>xml-maven-plugin</artifactId>
    <version>1.0.1</version>
    <executions>
      <execution>
        <id>validate</id>
        <phase>initialize</phase>
        <goals>
          <goal>transform</goal>
        </goals>
        <configuration>
          <forceCreation>true</forceCreation>
          <skip>false</skip>
          <outputDirectory>${project.build.directory}/target</outputDirectory>
          <transformationSets>
            <transformationSet>
              <dir>src/main/xml</dir>
              <stylesheet>C:/Maven-plugin/lucene-solr/lucene/site/xsl/index.xsl</stylesheet>
              <parameters>
                <parameter>
                  <name>MyParam</name>
                  <value>true</value>
                </parameter>
              </parameters>
            </transformationSet>
          </transformationSets>
        </configuration>
      </execution>
    </executions>
    <dependencies>
      <dependency>
        <groupId>net.sf.saxon</groupId>
        <artifactId>Saxon-HE</artifactId>
        <version>9.9.1-1</version>
      </dependency>
    </dependencies>
  </plugin>

With either build executing the XSLT, I get the same error:

[ERROR] Failed to execute goal org.codehaus.mojo:xml-maven-plugin:1.0.1:transform (validate) on project analysis: Failed to transform input file lucene/src/main/xml/ENTITY_TermQuery.xml: I/O error reported by XML parser processing file://lucene/src/main/xml/ENTITY_TermQuery.xml: www.bar.xyz:
Unknown host www.bar.xyz
]>

Apparently the www.bar.xyz host is supposed to be a placeholder,
but for the life of me I cannot see where the www.bar.xyz placeholder is replaced by a valid URL.

(I haven't used DTDs in at least 10 years and I am way out of my element when trying to resolve this.)
Any suggestions?
Martin
________________________________
From: David Smiley <da...@gmail.com>
Sent: Sunday, November 3, 2019 12:32 AM
To: Solr/Lucene Dev <de...@lucene.apache.org>
Cc: Mark Miller <ma...@gmail.com>
Subject: Re: SolrCloud is sick.

Yeah, we do a bad job of the things you listed, Noble.  :-(   My colleagues want pointers to internal docs but the sad reality is there aren't any.  You may notice I'm a stickler in my code reviews for requiring javadocs on all top level classes.  I think more javadocs and code comments would be very helpful -- especially for the major classes.  This might help us all and others a lot more.  For example, I think Lucene does a rather fine job of this for its major classes -- IndexWriter being a good example.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <no...@gmail.com> wrote:
Hi,

I believe there is a consensus on what is wrong with the way we have built the cluster state and overseer. We need to focus a bit more on the design aspect. Design, according to me, has the following elements:

* How does it work?

* What are the performance characteristics? Can it be done more efficiently?

* What are the public touch points?

** Which are the files we store in ZK? Are they expected to be watched always?

** Or are they read on demand?

** The public APIs. Does it make sense to the user? Can it be further simplified? How does it compare to the other APIs in the system?


We, as a community, do a bad job in dealing with these. While we focus on internal things, these are not discussed before it is too late. We usually do coding, tests, code review (sometimes) and commit. This leads to huge technical debt.


This is not to put blame on one person or a group of people. (I occasionally see people discussing design issues upfront, I just hope that is the norm.)


Now, why am I discussing this in this thread?


While we agree there are problems, we are trying to solve the problem using the same process we used to create these problems. Again, I'm not questioning the intent or competence of anyone. Unless we set the process right, we are doomed to make the same mistakes again.


I wholeheartedly endorse any effort to improve SolrCloud/overseer. At the same time, I fail to see us leveraging the collective experience of our community through meaningful discussion.


I hope we don't resort to personal attacks and use this as an opportunity to improve our processes.
Thanks

On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dr...@gmail.com> wrote:
Very much agreed.  I've been trying to figure out for a long time what the point is of having a replica DOWN state that has to be toggled (DOWN and then UP!) every time a node restarts, considering that we could just combine ACTIVE and `live_nodes` to understand whether a replica is available.  It's not even foolproof, since kill -9 on a Solr node won't mark all the replicas DOWN -- that doesn't happen until the node comes back up (perversely).

What would it take to get to a state where restarting a node would require a minimal amount of ZK work in most cases?
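
The idea of combining ACTIVE with `live_nodes` can be sketched roughly as follows. This is a simplified, hypothetical model and not Solr's actual classes or API: the class, enum, and method names are invented for illustration, and it assumes that a node's entry under `live_nodes` disappears once its ZooKeeper session ends.

```java
import java.util.Set;

/** Hypothetical, simplified model of the idea; these are not Solr's real classes. */
final class ReplicaAvailability {

    /** Last state a replica published to cluster state (illustrative subset). */
    enum State { ACTIVE, RECOVERING, DOWN }

    /**
     * A replica counts as available only if it last published ACTIVE and its
     * node is currently listed under live_nodes. Under the stated assumption,
     * a node killed with kill -9 drops out of liveNodes, so its replicas stop
     * counting as available without anyone rewriting their state to DOWN.
     */
    static boolean isAvailable(State published, String nodeName, Set<String> liveNodes) {
        return published == State.ACTIVE && liveNodes.contains(nodeName);
    }

    public static void main(String[] args) {
        Set<String> liveNodes = Set.of("host1:8983_solr");
        // ACTIVE replica on a live node.
        System.out.println(isAvailable(State.ACTIVE, "host1:8983_solr", liveNodes));
        // ACTIVE replica whose node is missing from live_nodes.
        System.out.println(isAvailable(State.ACTIVE, "host2:8983_solr", liveNodes));
    }
}
```

With a check like this, a restart would not need to write DOWN and then ACTIVE for every replica; the published state would only change when the replica itself changes, for example when it enters recovery.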

On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com> wrote:
Give me a short bit to follow up and I will lay out my case and proposal.

Everyone is then free to decide that we need to do something drastic or that I'm wrong and we should just continue down the same road. If that's the case, a lot of your work will get a lot easier and less impeded by me and we will still all be happier. Win win.

If we can just not make drastic changes for a brief window of a week or so, I'll say what I have to say, and you guys can judge and do whatever you please.

- mark

On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com> wrote:
Hey all Solr devs,

SolrCloud is sick right now. The way low-level ZooKeeper is handled, the Overseer, is a mix and mess of proper exception handling and super slow startup and shutdown, adding new things all the time with no concern for performance or proper ordering (which is harder to tell than you think).

Our class dependency graph doesn't even work - we just force it. Sort of. If the whole system doesn't block and choke its way to a start slowly enough, lots of things fail.

This thing coughs up: you toss stuff into the storm and, a good chunk of the time, what you want eventually comes back without causing too much damage.

There are so many things that are off or just plain wrong, and the list is growing and growing. No one is following this, or if you are, please back me up. This thing will collapse under its own weight.

So if you want to add yet another state format cluster state or some other optimization on this junk heap, you can expect me to push back.

We should all be embarrassed by the state of things.

I've got some ideas for addressing them that I'll share soon, but god, don't keep optimizing a turd in non-backcompat, Overseer-loving ways. That Overseer is an atrocity.

--
- Mark

http://about.me/markrmiller


--
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

And the bummer is, in the midst of this madness people are doing good work.
Good cleanups. Good improvements. Good features. Good code. And it’s all
basically wasted. It hurts my mind.

Mark

On Sun, Nov 3, 2019 at 7:58 AM Mark Miller <ma...@gmail.com> wrote:

> From a credentials standpoint:
>
> Yonik and I built 90% of it originally and then I spent years on it with
> few other devs or users.
>
> Pretty sure I'm the only one that has ever had 95%+ of the Solr test
> suites work in under 10-15 seconds consistently - 4000 tests across like
> 1000 suites. Got them all to run in parallel in under 5 minutes vs the
> 20-45 it takes on a good day after tons of other speed ups and fixes I've
> already done.
>
> I doubt there are many insane enough that have deep dived and pushed
> around the entire code base for 2-3 weeks multiple times, 16-20 hours a
> day. That has spent the last decade beyond that stupid time almost
> exclusively on this system. Designing it with Yonik, building it, fixing
> it, helping people with it, monitoring it, responding to pages and
> escalations for it.
>
> That has spent half a year replacing the entire decade-old build for
> Lucene and Solr and all its various nooks and crannies.
>
> If anyone has spent more time on this system or pushing it around on a
> large scale or has seen it in 100x the shape it is now more than once,
> please speak up, you are in charge, I follow you. I'm not that bright, if
> you have done the ground work, default to you.
>
> Otherwise, I don't even have much confidence anyone else even knows this
> system remotely well. All that time and effort and the most I know of it is
> what awful awful shape it's in and the bad trend direction.
>
>
> - Mark
>
> On Sun, Nov 3, 2019 at 7:35 AM Mark Miller <ma...@gmail.com> wrote:
>
>> Personally, I believe the latter so strongly, if I can’t convince the
>> others in the raft with me, I’m jumping in and swimming to another raft
>> after my entire adult life here.
>>
>> Mark
>>
>> On Sun, Nov 3, 2019 at 7:30 AM Mark Miller <ma...@gmail.com> wrote:
>>
>>> In fact this will be a fundamental difference some of us are about to
>>> split between.
>>>
>>> Those that think they can ever fix the tests or the system or the 1000s
>>> of bugs we have and keep adding due to our current world view of
>>> making tests fit the system not the system fit the tests and the fact that
>>> everything is so slow and retry and workaround that stupid shit works all
>>> over. It's all deep. It's ingrained. It has grown over for a decade. It's a
>>> project of 60 modules.
>>>
>>> Soon we will split between those that think they are making progress
>>> across the ocean and those that think we are sitting in shark infested
>>> waters waiting to die actually, starting to float backwards sometimes now.
>>>
>>> - Mark
>>>
>>> On Sun, Nov 3, 2019 at 7:23 AM Mark Miller <ma...@gmail.com>
>>> wrote:
>>>
>>>> bq.  They also would allow it to do it in an iterative manner without
>>>> changing everything at once.
>>>>
>>>> Sadly, you can't fix this piece by piece :) I dare anyone to try. I
>>>> encourage, I applaud the effort.
>>>>
>>>> The world is your oyster from a good spot - take your pick of how to do
>>>> things.
>>>>
>>>> But from this spot, if anyone thinks we are getting out design change
>>>> by design change, JIRA by JIRA, I'm so sorry. Let's commiserate in a couple
>>>> of years over a beer when you give up on that.
>>>>
>>>> - Mark
>>>>
>>>> On Sun, Nov 3, 2019 at 4:01 AM Jörn Franke <jo...@gmail.com>
>>>> wrote:
>>>>
>>>>> I cannot say anything about the statements, but maybe it could help to
>>>>> introduce Solr Improvement Proposals (SIP) similar to Kafka Improvement
>>>>> Proposals (KIP) or Flink Improvement Proposals (FLIP).
>>>>>
>>>>>   I think they are helpful to facilitate design decisions and
>>>>> refactoring / redesign decisions. They also would allow it to do it in an
>>>>> iterative manner without changing everything at once.
>>>>> The final version could be put in Solr's Git repository in Markdown,
>>>>> including figures presenting parts of the design.
>>>>>
>>>>> However, for developing them I propose a more inclusive approach where
>>>>> many people (not only core developers) can easily comment and support, e.g.
>>>>> Google Docs or similar.
>>>>>
>>>>> > On 03.11.2019 at 06:39, Noble Paul <no...@gmail.com> wrote:
>>>>> >
>>>>> > Solr has to do more than Lucene. A Lucene user is mostly a developer
>>>>> > who reads javadocs. A Solr user's touch points are
>>>>> >
>>>>> > * Public API
>>>>> > * Ref guide
>>>>> > * publicly visible files (in ZK as well as file system)
>>>>> > * What to see/look for in the log files to debug issues
>>>>> >
>>>>> > Then we have more nuanced touch points such as the knowledge base of
>>>>> > what happens internally in the system when 'X' API is invoked or when
>>>>> > 'Y' behavior is observed in ZK data.
>>>>> >
>>>>> > The problem with delaying the review process till code completion is
>>>>> > that any changes based on review comments will require a massive
>>>>> > amount of work.
>>>>> >
>>>>> > I don't have an answer to how we achieve it. But I clearly see this
>>>>> > as a major gap in our development process today.
>>>>> >
>>>>> > This discussion may not be relevant in this thread, maybe because no
>>>>> > behavior is changed at all. We don't know yet.
>>>>> >
>>>>> > What I want to believe is Mark is doing the right thing & it's gonna
>>>>> > help us all in dealing with our operational issues. I don't want to
>>>>> > interrupt his work with more discussions.
>>>>> >
>>>>> > Thank you
>>>>> >
>>>>> >
>>>>> > --
>>>>> > -----------------------------------------------------
>>>>> > Noble Paul
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>>>> >
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
>
>
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

From a credentials standpoint:

Yonik and I built 90% of it originally and then I spent years on it with
few other devs or users.

Pretty sure I'm the only one that has ever had 95%+ of the Solr test suites
work in under 10-15 seconds consistently - 4000 tests across like 1000
suites. Got them all to run in parallel in under 5 minutes vs the 20-45 it
takes on a good day after tons of other speed ups and fixes I've already
done.

I doubt there are many insane enough that have deep dived and pushed around
the entire code base for 2-3 weeks multiple times, 16-20 hours a day. That
has spent the last decade beyond that stupid time almost exclusively on
this system. Designing it with Yonik, building it, fixing it, helping
people with it, monitoring it, responding to pages and escalations for it.

That has spent half a year replacing the entire decade-old build for Lucene
and Solr and all its various nooks and crannies.

If anyone has spent more time on this system or pushing it around on a
large scale or has seen it in 100x the shape it is now more than once,
please speak up, you are in charge, I follow you. I'm not that bright, if
you have done the ground work, default to you.

Otherwise, I don't even have much confidence anyone else even knows this
system remotely well. All that time and effort and the most I know of it is
what awful awful shape it's in and the bad trend direction.


- Mark

On Sun, Nov 3, 2019 at 7:35 AM Mark Miller <ma...@gmail.com> wrote:

> Personally, I believe the latter so strongly, if I can’t convince the
> others in the raft with me, I’m jumping in and swimming to another raft
> after my entire adult life here.
>
> Mark
>
> On Sun, Nov 3, 2019 at 7:30 AM Mark Miller <ma...@gmail.com> wrote:
>
>> In fact this will be a fundamental difference some of us are about to
>> split between.
>>
>> Those that think they can ever fix the tests or the system or the 1000s
>> of bugs we have and keep adding due to our current world view of
>> making tests fit the system not the system fit the tests and the fact that
>> everything is so slow and retry and workaround that stupid shit works all
>> over. It's all deep. It's ingrained. It has grown over for a decade. It's a
>> project of 60 modules.
>>
>> Soon we will split between those that think they are making progress
>> across the ocean and those that think we are sitting in shark infested
>> waters waiting to die actually, starting to float backwards sometimes now.
>>
>> - Mark
>>
>> On Sun, Nov 3, 2019 at 7:23 AM Mark Miller <ma...@gmail.com> wrote:
>>
>>> bq.  They also would allow it to do it in an iterative manner without
>>> changing everything at once.
>>>
>>> Sadly, you can't fix this piece by piece :) I dare anyone to try. I
>>> encourage, I applaud the effort.
>>>
>>> The world is your oyster from a good spot - take your pick of how to do
>>> things.
>>>
>>> But from this spot, if anyone thinks we are getting out design change by
>>> design change, JIRA by JIRA, I'm so sorry. Let's commiserate in a couple
>>> of years over a beer when you give up on that.
>>>
>>> - Mark
>>>
>>> On Sun, Nov 3, 2019 at 4:01 AM Jörn Franke <jo...@gmail.com> wrote:
>>>
>>>> I cannot say anything about the statements, but maybe it could help to
>>>> introduce Solr Improvement Proposals (SIP) similar to Kafka Improvement
>>>> Proposals (KIP) or Flink Improvement Proposals (FLIP).
>>>>
>>>>   I think they are helpful to facilitate design decisions and
>>>> refactoring / redesign decisions. They also would allow it to do it in an
>>>> iterative manner without changing everything at once.
>>>> The final version could be put in Solr's Git repository in Markdown,
>>>> including figures presenting parts of the design.
>>>>
>>>> However, for developing them I propose a more inclusive approach where
>>>> many people (not only core developers) can easily comment and support, e.g.
>>>> Google Docs or similar.
>>>>
>>>> > On 03.11.2019 at 06:39, Noble Paul <no...@gmail.com> wrote:
>>>> >
>>>> > Solr has to do more than Lucene. A Lucene user is mostly a developer
>>>> > who reads javadocs. A Solr user's touch points are
>>>> >
>>>> > * Public API
>>>> > * Ref guide
>>>> > * publicly visible files (in ZK as well as file system)
>>>> > * What to see/look for in the log files to debug issues
>>>> >
>>>> > Then we have more nuanced touch points such as the knowledge base of
>>>> > what happens internally in the system when 'X' API is invoked or when
>>>> > 'Y' behavior is observed in ZK data.
>>>> >
>>>> > The problem with delaying the review process till code completion is
>>>> > that any changes based on review comments will require a massive amount
>>>> > of work.
>>>> >
>>>> > I don't have an answer to how we achieve it. But I clearly see this
>>>> > as a major gap in our development process today.
>>>> >
>>>> > This discussion may not be relevant in this thread, maybe because no
>>>> > behavior is changed at all. We don't know yet.
>>>> >
>>>> > What I want to believe is Mark is doing the right thing & it's gonna
>>>> > help us all in dealing with our operational issues. I don't want to
>>>> > interrupt his work with more discussions.
>>>> >
>>>> > Thank you
>>>> >
>>>> >
>>>> > --
>>>> > -----------------------------------------------------
>>>> > Noble Paul
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
> --
> - Mark
>
> http://about.me/markrmiller
>


-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

Personally, I believe the latter so strongly, if I can’t convince the
others in the raft with me, I’m jumping in and swimming to another raft
after my entire adult life here.

Mark

On Sun, Nov 3, 2019 at 7:30 AM Mark Miller <ma...@gmail.com> wrote:

> In fact this will be a fundamental difference some of us are about to
> split between.
>
> Those that think they can ever fix the tests or the system or the 1000s of
> bugs we have and keep adding due to our current world view of making tests
> fit the system not the system fit the tests and the fact that everything
> is so slow and retry and workaround that stupid shit works all over. It's
> all deep. It's ingrained. It has grown over for a decade. It's a project of
> 60 modules.
>
> Soon we will split between those that think they are making progress
> across the ocean and those that think we are sitting in shark infested
> waters waiting to die actually, starting to float backwards sometimes now.
>
> - Mark
>
> On Sun, Nov 3, 2019 at 7:23 AM Mark Miller <ma...@gmail.com> wrote:
>
>> bq.  They also would allow it to do it in an iterative manner without
>> changing everything at once.
>>
>> Sadly, you can't fix this piece by piece :) I dare anyone to try. I
>> encourage, I applaud the effort.
>>
>> The world is your oyster from a good spot - take your pick of how to do
>> things.
>>
>> But from this spot, if anyone thinks we are getting out design change by
>> design change, JIRA by JIRA, I'm so sorry. Let's commiserate in a couple
>> of years over a beer when you give up on that.
>>
>> - Mark
>>
>> On Sun, Nov 3, 2019 at 4:01 AM Jörn Franke <jo...@gmail.com> wrote:
>>
>>> I cannot say anything about the statements, but maybe it could help to
>>> introduce Solr Improvement Proposals (SIP) similar to Kafka Improvement
>>> Proposals (KIP) or Flink Improvement Proposals (FLIP).
>>>
>>>   I think they are helpful to facilitate design decisions and
>>> refactoring / redesign decisions. They also would allow it to do it in an
>>> iterative manner without changing everything at once.
>>> The final version could be put in Solr's Git repository in Markdown,
>>> including figures presenting parts of the design.
>>>
>>> However, for developing them I propose a more inclusive approach where
>>> many people (not only core developers) can easily comment and support, e.g.
>>> Google Docs or similar.
>>>
>>> > On 03.11.2019 at 06:39, Noble Paul <no...@gmail.com> wrote:
>>> >
>>> > Solr has to do more than Lucene. A Lucene user is mostly a developer
>>> > who reads javadocs. A Solr user's touch points are
>>> >
>>> > * Public API
>>> > * Ref guide
>>> > * publicly visible files (in ZK as well as file system)
>>> > * What to see/look for in the log files to debug issues
>>> >
>>> > Then we have more nuanced touch points such as the knowledge base of
>>> > what happens internally in the system when 'X' API is invoked or when
>>> > 'Y' behavior is observed in ZK data.
>>> >
>>> > The problem with delaying the review process till code completion is
>>> > that any changes based on review comments will require a massive amount
>>> > of work.
>>> >
>>> > I don't have an answer to how we achieve it. But I clearly see this
>>> > as a major gap in our development process today.
>>> >
>>> > This discussion may not be relevant in this thread, maybe because no
>>> > behavior is changed at all. We don't know yet.
>>> >
>>> > What I want to believe is Mark is doing the right thing & it's gonna
>>> > help us all in dealing with our operational issues. I don't want to
>>> > interrupt his work with more discussions.
>>> >
>>> > Thank you
>>> >
>>> >
>>> >> On Sun, Nov 3, 2019 at 3:32 PM David Smiley <da...@gmail.com>
>>> wrote:
>>> >>
>>> >> Yeah we do a bad job of the things you listed Noble.  :-(   My
>>> colleagues want pointers to internal docs but the sad reality is there
>>> isn't any.  You may notice I'm a stickler in my code reviews for requiring
>>> javadocs on all top level classes.  I think more javadocs and code comments
>>> would be very helpful -- especially for the major classes.  This might help
>>> us all and others a lot more.  For example I think Lucene does a rather
>>> fine job of this for its major classes -- IndexWriter being a good example.
>>> >>
>>> >> ~ David Smiley
>>> >> Apache Lucene/Solr Search Developer
>>> >> http://www.linkedin.com/in/davidwsmiley
>>> >>
>>> >>
>>> >>> On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <no...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> I believe there is a consensus on what is wrong with the way we have
>>> built the cluster state and overseer. We need to focus a bit more on the
>>> design aspect. Design, according to me, has the following elements:
>>> >>>
>>> >>> * How does it work?
>>> >>>
>>> >>> * What are the performance characteristics? Can it be done more
>>> efficiently?
>>> >>>
>>> >>> * What are the public touch points?
>>> >>>
>>> >>> ** Which are the files we store in ZK? Are they expected to be
>>> watched always?
>>> >>>
>>> >>> ** Or are they read on demand?
>>> >>>
>>> >>> ** The public APIs. Does it make sense to the user? Can it be
>>> further simplified? How does it compare to the other APIs in the system?
>>> >>>
>>> >>>
>>> >>> We, as a community, do a bad job in dealing with these. While we
>>> focus on internal things, these are not discussed before it is too late. We
>>> usually do coding, tests, code review (sometimes) and commit. This leads to
>>> huge technical debt.
>>> >>>
>>> >>>
>>> >>> This is not to put blame on one person or a group of people. (I
>>> occasionally see people discussing design issues upfront, I just hope that
>>> is the norm.)
>>> >>>
>>> >>>
>>> >>> Now, why am I discussing this in this thread?
>>> >>>
>>> >>>
>>> >>> While we agree there are problems, we are trying to solve the
>>> problem using the same process we used to create these problems. Again, I'm
>>> not questioning the intent or competence of anyone. Unless we set the
>>> process right, we are doomed to make the same mistakes again.
>>> >>>
>>> >>>
>>> >>> I whole heartedly endorse any effort to improve SolrCloud/overseer.
>>> At the same time I fail to see us leveraging the collective experience of
>>> our community through meaningful discussion.
>>> >>>
>>> >>>
>>> >>> I hope we don't resort to personal attacks and use this as an
>>> opportunity to improve our processes.
>>> >>> Thanks
>>> >>>
>>> >>> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dr...@gmail.com>
>>> wrote:
>>> >>>>
>>> >>>> Very much agreed.  I've been trying to figure out for a long time
>>> what is the point in having a replica DOWN state that has to be toggled
>>> (DOWN and then UP!) every time a node restarts.  Considering that we could
>>> just combine ACTIVE and `live_nodes` to understand whether a replica is
>>> available.  It's not even foolproof since kill -9 on a solr node won't mark
>>> all the replicas DOWN-- that doesn't happen until the node comes back up
>>> (perversely).
>>> >>>>
>>> >>>> What would it take to get to a state where restarting a node would
>>> require a minimal amount of ZK work in most cases?
>>> >>>>
>>> >>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com>
>>> wrote:
>>> >>>>>
>>> >>>>> Give me a short bit to follow up and I will lay out my case and
>>> proposal.
>>> >>>>>
>>> >>>>> Everyone is then free to decide that we need to do something
>>> drastic or that I'm wrong and we should just continue down the same road.
>>> If that's the case, a lot of your work will get a lot easier and less
>>> impeded by me and we will still all be happier. Win win.
>>> >>>>>
>>> >>>>> If we can just not make drastic changes for a just a brief week or
>>> so window, I'll say what I have to say, you guys can judge and do whatever
>>> you'd please.
>>> >>>>>
>>> >>>>> - mark
>>> >>>>>
>>> >>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com>
>>> wrote:
>>> >>>>>>
>>> >>>>>> Hey All Solr Dev's,
>>> >>>>>>
>>> >>>>>> SolrCloud is sick right now. The way low level Zookeeper is
>>> handeled, the Overseer, is mix and mess of proper exception handling and
>>> super slow startup and shutdown, adding new things all the time with no
>>> concern for performance or proper ordering (which is harder to tell than
>>> you think).
>>> >>>>>>
>>> >>>>>> Our class dependency graph doesn't even work - we just force it.
>>> Sort of. If the whole system  doesn't block and choke it's way to a start
>>> slow enough, lots of things fail.
>>> >>>>>>
>>> >>>>>> This thing coughs up, you toss stuff into the storm, a good chunk
>>> of time, what you want eventually come back without causing too much damage.
>>> >>>>>>
>>> >>>>>> There are so many things are are off or just plain wrong and the
>>> list is growing and growing. No one is following this or if you are, please
>>> back me up. This thing will collapse under it's own wait.
>>> >>>>>>
>>> >>>>>> So if you want to add yet another state format cluster state or
>>> some other optimization on this junk heap, you can expect me to push back.
>>> >>>>>>
>>> >>>>>> We should all be embarrassed by the state of things.
>>> >>>>>>
>>> >>>>>> I've got some ideas for addressing them that I'll share soon, but
>>> god, don't keep optimizing a turd in non backcompat Overseer loving ways.
>>> That Overseer is an atrocity.
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> - Mark
>>> >>>>>>
>>> >>>>>> http://about.me/markrmiller
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> - Mark
>>> >>>>>
>>> >>>>> http://about.me/markrmiller
>>> >
>>> >
>>> >
>>> > --
>>> > -----------------------------------------------------
>>> > Noble Paul
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> > For additional commands, e-mail: dev-help@lucene.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
>
>
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

In fact, this will be a fundamental difference that some of us are about to
split over.

On one side are those who think they can ever fix the tests, or the system,
or the 1000s of bugs we have and keep adding. We keep adding them because of
our current world view of making the tests fit the system rather than the
system fit the tests, and because everything is so slow and so full of
retries and workarounds that stupid shit works all over. It's all deep. It's
ingrained. It has grown over for a decade. It's a project of 60 modules.

Soon we will split between those who think they are making progress across
the ocean and those who think we are actually sitting in shark-infested
waters waiting to die, sometimes now even starting to float backwards.

- Mark

On Sun, Nov 3, 2019 at 7:23 AM Mark Miller <ma...@gmail.com> wrote:

> bq.  They also would allow it to do it in an iterative manner without
> changing everything at once.
>
> Sadly, you can't fix this piece by piece :) I dare anyone to try. I
> encourage, I applaud the effort.
>
> The world is your oyster from a good spot - take your pick of how to do
> things.
>
> But from this spot, if anyone thinks we are getting out design change by
> design change, JIRA by JIRA, I'm so sorry. Let's commiserate in a couple of
> years over a beer when you give up on that.
>
> - Mark
>
>
> --
> - Mark
>
> http://about.me/markrmiller
>


-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

bq.  They also would allow it to do it in an iterative manner without
changing everything at once.

Sadly, you can't fix this piece by piece :) I dare anyone to try. I
encourage, I applaud the effort.

The world is your oyster from a good spot - take your pick of how to do
things.

But from this spot, if anyone thinks we are getting out design change by
design change, JIRA by JIRA, I'm so sorry. Let's commiserate in a couple of
years over a beer when you give up on that.

- Mark

On Sun, Nov 3, 2019 at 4:01 AM Jörn Franke <jo...@gmail.com> wrote:

> I cannot say anything about the statements, but maybe it could help to
> introduce Solr Improvement Proposals (SIP) similar to Kafka Improvement
> Proposals (KIP) or Flink Improvement Proposals (FLIP).
>
> I think they help facilitate design decisions and refactoring / redesign
> decisions. They would also allow doing it in an iterative manner without
> changing everything at once.
> The final version could be kept in the Solr Git repository in Markdown,
> including figures presenting parts of the design.
>
> However, for developing them I propose a more inclusive approach where many
> people (not only core developers) can easily comment and support, e.g.
> Google Docs or similar.
>
>

-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Jörn Franke <jo...@gmail.com>.

I cannot say anything about the statements, but maybe it could help to introduce Solr Improvement Proposals (SIP) similar to Kafka Improvement Proposals (KIP) or Flink Improvement Proposals (FLIP).

I think they help facilitate design decisions and refactoring / redesign decisions. They would also allow doing it in an iterative manner without changing everything at once.
The final version could be kept in the Solr Git repository in Markdown, including figures presenting parts of the design.

However, for developing them I propose a more inclusive approach where many people (not only core developers) can easily comment and support, e.g. Google Docs or similar.

> On 03.11.2019 at 06:39, Noble Paul <no...@gmail.com> wrote:
> 
> Solr has to do more than Lucene. A Lucene user is mostly a developer
> who reads javadocs. A Solr user's touch points are:
> 
> * Public API
> * Ref guide
> * publicly visible files (in ZK as well as file system)
> * What to see/look for in the log files to debug issues
> 
> Then we have more nuanced touch points such as the knowledge base of
> what happens internally in the system when 'X' API is invoked or when
> 'Y' behavior is observed in ZK data.
> 
> The problem with delaying the review process till code completion is
> that any changes based on review comments will require a massive amount
> of work.
> 
> I don't have an answer to how we achieve it. But, I clearly see this
> as a major gap in our development process today.
> 
> This discussion may not be relevant in this thread, maybe because no
> behavior is changed at all. We don't know yet.
> 
> What I want to believe is Mark is doing the right thing & it's gonna
> help us all in dealing with our operational issues. I don't want to
> interrupt his work with more discussions.
> 
> Thank you
> 
> 
>> On Sun, Nov 3, 2019 at 3:32 PM David Smiley <da...@gmail.com> wrote:
>> 
>> Yeah, we do a bad job of the things you listed, Noble.  :-(   My colleagues want pointers to internal docs, but the sad reality is there aren't any.  You may notice I'm a stickler in my code reviews for requiring javadocs on all top-level classes.  I think more javadocs and code comments would be very helpful -- especially for the major classes.  This might help us all and others a lot more.  For example, I think Lucene does a rather fine job of this for its major classes -- IndexWriter being a good example.
>> 
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>> 
>> 
>>> On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <no...@gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I believe there is a consensus on what is wrong with the way we have built the cluster state and overseer. We need to focus a bit more on the design aspect. Design, according to me, has the following elements:
>>> 
>>> * How does it work?
>>> 
>>> * What are the performance characteristics? Can it be done more efficiently?
>>> 
>>> * What are the public touch points?
>>> 
>>> ** Which are the files we store in ZK? Are they expected to be watched always?
>>> 
>>> ** Or are they read on demand?
>>> 
>>> ** The public APIs. Do they make sense to the user? Can they be further simplified? How do they compare to the other APIs in the system?
>>> 
>>> 
>>> We, as a community, do a bad job in dealing with these. While we focus on internal things, these are not discussed before it is too late. We usually do coding, tests, code review (sometimes) and commit. This leads to huge technical debt.
>>> 
>>> 
>>> This is not to put blame on one person or a group of people. (I occasionally see people discussing design issues upfront, I just hope that is the norm.)
>>> 
>>> 
>>> Now, why am I discussing this in this thread?
>>> 
>>> 
>>> While we agree there are problems, we are trying to solve the problem using the same process we used to create these problems. Again, I'm not questioning the intent or competence of anyone. Unless we set the process right, we are doomed to make the same mistakes again.
>>> 
>>> 
>>> I wholeheartedly endorse any effort to improve SolrCloud/overseer. At the same time I fail to see us leveraging the collective experience of our community through meaningful discussion.
>>> 
>>> 
>>> I hope we don't resort to personal attacks and use this as an opportunity to improve our processes.
>>> Thanks
>>> 
>>> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dr...@gmail.com> wrote:
>>>> 
>>>> Very much agreed.  I've been trying to figure out for a long time what is the point in having a replica DOWN state that has to be toggled (DOWN and then UP!) every time a node restarts.  Considering that we could just combine ACTIVE and `live_nodes` to understand whether a replica is available.  It's not even foolproof since kill -9 on a solr node won't mark all the replicas DOWN-- that doesn't happen until the node comes back up (perversely).
>>>> 
>>>> What would it take to get to a state where restarting a node would require a minimal amount of ZK work in most cases?
>>>> 
>>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com> wrote:
>>>>> 
>>>>> Give me a short bit to follow up and I will lay out my case and proposal.
>>>>> 
>>>>> Everyone is then free to decide that we need to do something drastic or that I'm wrong and we should just continue down the same road. If that's the case, a lot of your work will get a lot easier and less impeded by me and we will still all be happier. Win win.
>>>>> 
>>>>> If we can just not make drastic changes for just a brief window of a week or so, I'll say what I have to say, and you guys can judge and do whatever you please.
>>>>> 
>>>>> - mark
>>>>> 
>>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com> wrote:
>>>>>> 
>>>>>> Hey all Solr devs,
>>>>>> 
>>>>>> SolrCloud is sick right now. The way low-level ZooKeeper is handled, the Overseer, is a mix and mess of proper exception handling and super slow startup and shutdown, adding new things all the time with no concern for performance or proper ordering (which is harder to tell than you think).
>>>>>> 
>>>>>> Our class dependency graph doesn't even work - we just force it. Sort of. If the whole system doesn't block and choke its way to a start slow enough, lots of things fail.
>>>>>> 
>>>>>> This thing coughs up, you toss stuff into the storm, a good chunk of time, what you want eventually comes back without causing too much damage.
>>>>>> 
>>>>>> There are so many things that are off or just plain wrong and the list is growing and growing. No one is following this or if you are, please back me up. This thing will collapse under its own weight.
>>>>>> 
>>>>>> So if you want to add yet another state format cluster state or some other optimization on this junk heap, you can expect me to push back.
>>>>>> 
>>>>>> We should all be embarrassed by the state of things.
>>>>>> 
>>>>>> I've got some ideas for addressing them that I'll share soon, but god, don't keep optimizing a turd in non backcompat Overseer loving ways. That Overseer is an atrocity.
>>>>>> 
>>>>>> --
>>>>>> - Mark
>>>>>> 
>>>>>> http://about.me/markrmiller
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> - Mark
>>>>> 
>>>>> http://about.me/markrmiller
> 
> 
> 
> -- 
> -----------------------------------------------------
> Noble Paul
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 


Re: SolrCloud is sick.

Posted by Noble Paul <no...@gmail.com>.

Solr has to do more than Lucene. A Lucene user is mostly a developer
who reads javadocs. A Solr user's touch points are

* Public API
* Ref guide
* publicly visible files (in ZK as well as file system)
* What to see/look for in the log files to debug issues

Then we have more nuanced touch points such as the knowledge base of
what happens internally in the system when 'X' API is invoked or when
'Y' behavior is observed in ZK data.

The problem with delaying the review process until code completion is
that any changes based on review comments will require a massive amount
of work.

I don't have an answer to how we achieve it. But, I clearly see this
as a major gap in our development process today.

This discussion may not be relevant to this thread, maybe because no
behavior is changed at all. We don't know yet.

What I want to believe is Mark is doing the right thing & it's gonna
help us all in dealing with our operational issues. I don't want to
interrupt his work with more discussions.

Thank you


On Sun, Nov 3, 2019 at 3:32 PM David Smiley <da...@gmail.com> wrote:
>
> Yeah we do a bad job of the things you listed Noble.  :-(   My colleagues want pointers to internal docs but the sad reality is there isn't any.  You may notice I'm a stickler in my code reviews for requiring javadocs on all top level classes.  I think more javadocs and code comments would be very helpful -- especially for the major classes.  This might help us all and others a lot more.  For example I think Lucene does a rather fine job of this for its major classes -- IndexWriter being a good example.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <no...@gmail.com> wrote:
>>
>> Hi,
>>
>> I believe there is a consensus on what is wrong with the way we have built the cluster state and overseer. We need to focus a bit more on the design aspect. Design, according to me, has the following elements:
>>
>> * How does it work?
>>
>> * What are the performance characteristics? Can it be done more efficiently?
>>
>> * What are the public touch points?
>>
>> ** Which are the files we store in ZK? Are they expected to be watched always?
>>
>> ** Or are they read on demand?
>>
>> ** The public APIs. Does it make sense to the user? Can it be further simplified? How does it compare to the other APIs in the system?
>>
>>
>> We, as a community, do a bad job in dealing with these. While we focus on internal things, these are not discussed before it is too late. We usually do coding, tests, code review (sometimes) and commit. This leads to huge technical debt.
>>
>>
>> This is not to put blame on one person or a group of people. (I occasionally see people discussing design issues upfront, I just hope that is the norm.)
>>
>>
>> Now, why am I discussing this in this thread?
>>
>>
>> While we agree there are problems, we are trying to solve the problem using the same process we used to create these problems. Again, I'm not questioning the intent or competence of anyone. Unless we set the process right, we are doomed to make the same mistakes again.
>>
>>
>> I whole heartedly endorse any effort to improve SolrCloud/overseer. At the same time I fail to see us leveraging the collective experience of our community through meaningful discussion.
>>
>>
>> I hope we don't resort to personal attacks and use this as an opportunity to improve our processes.
>> Thanks
>>
>> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dr...@gmail.com> wrote:
>>>
>>> Very much agreed.  I've been trying to figure out for a long time what is the point in having a replica DOWN state that has to be toggled (DOWN and then UP!) every time a node restarts.  Considering that we could just combine ACTIVE and `live_nodes` to understand whether a replica is available.  It's not even foolproof since kill -9 on a solr node won't mark all the replicas DOWN-- that doesn't happen until the node comes back up (perversely).
>>>
>>> What would it take to get to a state where restarting a node would require a minimal amount of ZK work in most cases?
>>>
>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com> wrote:
>>>>
>>>> Give me a short bit to follow up and I will lay out my case and proposal.
>>>>
>>>> Everyone is then free to decide that we need to do something drastic or that I'm wrong and we should just continue down the same road. If that's the case, a lot of your work will get a lot easier and less impeded by me and we will still all be happier. Win win.
>>>>
>>>> If we can just not make drastic changes for a just a brief week or so window, I'll say what I have to say, you guys can judge and do whatever you'd please.
>>>>
>>>> - mark
>>>>
>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com> wrote:
>>>>>
>>>>> Hey All Solr Dev's,
>>>>>
>>>>> SolrCloud is sick right now. The way low level Zookeeper is handeled, the Overseer, is mix and mess of proper exception handling and super slow startup and shutdown, adding new things all the time with no concern for performance or proper ordering (which is harder to tell than you think).
>>>>>
>>>>> Our class dependency graph doesn't even work - we just force it. Sort of. If the whole system  doesn't block and choke it's way to a start slow enough, lots of things fail.
>>>>>
>>>>> This thing coughs up, you toss stuff into the storm, a good chunk of time, what you want eventually come back without causing too much damage.
>>>>>
>>>>> There are so many things are are off or just plain wrong and the list is growing and growing. No one is following this or if you are, please back me up. This thing will collapse under it's own wait.
>>>>>
>>>>> So if you want to add yet another state format cluster state or some other optimization on this junk heap, you can expect me to push back.
>>>>>
>>>>> We should all be embarrassed by the state of things.
>>>>>
>>>>> I've got some ideas for addressing them that I'll share soon, but god, don't keep optimizing a turd in non backcompat Overseer loving ways. That Overseer is an atrocity.
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>
>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller



-- 
-----------------------------------------------------
Noble Paul


Re: SolrCloud is sick.

Posted by David Smiley <da...@gmail.com>.

Yeah, we do a bad job of the things you listed, Noble.  :-(   My colleagues
want pointers to internal docs but the sad reality is there aren't any.  You
may notice I'm a stickler in my code reviews for requiring javadocs on all
top level classes.  I think more javadocs and code comments would be very
helpful -- especially for the major classes.  This might help us all and
others a lot more.  For example I think Lucene does a rather fine job of
this for its major classes -- IndexWriter being a good example.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sat, Nov 2, 2019 at 7:32 PM Noble Paul <no...@gmail.com> wrote:

> Hi,
>
> I believe there is a consensus on what is wrong with the way we have built
> the cluster state and overseer. We need to focus a bit more on the design
> aspect. Design, according to me, has the following elements:
>
> * How does it work?
>
> * What are the performance characteristics? Can it be done more
> efficiently?
>
> * What are the public touch points?
>
> ** Which are the files we store in ZK? Are they expected to be watched
> always?
>
> ** Or are they read on demand?
>
> ** The public APIs. Does it make sense to the user? Can it be further
> simplified? How does it compare to the other APIs in the system?
>
>
> We, as a community, do a bad job in dealing with these. While we focus on
> internal things, these are not discussed before it is too late. We usually
> do coding, tests, code review (sometimes) and commit. This leads to huge
> technical debt.
>
>
> This is not to put blame on one person or a group of people. (I
> occasionally see people discussing design issues upfront, I just hope that
> is the norm.)
>
>
> Now, why am I discussing this in this thread?
>
>
> While we agree there are problems, we are trying to solve the problem
> using the same process we used to create these problems. Again, I'm not
> questioning the intent or competence of anyone. Unless we set the process
> right, we are doomed to make the same mistakes again.
>
>
> I whole heartedly endorse any effort to improve SolrCloud/overseer. At the
> same time I fail to see us leveraging the collective experience of our
> community through meaningful discussion.
>
>
> I hope we don't resort to personal attacks and use this as an opportunity
> to improve our processes.
> Thanks
>
> On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dr...@gmail.com> wrote:
>
>> Very much agreed.  I've been trying to figure out for a long time what is
>> the point in having a replica DOWN state that has to be toggled (DOWN and
>> then UP!) every time a node restarts.  Considering that we could just
>> combine ACTIVE and `live_nodes` to understand whether a replica is
>> available.  It's not even foolproof since kill -9 on a solr node won't mark
>> all the replicas DOWN-- that doesn't happen until the node comes back up
>> (perversely).
>>
>> What would it take to get to a state where restarting a node would
>> require a minimal amount of ZK work in most cases?
>>
>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com> wrote:
>>
>>> Give me a short bit to follow up and I will lay out my case and proposal.
>>>
>>> Everyone is then free to decide that we need to do something drastic or
>>> that I'm wrong and we should just continue down the same road. If that's
>>> the case, a lot of your work will get a lot easier and less impeded by me
>>> and we will still all be happier. Win win.
>>>
>>> If we can just not make drastic changes for a just a brief week or so
>>> window, I'll say what I have to say, you guys can judge and do whatever
>>> you'd please.
>>>
>>> - mark
>>>
>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com>
>>> wrote:
>>>
>>>> Hey All Solr Dev's,
>>>>
>>>> SolrCloud is sick right now. The way low level Zookeeper is handeled,
>>>> the Overseer, is mix and mess of proper exception handling and super slow
>>>> startup and shutdown, adding new things all the time with no concern for
>>>> performance or proper ordering (which is harder to tell than you think).
>>>>
>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>> of. If the whole system  doesn't block and choke it's way to a start slow
>>>> enough, lots of things fail.
>>>>
>>>> This thing coughs up, you toss stuff into the storm, a good chunk of
>>>> time, what you want eventually come back without causing too much damage.
>>>>
>>>> There are so many things are are off or just plain wrong and the list
>>>> is growing and growing. No one is following this or if you are, please back
>>>> me up. This thing will collapse under it's own wait.
>>>>
>>>> So if you want to add yet another state format cluster state or some
>>>> other optimization on this junk heap, you can expect me to push back.
>>>>
>>>> We should all be embarrassed by the state of things.
>>>>
>>>> I've got some ideas for addressing them that I'll share soon, but god,
>>>> don't keep optimizing a turd in non backcompat Overseer loving ways. That
>>>> Overseer is an atrocity.
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>

Re: SolrCloud is sick.

Posted by Noble Paul <no...@gmail.com>.

Hi,

I believe there is a consensus on what is wrong with the way we have built
the cluster state and overseer. We need to focus a bit more on the design
aspect. Design, according to me, has the following elements:

* How does it work?

* What are the performance characteristics? Can it be done more efficiently?

* What are the public touch points?

** Which are the files we store in ZK? Are they expected to be watched
always?

** Or are they read on demand?

** The public APIs. Do they make sense to the user? Can they be further
simplified? How do they compare to the other APIs in the system?

We, as a community, do a bad job in dealing with these. While we focus on
internal things, these are not discussed before it is too late. We usually
do coding, tests, code review (sometimes) and commit. This leads to huge
technical debt.

This is not to put blame on one person or a group of people. (I
occasionally see people discussing design issues upfront; I just hope that
becomes the norm.)

Now, why am I discussing this in this thread?

While we agree there are problems, we are trying to solve the problem using
the same process we used to create these problems. Again, I'm not
questioning the intent or competence of anyone. Unless we set the process
right, we are doomed to make the same mistakes again.

I wholeheartedly endorse any effort to improve SolrCloud/overseer. At the
same time I fail to see us leveraging the collective experience of our
community through meaningful discussion.

I hope we don't resort to personal attacks and instead use this as an
opportunity to improve our processes.
Thanks

On Sun, Nov 3, 2019, 9:52 AM Scott Blum <dr...@gmail.com> wrote:

> Very much agreed.  I've been trying to figure out for a long time what is
> the point in having a replica DOWN state that has to be toggled (DOWN and
> then UP!) every time a node restarts.  Considering that we could just
> combine ACTIVE and `live_nodes` to understand whether a replica is
> available.  It's not even foolproof since kill -9 on a solr node won't mark
> all the replicas DOWN-- that doesn't happen until the node comes back up
> (perversely).
>
> What would it take to get to a state where restarting a node would require
> a minimal amount of ZK work in most cases?
>
> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com> wrote:
>
>> Give me a short bit to follow up and I will lay out my case and proposal.
>>
>> Everyone is then free to decide that we need to do something drastic or
>> that I'm wrong and we should just continue down the same road. If that's
>> the case, a lot of your work will get a lot easier and less impeded by me
>> and we will still all be happier. Win win.
>>
>> If we can just not make drastic changes for a just a brief week or so
>> window, I'll say what I have to say, you guys can judge and do whatever
>> you'd please.
>>
>> - mark
>>
>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com> wrote:
>>
>>> Hey All Solr Dev's,
>>>
>>> SolrCloud is sick right now. The way low level Zookeeper is handeled,
>>> the Overseer, is mix and mess of proper exception handling and super slow
>>> startup and shutdown, adding new things all the time with no concern for
>>> performance or proper ordering (which is harder to tell than you think).
>>>
>>> Our class dependency graph doesn't even work - we just force it. Sort
>>> of. If the whole system  doesn't block and choke it's way to a start slow
>>> enough, lots of things fail.
>>>
>>> This thing coughs up, you toss stuff into the storm, a good chunk of
>>> time, what you want eventually come back without causing too much damage.
>>>
>>> There are so many things are are off or just plain wrong and the list is
>>> growing and growing. No one is following this or if you are, please back me
>>> up. This thing will collapse under it's own wait.
>>>
>>> So if you want to add yet another state format cluster state or some
>>> other optimization on this junk heap, you can expect me to push back.
>>>
>>> We should all be embarrassed by the state of things.
>>>
>>> I've got some ideas for addressing them that I'll share soon, but god,
>>> don't keep optimizing a turd in non backcompat Overseer loving ways. That
>>> Overseer is an atrocity.
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
>

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

Things are also counterintuitive. The more you fix and the faster things
work, the more things fail. It’s like rings of hell.

Mark

On Sat, Nov 2, 2019 at 10:29 PM Mark Miller <ma...@gmail.com> wrote:

> And it didnt get any easier. What I did about it is kill myself multiple
> times over 2 years for weeks on end of torturing my wife. And I found a
> million problems, a million bugs, a million terrible inefficiencies. And I
> fixed and lost countless of them friggen twice. And didnt lose tons of the
> work as well. And so it's not easy to get out of this. Its not easy at all.
> And i havent even done the hard part yet.
>
> - Mark
>
> On Sat, Nov 2, 2019 at 10:24 PM Mark Miller <ma...@gmail.com> wrote:
>
>> I mean the reality is - why do we not have just a single watcher per node
>> pulling in state. We are we not tracking and minimizing state transfers and
>> changes? Why are we not measuring the time it takes to round trip a
>> state.json and adjusting? Looking at load to adjust overseerish duties and
>> leader election? A million other smart things?
>>
>> Because it's too hard. It's too hard and we all gave up long ago on
>> figuring out what to do about it. Because we are programming in assembly in
>> an abyss when we should be doing java in the clouds.
>>
>> Everyone knows the SolrCloud DNA one way or another.We all somehow made
>> our peace with it or not.
>>
>> It's easy when you dont go deep. Hell thats easy to forget even if you do.
>>
>> But I'm looping on it now, have to eject.
>>
>> - Mark
>>
>> On Sat, Nov 2, 2019 at 10:15 PM Mark Miller <ma...@gmail.com>
>> wrote:
>>
>>> Not much. Something you can understand. How about tests < 10 seconds
>>> fail or not. Good logging and as a backup good debug logging. Docs on how
>>> things are designed to work? Tracking of all important operations and how
>>> long they take with tight cutoffs? Proper response to interruption 100% of
>>> the time? The idea of a cluster start and stop? Of a cluster install to ZK
>>> initially. Drop all legacyCloud support, stateformat=1 support, maybe a few
>>> other things.
>>>
>>> I've got some stuff, I'm gonna pull out as fast as I sensibly can given
>>> many setbacks and too little sleep for a long time.
>>>
>>> I'm not here to do all the of the lift for everyone, but unless I get
>>> sick in the next week or two or my 10 backup methods and git pushes and
>>> backup branches fail or I just burn the hell out, I have a solid refuge
>>> that we can knock out and then build on with confidence.
>>>
>>> - Mark
>>>
>>> On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dr...@gmail.com> wrote:
>>>
>>>> Very much agreed.  I've been trying to figure out for a long time what
>>>> is the point in having a replica DOWN state that has to be toggled (DOWN
>>>> and then UP!) every time a node restarts.  Considering that we could just
>>>> combine ACTIVE and `live_nodes` to understand whether a replica is
>>>> available.  It's not even foolproof since kill -9 on a solr node won't mark
>>>> all the replicas DOWN-- that doesn't happen until the node comes back up
>>>> (perversely).
>>>>
>>>> What would it take to get to a state where restarting a node would
>>>> require a minimal amount of ZK work in most cases?
>>>>
>>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Give me a short bit to follow up and I will lay out my case and
>>>>> proposal.
>>>>>
>>>>> Everyone is then free to decide that we need to do something drastic
>>>>> or that I'm wrong and we should just continue down the same road. If that's
>>>>> the case, a lot of your work will get a lot easier and less impeded by me
>>>>> and we will still all be happier. Win win.
>>>>>
>>>>> If we can just not make drastic changes for a just a brief week or so
>>>>> window, I'll say what I have to say, you guys can judge and do whatever
>>>>> you'd please.
>>>>>
>>>>> - mark
>>>>>
>>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey All Solr Dev's,
>>>>>>
>>>>>> SolrCloud is sick right now. The way low level Zookeeper is handeled,
>>>>>> the Overseer, is mix and mess of proper exception handling and super slow
>>>>>> startup and shutdown, adding new things all the time with no concern for
>>>>>> performance or proper ordering (which is harder to tell than you think).
>>>>>>
>>>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>>>> of. If the whole system  doesn't block and choke it's way to a start slow
>>>>>> enough, lots of things fail.
>>>>>>
>>>>>> This thing coughs up, you toss stuff into the storm, a good chunk of
>>>>>> time, what you want eventually come back without causing too much damage.
>>>>>>
>>>>>> There are so many things are are off or just plain wrong and the list
>>>>>> is growing and growing. No one is following this or if you are, please back
>>>>>> me up. This thing will collapse under it's own wait.
>>>>>>
>>>>>> So if you want to add yet another state format cluster state or some
>>>>>> other optimization on this junk heap, you can expect me to push back.
>>>>>>
>>>>>> We should all be embarrassed by the state of things.
>>>>>>
>>>>>> I've got some ideas for addressing them that I'll share soon, but
>>>>>> god, don't keep optimizing a turd in non backcompat Overseer loving ways.
>>>>>> That Overseer is an atrocity.
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://about.me/markrmiller
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>>
>>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://about.me/markrmiller
>>>
>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
>
>
> --
> - Mark
>
> http://about.me/markrmiller
>
-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

And it didn't get any easier. What I did about it is kill myself multiple
times over 2 years for weeks on end of torturing my wife. And I found a
million problems, a million bugs, a million terrible inefficiencies. And I
fixed and lost countless of them friggen twice. And didn't lose tons of the
work as well. And so it's not easy to get out of this. It's not easy at all.
And I haven't even done the hard part yet.

- Mark

On Sat, Nov 2, 2019 at 10:24 PM Mark Miller <ma...@gmail.com> wrote:

> I mean the reality is - why do we not have just a single watcher per node
> pulling in state. We are we not tracking and minimizing state transfers and
> changes? Why are we not measuring the time it takes to round trip a
> state.json and adjusting? Looking at load to adjust overseerish duties and
> leader election? A million other smart things?
>
> Because it's too hard. It's too hard and we all gave up long ago on
> figuring out what to do about it. Because we are programming in assembly in
> an abyss when we should be doing java in the clouds.
>
> Everyone knows the SolrCloud DNA one way or another.We all somehow made
> our peace with it or not.
>
> It's easy when you dont go deep. Hell thats easy to forget even if you do.
>
> But I'm looping on it now, have to eject.
>
> - Mark
>
> On Sat, Nov 2, 2019 at 10:15 PM Mark Miller <ma...@gmail.com> wrote:
>
>> Not much. Something you can understand. How about tests < 10 seconds fail
>> or not. Good logging and as a backup good debug logging. Docs on how things
>> are designed to work? Tracking of all important operations and how long
>> they take with tight cutoffs? Proper response to interruption 100% of the
>> time? The idea of a cluster start and stop? Of a cluster install to ZK
>> initially. Drop all legacyCloud support, stateformat=1 support, maybe a few
>> other things.
>>
>> I've got some stuff, I'm gonna pull out as fast as I sensibly can given
>> many setbacks and too little sleep for a long time.
>>
>> I'm not here to do all the of the lift for everyone, but unless I get
>> sick in the next week or two or my 10 backup methods and git pushes and
>> backup branches fail or I just burn the hell out, I have a solid refuge
>> that we can knock out and then build on with confidence.
>>
>> - Mark
>>
>> On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dr...@gmail.com> wrote:
>>
>>> Very much agreed.  I've been trying to figure out for a long time what
>>> is the point in having a replica DOWN state that has to be toggled (DOWN
>>> and then UP!) every time a node restarts.  Considering that we could just
>>> combine ACTIVE and `live_nodes` to understand whether a replica is
>>> available.  It's not even foolproof since kill -9 on a solr node won't mark
>>> all the replicas DOWN-- that doesn't happen until the node comes back up
>>> (perversely).
>>>
>>> What would it take to get to a state where restarting a node would
>>> require a minimal amount of ZK work in most cases?
>>>
>>> On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com>
>>> wrote:
>>>
>>>> Give me a short bit to follow up and I will lay out my case and
>>>> proposal.
>>>>
>>>> Everyone is then free to decide that we need to do something drastic or
>>>> that I'm wrong and we should just continue down the same road. If that's
>>>> the case, a lot of your work will get a lot easier and less impeded by me
>>>> and we will still all be happier. Win win.
>>>>
>>>> If we can just not make drastic changes for a just a brief week or so
>>>> window, I'll say what I have to say, you guys can judge and do whatever
>>>> you'd please.
>>>>
>>>> - mark
>>>>
>>>> On Fri, Nov 1, 2019 at 7:46 PM Mark Miller <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey All Solr Dev's,
>>>>>
>>>>> SolrCloud is sick right now. The way low level Zookeeper is handeled,
>>>>> the Overseer, is mix and mess of proper exception handling and super slow
>>>>> startup and shutdown, adding new things all the time with no concern for
>>>>> performance or proper ordering (which is harder to tell than you think).
>>>>>
>>>>> Our class dependency graph doesn't even work - we just force it. Sort
>>>>> of. If the whole system  doesn't block and choke it's way to a start slow
>>>>> enough, lots of things fail.
>>>>>
>>>>> This thing coughs up, you toss stuff into the storm, a good chunk of
>>>>> time, what you want eventually come back without causing too much damage.
>>>>>
>>>>> There are so many things are are off or just plain wrong and the list
>>>>> is growing and growing. No one is following this or if you are, please back
>>>>> me up. This thing will collapse under it's own wait.
>>>>>
>>>>> So if you want to add yet another state format cluster state or some
>>>>> other optimization on this junk heap, you can expect me to push back.
>>>>>
>>>>> We should all be embarrassed by the state of things.
>>>>>
>>>>> I've got some ideas for addressing them that I'll share soon, but god,
>>>>> don't keep optimizing a turd in non backcompat Overseer loving ways. That
>>>>> Overseer is an atrocity.
>>>>>
>>>>> --
>>>>> - Mark
>>>>>
>>>>> http://about.me/markrmiller
>>>>>
>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://about.me/markrmiller
>>>>
>>>
>>
>> --
>> - Mark
>>
>> http://about.me/markrmiller
>>
>
>
> --
> - Mark
>
> http://about.me/markrmiller
>


-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

I mean the reality is - why do we not have just a single watcher per node
pulling in state? Why are we not tracking and minimizing state transfers and
changes? Why are we not measuring the time it takes to round trip a
state.json and adjusting? Looking at load to adjust overseerish duties and
leader election? A million other smart things?

Because it's too hard. It's too hard and we all gave up long ago on
figuring out what to do about it. Because we are programming in assembly in
an abyss when we should be doing Java in the clouds.

Everyone knows the SolrCloud DNA one way or another. We all somehow made our
peace with it or not.

It's easy when you don't go deep. Hell, that's easy to forget even if you do.

But I'm looping on it now, have to eject.

- Mark
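
The "single watcher per node" idea in this message could be sketched roughly
as follows. This is only an illustration under stated assumptions, not Solr
code: NodeStateHub and all of its methods are hypothetical names, the real
ZooKeeper watch is replaced by a plain counter, and the cluster state is
modelled as a serialized string. The sketch shows one upstream watch fanning
out to many local listeners and skipping updates that do not change anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical per-node hub: one upstream watch, many local listeners. */
final class NodeStateHub {
    private final List<Consumer<String>> listeners = new ArrayList<>();
    private int upstreamWatches = 0;  // watches this node holds on the state source
    private int deliveredChanges = 0; // real state changes pushed to listeners
    private String lastState = null;  // last serialized state seen (assumed to be a String)

    /** Local components register here instead of each setting their own watch. */
    synchronized void addListener(Consumer<String> listener) {
        if (upstreamWatches == 0) {
            upstreamWatches = 1; // stand-in for registering the single real watch
        }
        listeners.add(listener);
        if (lastState != null) {
            listener.accept(lastState); // late joiners get the current state
        }
    }

    /** Called when the single upstream watch fires with the new state. */
    synchronized void onUpstreamChange(String newState) {
        if (newState.equals(lastState)) {
            return; // no-op update: nothing to transfer to listeners
        }
        lastState = newState;
        deliveredChanges++;
        for (Consumer<String> listener : listeners) {
            listener.accept(newState);
        }
    }

    synchronized int upstreamWatches() { return upstreamWatches; }

    synchronized int deliveredChanges() { return deliveredChanges; }
}
```

With two listeners registered, the hub still holds a single upstream watch,
and a repeated identical state is dropped before it reaches any listener.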

On Sat, Nov 2, 2019 at 10:15 PM Mark Miller <ma...@gmail.com> wrote:

> Not much. Something you can understand. How about tests < 10 seconds fail
> or not. Good logging and as a backup good debug logging. Docs on how things
> are designed to work? Tracking of all important operations and how long
> they take with tight cutoffs? Proper response to interruption 100% of the
> time? The idea of a cluster start and stop? Of a cluster install to ZK
> initially. Drop all legacyCloud support, stateformat=1 support, maybe a few
> other things.
>
> I've got some stuff, I'm gonna pull out as fast as I sensibly can given
> many setbacks and too little sleep for a long time.
>
> I'm not here to do all the of the lift for everyone, but unless I get sick
> in the next week or two or my 10 backup methods and git pushes and backup
> branches fail or I just burn the hell out, I have a solid refuge that we
> can knock out and then build on with confidence.
>
> - Mark
>
> On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dr...@gmail.com> wrote:
>
>> Very much agreed.  I've been trying to figure out for a long time what is
>> the point in having a replica DOWN state that has to be toggled (DOWN and
>> then UP!) every time a node restarts.  Considering that we could just
>> combine ACTIVE and `live_nodes` to understand whether a replica is
>> available.  It's not even foolproof since kill -9 on a solr node won't mark
>> all the replicas DOWN-- that doesn't happen until the node comes back up
>> (perversely).
>>
>> What would it take to get to a state where restarting a node would
>> require a minimal amount of ZK work in most cases?
>
> --
> - Mark
>
> http://about.me/markrmiller
>


-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Mark Miller <ma...@gmail.com>.

Not much. Something you can understand. How about tests that finish in
under 10 seconds, pass or fail? Good logging, and good debug logging as a
backup. Docs on how things are designed to work? Tracking of all important
operations and how long they take, with tight cutoffs? Proper response to
interruption 100% of the time? The idea of a cluster start and stop? Of a
cluster install to ZK initially? Drop all legacyCloud support,
stateformat=1 support, maybe a few other things.

I've got some stuff that I'm going to pull out as fast as I sensibly can,
given many setbacks and too little sleep for a long time.

I'm not here to do all of the lifting for everyone, but unless I get sick
in the next week or two, or my 10 backup methods and git pushes and backup
branches fail, or I just burn the hell out, I have a solid refuge that we
can knock out and then build on with confidence.
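
As a rough illustration of tracking operations against tight cutoffs, here
is a minimal sketch in Python. Everything in it (`tracked`, `timings`,
`CutoffExceeded`, the operation names) is hypothetical and is not taken
from Solr; it only checks the elapsed time after the operation finishes
and does not interrupt a slow operation.

```python
import time
from contextlib import contextmanager

# Hypothetical registry: operation name -> seconds the last run took.
timings = {}


class CutoffExceeded(Exception):
    """Raised when a tracked operation ran longer than its cutoff."""


@contextmanager
def tracked(name, cutoff_seconds, clock=time.monotonic):
    """Record how long the wrapped block takes and enforce a cutoff.

    The check happens after the block completes; a failure inside the
    block propagates unchanged, but its duration is still recorded.
    """
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        timings[name] = elapsed
    if elapsed > cutoff_seconds:
        raise CutoffExceeded(
            f"{name} took {elapsed:.3f}s, cutoff is {cutoff_seconds:.3f}s"
        )


# Usage: a cheap operation with a generous cutoff.
with tracked("fast_op", cutoff_seconds=1.0):
    sum(range(1000))
```

Passing the clock in as a parameter is a design choice for testability: a
test can supply a fake clock instead of sleeping.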

- Mark

On Sat, Nov 2, 2019 at 5:52 PM Scott Blum <dr...@gmail.com> wrote:

> Very much agreed.  I've been trying to figure out for a long time what is
> the point in having a replica DOWN state that has to be toggled (DOWN and
> then UP!) every time a node restarts.  Considering that we could just
> combine ACTIVE and `live_nodes` to understand whether a replica is
> available.  It's not even foolproof since kill -9 on a solr node won't mark
> all the replicas DOWN-- that doesn't happen until the node comes back up
> (perversely).
>
> What would it take to get to a state where restarting a node would require
> a minimal amount of ZK work in most cases?

-- 
- Mark

http://about.me/markrmiller

Re: SolrCloud is sick.

Posted by Scott Blum <dr...@gmail.com>.

Very much agreed.  I've been trying to figure out for a long time what the
point is of having a replica DOWN state that has to be toggled (DOWN and
then UP!) every time a node restarts, considering that we could just
combine ACTIVE and `live_nodes` to understand whether a replica is
available.  It's not even foolproof, since kill -9 on a Solr node won't
mark all the replicas DOWN; that doesn't happen until the node comes back
up (perversely).

What would it take to get to a state where restarting a node would require
a minimal amount of ZK work in most cases?
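
To make the idea concrete, here is a minimal sketch of deriving
availability from the stored replica state plus `live_nodes` membership,
rather than from a DOWN state that is toggled on restart. The data shapes
and names (`Replica`, `is_available`, the node name strings) are
hypothetical and simplified for illustration; this is not Solr's actual
API.

```python
from dataclasses import dataclass
from typing import AbstractSet


@dataclass(frozen=True)
class Replica:
    """Hypothetical, simplified view of a replica entry in cluster state."""
    name: str
    node_name: str  # node hosting the replica
    state: str      # last state written to ZK, e.g. "active" or "recovering"


def is_available(replica: Replica, live_nodes: AbstractSet[str]) -> bool:
    """Usable only if the stored state is active AND the hosting node is live."""
    return replica.state == "active" and replica.node_name in live_nodes


replicas = [
    Replica("core_node1", "host1:8983_solr", "active"),
    Replica("core_node2", "host2:8983_solr", "active"),
    Replica("core_node3", "host3:8983_solr", "recovering"),
]

# Assume host2 was killed with kill -9: its live_nodes entry is gone,
# but nothing rewrote the stored state of its replica.
live_nodes = {"host1:8983_solr", "host3:8983_solr"}

available = [r.name for r in replicas if is_available(r, live_nodes)]
print(available)
```

Under this rule, readers get the right answer for the killed node without
any per-replica state write, which is the property I'm after for restarts.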

On Sat, Nov 2, 2019 at 5:44 PM Mark Miller <ma...@gmail.com> wrote:

> Give me a short bit to follow up and I will lay out my case and proposal.
>
> Everyone is then free to decide that we need to do something drastic or
> that I'm wrong and we should just continue down the same road. If that's
> the case, a lot of your work will get a lot easier and less impeded by me
> and we will still all be happier. Win win.
>
> If we can just not make drastic changes for just a brief week or so
> window, I'll say what I have to say, you guys can judge and do whatever
> you'd please.
>
> - mark