Posted to issues@lucene.apache.org by "Mark Miller (Jira)" <ji...@apache.org> on 2019/10/24 00:49:00 UTC

[jira] [Created] (SOLR-13867) Make Solrcloud stable and performant and capable of having passing tests.

Mark Miller created SOLR-13867:
----------------------------------

             Summary: Make Solrcloud stable and performant and capable of having passing tests.
                 Key: SOLR-13867
                 URL: https://issues.apache.org/jira/browse/SOLR-13867
             Project: Solr
          Issue Type: Task
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Mark Miller
            Assignee: Mark Miller
             Fix For: master (9.0)


After spending a bit of time away from SolrCloud and being deeply involved in trying to stabilize it and its tests, I came back in 2018 and went deep into the system with the Starburst upgrade.

What I found surprised me, though I guess it should not have. The system is slow, often silly, super buggy, and not good at connection reuse, thread safety, efficient ZooKeeper communication, or efficient startup and shutdown.

Often, the things we do to make tests pass make the system worse, because you can't do things reasonably without some major code work.

Twice now, I've seen the system in the shape it was supposed to take. FAST. Not bug free, but 100X more solid at least and much, much, much faster.

The current system is sick and actually getting worse under its own weight as more is shoveled on top. Even compared to 1.5 years ago, the problems are worse, not better.

Sadly, I'm smart enough to know what has to be done, but not smart enough to do most of it twice and then lose most of it twice.

Nonetheless, it's time to fix SolrCloud. It's not supposed to be this way.

I spent a lot of time after Starburst making tests pass for me. Then a lot of time on a better build system that can help us improve development and good practices around the project. And then a lot of time making tests faster. These are important steps, but they are little itty bitty baby steps that do not address the core rot that is growing. We don't find a problem, fully understand what is up, and craft a careful solution. We find something that we can toss into the Grand Canyon, listen to it bounce around for a while, and if nobody screams, we move on to the next thing. That's not necessarily anyone's choice; there is little else you can do until the system is fixed. When that happens, we can start making smart changes instead of just shoving around the mess.

Twice I have made the current system fast. What happens first? Nothing works. The system doesn't know how to be fast. It doesn't have the thread safety or proper logic to be fast. And that is not a place I want to be.



