Posted to dev@solr.apache.org by Ishan Chattopadhyaya <ic...@gmail.com> on 2023/01/17 18:40:34 UTC

Re: [jira] [Commented] (SOLR-16622) Replicas don't come up active after node restart

I agree, Michael. We should add more functional validation to the
benchmarks now. That is the lesson from this episode.
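
A rough sketch of what one such functional check might look like, purely
as an illustration and not something solr-bench does today: after the
restart step, fetch CLUSTERSTATUS from the Collections API and fail the
run if any replica reports a state other than "active". The helper names
and the sample payload are hypothetical; the JSON layout is assumed to
follow the usual CLUSTERSTATUS response shape, and the base URL is the one
from the reproduction steps in the issue.

```python
import json
import urllib.parse
import urllib.request


def inactive_replicas(cluster_status, collection):
    """Return (shard, replica, state) for every replica not reported as active.

    `cluster_status` is assumed to be a parsed CLUSTERSTATUS response.
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    problems = []
    for shard_name, shard in sorted(shards.items()):
        for replica_name, replica in sorted(shard["replicas"].items()):
            state = replica.get("state", "unknown")
            if state != "active":
                problems.append((shard_name, replica_name, state))
    return problems


def fetch_cluster_status(base_url, collection):
    """Fetch CLUSTERSTATUS for one collection (needs a running Solr node)."""
    query = urllib.parse.urlencode(
        {"action": "CLUSTERSTATUS", "collection": collection, "wt": "json"}
    )
    url = f"{base_url}/admin/collections?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Hand-written sample for illustration only, not captured from a real cluster.
SAMPLE = {
    "cluster": {
        "collections": {
            "ecommerce-events": {
                "shards": {
                    "shard1": {"replicas": {"core_node2": {"state": "active"}}},
                    "shard2": {"replicas": {"core_node4": {"state": "down"}}},
                }
            }
        }
    }
}

if __name__ == "__main__":
    # Against a live node one would instead call, for example:
    #   fetch_cluster_status("http://localhost:50000/solr", "ecommerce-events")
    for shard, replica, state in inactive_replicas(SAMPLE, "ecommerce-events"):
        print(f"{shard}/{replica} is {state}")
```

This only looks at the state each replica reports; a fuller check would
probably also compare against live_nodes and retry for a while, since
replicas legitimately spend some time recovering after a restart.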

On Tue, 17 Jan, 2023, 11:13 pm Michael Gibney (Jira), <ji...@apache.org>
wrote:

>
>     [
> https://issues.apache.org/jira/browse/SOLR-16622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677868#comment-17677868
> ]
>
> Michael Gibney commented on SOLR-16622:
> ---------------------------------------
>
> Thanks for this extra context, it's really helpful.
>
> {quote}this just shows that our testing is inadequate at the moment{quote}
>
> That makes sense broadly, IMO with some caveats (below). To state the
> obvious: these are basically integration tests, and by nature are going to
> be difficult to reproduce reliably, no matter how we proceed.
>
> On the one hand I agree it is fair to characterize this particular case as
> a functional regression -- on the other hand "our testing is inadequate"
> could easily be read as suggesting that existing unit tests and bats
> integration tests should do a better job of covering these types of issues,
> which I think would be misleading given the inherent challenges involved
> with regularly running integration tests. Really, the existing test suite
> is simply not designed to catch these kinds of "integration test" issues,
> and even "bats" integration tests would be difficult to adapt to serve the
> purpose of catching issues that only crop up when running at scale.
>
> "Straw man" argument: we could just lean in to periodic benchmarks helping
> to catch these types of issues. The overhead of running integration tests
> at scale would be significant. Even if the original intention of periodic
> benchmarks is to evaluate performance, it may be ok (not really a problem)
> that we end up catching some "integration test"-style issues as a
> consequence. (to be clear, I'm kinda just thinking out loud; neither
> assuming you agree nor disagree, Ishan!).
>
> > Replicas don't come up active after node restart
> > ------------------------------------------------
> >
> >                 Key: SOLR-16622
> >                 URL: https://issues.apache.org/jira/browse/SOLR-16622
> >             Project: Solr
> >          Issue Type: Bug
> >      Security Level: Public(Default Security Level. Issues are Public)
> >            Reporter: Ishan Chattopadhyaya
> >            Priority: Major
> >             Fix For: 9.1.1
> >
> >         Attachments: Screenshot from 2023-01-17 15-03-05.png
> >
> >
> > While benchmarking for performance, we saw a sharp change in the graphs:
> >
> > https://issues.apache.org/jira/browse/SOLR-16525?focusedCommentId=17676725&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17676725
> > Turns out there was a commit (SOLR-16414) that escaped all testing and
> > caused a regression where restarted nodes didn't have the replicas coming
> > up as active.
> > This affects the 9.1 release, so I'm opening a new JIRA issue to track it.
> > Here's how to reproduce it:
> > {code}
> > git clone https://github.com/fullstorydev/solr-bench
> > cd solr-bench
> > # prerequisites on ubuntu:
> > sudo apt install openjdk-11-jdk
> > sudo apt install wget unzip zip ant ivy lsof git netcat make maven jq
> > # this is a patch to comment out the cleanup/final shutdown
> > wget https://termbin.com/yuu95
> > git apply yuu95
> > mvn clean compile assembly:single
> > ./cleanup.sh && ./stress.sh -c aa4f3d98ab19c201e7f3c74cd14c99174148616d suites/stress-facets-local.json
> > {code}
> > If the 95th percentile is <10 or so, we have a problem. It should be >300
> > or so. Since we disabled cleanup, we can hit http://localhost:50000/solr/
> > to open the Solr UI. In this case, I see that querying the
> > ecommerce-events collection shows shard2 is down.
>
>
>