Posted to dev@lucene.apache.org by Jan Høydahl <ja...@cominvent.com> on 2021/06/01 08:04:18 UTC

Re: Doubling down on our mistakes?

To everyone following this e-mail thread.

The Project Management Committees have discussed the matter and would like to draw attention to "Statement from the Solr and Lucene PMC regarding recent Code of Conduct violations" posted to this list today, and linked below:

https://lists.apache.org/thread.html/r9875b53aeaebca8678ee0127562d8a35c7938906fbd318ac17ba011d%40%3Cdev.solr.apache.org%3E
Jan Høydahl
Solr PMC Chair

> On 21 May 2021, at 05:52, David Smiley <ds...@apache.org> wrote:
> 
> I removed dev@lucene.apache.org from my response here. Please everyone do the same and don't email both Lucene & Solr at the same time. I recall that's an old best practice / rule in general -- never address an email to more than one list.
> 
> I agree 100% with Erick. It's shameful, reflects badly on our community, and is just so unnecessary. It's a clear code-of-conduct violation. I hope Andrzej is "okay" emotionally; I'd be a mess in his shoes. At least the apologies are very reasonable to me; I was expecting Ishan/Noble to dig their heels in (as I witnessed some months ago) and I'm relieved not to see that.
> 
> The internal complexity of Solr (esp. SolrCloud) is very high; it's difficult to make changes and not have some worry that maybe a change has some ill effect. Yet we can't simply not touch it. The irony here is that the change in question was targeted directly at improving the quality of Solr; I love those types of changes, honestly.
> 
> Perhaps Solr getting its own Docker images as part of the project may lead to automated Solr-upgrade testing to catch compatibility bugs? Maybe that could be done in the K8S Solr Operator's integration tests, since I'm guessing the Operator facilitates upgrades already?
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
> 
> On Tue, May 18, 2021 at 8:54 AM Ishan Chattopadhyaya <ichattopadhyaya@gmail.com> wrote:
> I apologize for the harsh words, and personally to Andrzej for hurting your feelings. I had no such intentions. 
> 
> > You conveniently don’t mention that I WITHDREW my objection, and instead proposed a lenient validation (but validation nonetheless!).
> Yes, let me mention that you agreed in principle to reduce the impact of the change (even though not to revert it completely). I welcome that and thank you for it. By the time you replied on JIRA, I had already sent this mail.
> 
> > I see no urgency at all in this matter. This can be handled as day-to-day bug fixing as usual.
> I think this requires an immediate notification to all users so that they are aware of this situation before upgrading. Also, an immediate breakfix should be helpful for them.
> 
> > My feelings are hurt, and I'm greatly disappointed in your words, quick attacking off the cuff regularly rude (IMO) because you happened to have a bad day.
> I apologize.
> 
> How I saw things is that we have a commitment to our users to give them good quality software that they can rely on. My intention was not to attack Andrzej personally, but to bring about collective awareness regarding this problem: that we, as a community, don't care enough for our users. We need to get better at testing, get better at reviews, better at benchmarks, etc. Individually, we all have the best of intentions, and obviously so does Andrzej. However, we need to get better, and I wanted this to be a starting point in that conversation. Clearly, I got carried away, and I apologize for that.
> 
> On Tue, May 18, 2021 at 5:52 PM Andrzej Białecki <ab@getopt.org> wrote:
> Ishan, as I pointed out in Jira, I don’t care for you implying that I have evil intentions; I also resent your implication that I’m behaving irrationally or don’t care for the users. Those of you who are interested may read the comments in Jira and judge for yourselves.
> 
> You conveniently don’t mention that I WITHDREW my objection, and instead proposed a lenient validation (but validation nonetheless!). It’s easy to scream “revert! revert!” but it actually takes some consideration to properly address the original purpose of this change - that is, detecting and avoiding the corruption of replica state. Let’s focus on this and not on pointing fingers.
> 
> As for the production outage - I’m sorry this happened to you. As I hope you and Noble and others are sorry for other inadvertently introduced bugs, which I’m sure brought down many clusters at inconvenient hours... 
> 
> 
>> On 18 May 2021, at 13:26, Ishan Chattopadhyaya <ichattopadhyaya@gmail.com> wrote:
>> 
>> https://issues.apache.org/jira/browse/SOLR-14245
>> 
>> There was a production outage at odd hours at my (and Noble's) client, due to the above change, made by Andrzej Bialecki and present in Solr 8.5 onwards.
>> 
>> In short, there is some bug in Solr where a replica gets "null" as the node_name (upon invocation of a collection API command). On the rare occasions where we encountered such situations in the past, the replica would be unavailable and the system would work fine overall. However, this change (which introduces strict validation of errors while *reading* Replica objects) now means that if such a situation arises (where one of Solr's own APIs results in node_name being null in a state.json), all SolrJ clients and all Solr nodes will go for a toss (possibly crash, and not start back up).
>> 
>> This change was rushed in without any discussion or review, without extensive testing for the failures it would cause on existing systems where the cluster state is messed up but the system is running, and without any consideration for the impact on users.
>> 
>> Noble and I are of the opinion that this change should be reverted immediately, considering the impact to users. However, there is strong disagreement on Andrzej's part.
>> 
>> Mistakes happen, but doubling down on them irrationally [1] will destroy the reputation of the project, let alone the peace of mind of those who are running Solr in production.
>> 
>> Does someone have any thoughts or opinions?
>> 
>> [1] - https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758