You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@solr.apache.org by "stillalex (via GitHub)" <gi...@apache.org> on 2023/03/29 17:16:05 UTC

[GitHub] [solr] stillalex opened a new pull request, #1504: ShardSplitTest flakiness investigation

stillalex opened a new pull request, #1504:
URL: https://github.com/apache/solr/pull/1504

   https://issues.apache.org/jira/browse/SOLR-XXXXX
   
   Investigating ShardSplitTest flakiness
   
   ```
   org.apache.solr.cloud.api.collections.ShardSplitTest > test FAILED
       java.lang.NullPointerException
           at __randomizedtesting.SeedInfo.seed([EF8D371C1435BF25:67D908C6BAC9D2DD]:0)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.logDebugHelp(ShardSplitTest.java:1334)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:1191)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:920)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.test(ShardSplitTest.java:107)
           at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   ```
   Seems it's due to document _version_ field being null.
   
   So far
   - fixed NPE, made it clearer test is failing due to missing _version_ field
   - test still failing, needs more investigation
   
   How to reproduce (leaving this mainly for myself)
   
   ```
   ./gradlew beast -p solr/core --tests ShardSplitTest.test -Ptests.dups=100 -Ptests.nightly=true
   ```
   
   New exception trace after the current patch:
   
   ```
   org.apache.solr.cloud.api.collections.ShardSplitTest > test FAILED
       java.lang.AssertionError: doc 148 has null _version_ field
           at __randomizedtesting.SeedInfo.seed([E73E7B169FEF2B83:6F6A44CC3113467B]:0)
           at org.junit.Assert.fail(Assert.java:89)
           at org.junit.Assert.assertTrue(Assert.java:42)
           at org.junit.Assert.assertNotNull(Assert.java:713)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.logDebugHelp(ShardSplitTest.java:1335)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.checkDocCountsAndShardStates(ShardSplitTest.java:1191)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:920)
           at org.apache.solr.cloud.api.collections.ShardSplitTest.test(ShardSplitTest.java:107)
   ```
   
   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Solr:
   
   * https://issues.apache.org/jira/projects/SOLR
   
   For something minor (i.e. that wouldn't be worth putting in release notes), you can skip JIRA. 
   To create a Jira issue, you will need to create an account there first.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * SOLR-####: <short description of problem or changes>
   
   SOLR must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   Please provide a short description of the changes you're making with this pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request title.
   - [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Reference Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1503952520

   Status on current PR:
    - added the version check on additions to fail in case we are not leader and version = 0. (to match delete flows)
    - changed error status from BAD_REQUEST to INVALID_STATE to allow for retries. I was able to verify retries are happening [0]
    - removed a 'cmd' variable - this is just minor readability refactoring, I tried to avoid changing the code as much as possible
    - updated the ShardSplitTest to keep track of exceptions happening during the concurrent adds and deletes and fail if needed.
   
    Things I did not do
    - gave up on trying to use `isSubShardLeader ` to allow 'some' updates. left a comment above with the code snippet just in case anyone wants to review it 
    - noticed the setupRequest() method is usually called twice, I think this is easy to fix with a basic flag, I can add it if it doesn't grow the PR too much, or it can be done on a followup PR.
   
   Thank you @dsmiley for the continued reviews and pointers!
   
   
   [0]
   
   ```
   11027 INFO  (Thread-43) [] o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to (510) org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:62183/k_fff/wc/collection1: missing _version_ on update from leader, retry=0 maxRetries=5 commError=false errorCode=510 
   11027 WARN  (Thread-43) [] o.a.s.c.s.i.CloudSolrClient Re-trying request to collection(s) [collection1] after stale state error from server.
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1502421101

   > Where is this in the PR? I at least like the sound of it :-)
   
   you can find the logic here, basically not leader logic and zero version will trigger an exception: https://github.com/apache/solr/pull/1504/files#diff-cd52c0b0fc5671f2e4817e9d0c8373da4b848e6791df1ab73c08289078b07a30R289
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1175920825


##########
solr/CHANGES.txt:
##########
@@ -145,6 +145,8 @@ Bug Fixes
 
 * SOLR-16755: bin/solr's '-noprompt' option no longer works for examples (hossman, janhoy, Houston Putman)
 
+* SOLR-7609: Internal update requests should fail back to the client in some edge cases for shard splits. Use HTTP status 510 so the client can try the operation. (Alex Deparvu, David Smiley, Tomás Fernández Löbbe)

Review Comment:
   thanks @tflobbe I pushed the change manually, GH wouldn't allow it for some reason



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] tflobbe commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "tflobbe (via GitHub)" <gi...@apache.org>.
tflobbe commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1175896225


##########
solr/CHANGES.txt:
##########
@@ -145,6 +145,8 @@ Bug Fixes
 
 * SOLR-16755: bin/solr's '-noprompt' option no longer works for examples (hossman, janhoy, Houston Putman)
 
+* SOLR-7609: Internal update requests should fail back to the client in some edge cases for shard splits. Use HTTP status 510 so the client can try the operation. (Alex Deparvu, David Smiley, Tomás Fernández Löbbe)

Review Comment:
   ```suggestion
   * SOLR-7609: Internal update requests should fail back to the client in some edge cases for shard splits. Use HTTP status 510 so the client can retry the operation. (Alex Deparvu, David Smiley, Tomás Fernández Löbbe)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] dsmiley commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1162183932


##########
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java:
##########
@@ -331,7 +342,7 @@ protected boolean versionAdd(AddUpdateCommand cmd) throws IOException {
 
     boolean isReplayOrPeersync =
         (cmd.getFlags() & (UpdateCommand.REPLAY | UpdateCommand.PEER_SYNC)) != 0;
-    boolean leaderLogic = isLeader && !isReplayOrPeersync;
+    boolean leaderLogic = leaderLogicWithVersionIntegrityCheck(isReplayOrPeersync, versionOnUpdate);

Review Comment:
   My concern is better expressed as related to aspects of the contents of leaderLogicWithVersionIntegrityCheck().  I like that it throws an exception when versionOnUpdate = 0 when we expect it to not be 0.  But I don't like changing the criteria for "leaderLogic".



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] tflobbe commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "tflobbe (via GitHub)" <gi...@apache.org>.
tflobbe commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1175899567


##########
solr/CHANGES.txt:
##########
@@ -145,6 +145,8 @@ Bug Fixes
 
 * SOLR-16755: bin/solr's '-noprompt' option no longer works for examples (hossman, janhoy, Houston Putman)
 
+* SOLR-7609: Internal update requests should fail back to the client in some edge cases for shard splits. Use HTTP status 510 so the client can try the operation. (Alex Deparvu, David Smiley, Tomás Fernández Löbbe)

Review Comment:
   ```suggestion
   * SOLR-7609: Internal update requests should fail back to the client in some edge cases for shard splits. Use HTTP status 510 so the client can retry the operation. (Alex Deparvu, David Smiley, Tomás Fernández Löbbe)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1163180972


##########
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java:
##########
@@ -331,7 +342,7 @@ protected boolean versionAdd(AddUpdateCommand cmd) throws IOException {
 
     boolean isReplayOrPeersync =
         (cmd.getFlags() & (UpdateCommand.REPLAY | UpdateCommand.PEER_SYNC)) != 0;
-    boolean leaderLogic = isLeader && !isReplayOrPeersync;
+    boolean leaderLogic = leaderLogicWithVersionIntegrityCheck(isReplayOrPeersync, versionOnUpdate);

Review Comment:
   I am not as confident about the above anymore looking at the error traces. not sure how, but the test is using CloudSolrClient for some of the requests, I am seeing the following in the logs
   ```
     2> 11376 INFO  (Thread-43) [] o.a.s.c.s.i.CloudSolrClient Request to collection [collection1] failed due to (400) org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:59362/collection1: missing _version_ on update from leader, retry=0 maxRetries=5 commError=false errorCode=400 
   ```
   
   I am considering changing the response code from BAD_REQUEST to INVALID_STATE to allow for a retry (same as #1484). thanks for the hint @dsmiley, it didn't click when you mentioned it earlier (what was missing was the extra check).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1518101663

   > That is very similar to my other PR that does the same thing for a similar case. I might just list both together, which we do sometimes.
   
   The CHANGES.txt proposal sounds good to me, and sure if you want to merge both PRs at the same time with an adjacent listing I am fine with that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1518426686

   I spent some more time on this class and have identified 2 problematic things I wanted to share:
    * this looks like a NPE waiting to happen https://github.com/apache/solr/blob/db4cb66271f615da6a0a3ae6fed5fb2e184fd053/solr/core/src/java/org/apache/solr/update/processor/DistributedZkUpdateProcessor.java#L889 I believe the intent could have been to check `docCollection` for null instead of `collection`?
    * all over the class there is a pattern of checking read only status to prevent some operations I believe could be broken.
   ```
   clusterState = zkController.getClusterState();
   if (isReadOnly()) {
     throw new SolrException(ErrorCode.FORBIDDEN, "Collection " + collection + " is read-only.");
   }
   ```
   refreshing the `clusterState` is insufficient, because the `isReadOnly` is based on the `readOnlyCollection` flag that is only initialized at the beginning. if the intent was to have a fresh check, the `readOnlyCollection` flag needs to be updated too, based on the new `clusterState`. please correct if I am reading this incorrectly.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1500709938

   I have spent some more time unpacking this test.
   
   Test setup is: shard `shard1` is split into 2 new slices `shard1_0` and `shard1_1`, concurrently there are additions happening.
   
   The failure scenario context is:
    - the test client makes direct 'addDoc' calls to the new slice `shard1_1` that is in the process of being created. this means the `DistribPhase` is `NONE`
    - target slice (coll.getRouter().getTargetSlice) for the request is `shard1`
    - `isLeader` is `false` on the new slice `shard1_1`. the old shard is still considered the leader replica at this point.
    - the new slice has `RECOVERY` state. also because target slice is `shard1` the failing doc must be inside the correct slice interval (`isTargetSlice` needs to be true). this makes the `isSubShardLeader` `true`
   
   Because the `isLeader` flag is `false`, the `DistributedUpdateProcessor` uses whatever version it can extract from the request. Now _if_ this was a leader originated request (`phase=FROMLEADER`) there would have been a version available, but it is not, so version is `0`. 
   
   there is no integrity check in place to verify the version is > 0 (I added one with this PR). but we can do better, we can avoid failing the integrity check if we know the `isSubShardLeader` flag is `true` so we can apply a 'leader logic' strategy and generate a new version before saving the new document to the index.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1496270123

   @dsmiley thanks for taking a look. I think we are looking at the same problem area but different lifecycles.
   
   Your summary on #1484 applies here as well:
   ```
   A shard being split (a so-called parent shard) or that which recently completed (thus may have state INACTIVE) receives docs from a client (the test) and forwards to the sub-shards. 
   ```
   the difference is you are seeing `ClusterState says we are the leader` while (one of) my errors are `Request says it is coming from parent shard leader but we are in active state`. I think you are seeing a pre-success shard split race window, while I am seeing a post-success shard split. I am wondering if my fix will not cover your case as well. I am attempting to use the `isSubShardLeader` to decide if locally we could apply the leader behavior instead of the follower behavior. it seems to work well for the failures I am looking at.
   
   It would be really nice to have a mapping between all possible state transitions, and build a unit-test-like verification for all transition states verifying if all cases are handled and no illegal state can sneak in (like persisting a null-version document).
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1492441639

   I have pushed a tentative patch for review and more importantly for discussion.
   
   I have ran this test extensively (hundreds of beast jobs) and it looks like this is the way to go, all 'add' exceptions are gone once the isLeader flag is updated. BUT I am not completely convinced this is the correct patch. possibly something could be made better in the DistributedZkUpdateProcessor area instead (where a lot of leader calculations are being made but the result is not correct in this edge case).
   While I am still evaluating a patch for DistributedZkUpdateProcessor, I would love some feedback on the observations so far.
   
   Together with the patch I have added a check which will not allow additions with no version. this is now consistent with the delete flows. although I must admit I am confused by the fact that this check was missing on this execution path, while the delete path had this from at least 8 years ago.
   
   Other points
    - in the current test, there are a few operations (add and deletes) that run parallel to the split operation. these can fail and this is where the bug is. because the exceptions are not correctly accounted for, the NPE popped up at a weird place. I added tracking and assertions on this part too (except for deletes, see below)
    - deletes can also cause exceptions, I don't have a good handle on how to fix this, so I removed the strict error check. this can be added once the confidence level is higher and this flow is accounted for. until then I will focus on the addition parts.
    - DistributedZkUpdateProcessor will call setupRequest twice. this is probably a minor nuisance for some future improvement PR.
    - finding the code very hard to read I took the decision to remove one variable that is not really needed (the `updateCommand`), this change is purely cosmetic and I can revert it if it adds too much noise to the current patch.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1492480511

   seeing some local failures due to the newly added 0 version check
   
   ```
   org.apache.solr.search.TestRecovery > testLogReplayWithReorderedDBQByAsterixAndChildDocs FAILED
       java.lang.Exception: org.apache.solr.common.SolrException: missing _version_ on update from leader
           at __randomizedtesting.SeedInfo.seed([FE85400FD82E3172:9A3E186DD2648A17]:0)
           at org.apache.solr.search.TestRecovery.testLogReplayWithReorderedDBQWrapper(TestRecovery.java:616)
           at org.apache.solr.search.TestRecovery.testLogReplayWithReorderedDBQByAsterixAndChildDocs(TestRecovery.java:430)
           at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
           at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.base/java.lang.reflect.Method.invoke(Method.java:568)
           at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
           at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
           at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
           at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
           at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:80)
           at org.junit.rules.RunRules.evaluate(RunRules.java:20)
           at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
           at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
           at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
           at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
           at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
           at org.junit.rules.RunRules.evaluate(RunRules.java:20)
           at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
           at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
           at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
           at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
           at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
           at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
           at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
           at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
           at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
           at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
           at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
           at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
           at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:80)
           at org.junit.rules.RunRules.evaluate(RunRules.java:20)
           at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
           at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
           at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
           at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
           at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
           at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
           at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
           at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
           at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
           at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
           at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
           at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
           at org.junit.rules.RunRules.evaluate(RunRules.java:20)
           at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
           at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
           at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
           at java.base/java.lang.Thread.run(Thread.java:833)
   
           Caused by:
           org.apache.solr.common.SolrException: missing _version_ on update from leader
               at app//org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:334)
               at app//org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:234)
               at app//org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:111)
               at app//org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:554)
               at app//org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:185)
               at app//org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:151)
               at app//org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:86)
               at app//org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:101)
               at app//org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:84)
               at app//org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224)
               at app//org.apache.solr.core.SolrCore.execute(SolrCore.java:2892)
               at app//org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:112)
               at app//org.apache.solr.SolrTestCaseJ4.updateJ(SolrTestCaseJ4.java:1425)
               at app//org.apache.solr.search.TestRecovery.lambda$testLogReplayWithReorderedDBQByAsterixAndChildDocs$8(TestRecovery.java:439)
               at app//org.apache.solr.search.TestRecovery.testLogReplayWithReorderedDBQWrapper(TestRecovery.java:593)
               ... 48 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] tflobbe commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "tflobbe (via GitHub)" <gi...@apache.org>.
tflobbe commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1175896225


##########
solr/CHANGES.txt:
##########
@@ -145,6 +145,8 @@ Bug Fixes
 
 * SOLR-16755: bin/solr's '-noprompt' option no longer works for examples (hossman, janhoy, Houston Putman)
 
+* SOLR-7609: Internal update requests should fail back to the client in some edge cases for shard splits. Use HTTP status 510 so the client can try the operation. (Alex Deparvu, David Smiley, Tomás Fernández Löbbe)

Review Comment:
   ```suggestion
   * SOLR-7609: Internal update requests should fail back to the client in some edge cases for shard splits. Use HTTP status 510 so the client can retry the operation. (Alex Deparvu, David Smiley, Tomás Fernández Löbbe)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] dsmiley commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1162894003


##########
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java:
##########
@@ -331,7 +342,7 @@ protected boolean versionAdd(AddUpdateCommand cmd) throws IOException {
 
     boolean isReplayOrPeersync =
         (cmd.getFlags() & (UpdateCommand.REPLAY | UpdateCommand.PEER_SYNC)) != 0;
-    boolean leaderLogic = isLeader && !isReplayOrPeersync;
+    boolean leaderLogic = leaderLogicWithVersionIntegrityCheck(isReplayOrPeersync, versionOnUpdate);

Review Comment:
   I don't see ShardSplitTest _indexing_ docs with an HttpClient.  Except for the "controlClient" maybe, which probably doesn't count but I didn't look closer at that aspect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] tflobbe merged pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "tflobbe (via GitHub)" <gi...@apache.org>.
tflobbe merged PR #1504:
URL: https://github.com/apache/solr/pull/1504


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] tflobbe commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "tflobbe (via GitHub)" <gi...@apache.org>.
tflobbe commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1520829973

   > I believe the intent could have been to check docCollection for null instead of collection?
   
   +1, this seems like a bug.
   
   > all over the class there is a pattern of checking read only status to prevent some operations I believe could be broken.
   
   I'd have to look this a bit more, but you are probably right. Probably deserves it's own PR?
   
   Let's merge this PR. @stillalex, can you push the `CHANGES.txt`? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: ShardSplitTest flakiness investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1489101406

   I want to say this seems familiar [SOLR-8122](https://issues.apache.org/jira/browse/SOLR-8122) but it's pretty dated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: ShardSplitTest flakiness investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1491075979

   minor update. it seems that the 'delete' flow has some version checks in place https://github.com/apache/solr/blob/288e5123dec8f604fe57a9cb13e9770001c0ed57/solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java#L1079, while the add flow is allowing the no-version save. an equivalent check could be added around this line  https://github.com/apache/solr/blob/288e5123dec8f604fe57a9cb13e9770001c0ed57/solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java#L454
   at this point `versionOnUpdate` is zero because I think the update is coming from the leader and the current shard split piece is still in a weird state.
   
   I actually caught this flow in one run (the line numbers might not match due to some local logs I added)
   ```
     2> 12044 ERROR (qtp2034856562-109) [n:127.0.0.1:65414_l_lbs%2Fix c:collection1 s:shard1_1 r:core_node12 x:collection1_shard1_1_replica_n10] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: missing _version_ on update from leader
     2>           => org.apache.solr.common.SolrException: missing _version_ on update from leader
     2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionDelete(DistributedUpdateProcessor.java:1105)
     2> org.apache.solr.common.SolrException: missing _version_ on update from leader
     2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionDelete(DistributedUpdateProcessor.java:1105) ~[main/:?]
     2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDeleteById(DistributedUpdateProcessor.java:867) ~[main/:?]
     2>    at org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDeleteById(DistributedZkUpdateProcessor.java:376) ~[main/:?]
     2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.processDelete(DistributedUpdateProcessor.java:853) ~[main/:?]
     2>    at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processDelete(DistributedZkUpdateProcessor.java:353) ~[main/:?]
     2>    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processDelete(LogUpdateProcessorFactory.java:134) ~[main/:?]
     2>    at org.apache.solr.handler.loader.JavabinLoader.delete(JavabinLoader.java:222) ~[main/:?]
     2>    at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:140) ~[main/:?]
     2>    at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:74) ~[main/:?]
     2>    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:101) ~[main/:?]
     2>    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:84) ~[main/:?]
     2>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:224) [main/:?]
     2>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2892) [main/:?]
     2>    at org.apache.solr.servlet.HttpSolrCall.executeCoreRequest(HttpSolrCall.java:866) [main/:?]
     2>    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:562) [main/:?]
     2>    at org.apache.solr.servlet.SolrDispatchFilter.dispatch(SolrDispatchFilter.java:252) [main/:?]
     2>    at org.apache.solr.servlet.SolrDispatchFilter.lambda$doFilter$0(SolrDispatchFilter.java:220) [main/:?]
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] dsmiley commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1162098045


##########
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java:
##########
@@ -331,7 +342,7 @@ protected boolean versionAdd(AddUpdateCommand cmd) throws IOException {
 
     boolean isReplayOrPeersync =
         (cmd.getFlags() & (UpdateCommand.REPLAY | UpdateCommand.PEER_SYNC)) != 0;
-    boolean leaderLogic = isLeader && !isReplayOrPeersync;
+    boolean leaderLogic = leaderLogicWithVersionIntegrityCheck(isReplayOrPeersync, versionOnUpdate);

Review Comment:
   I don't like this change (thus don't like leaderLogicWithVersionIntegrityCheck).  If there is some race condition causing the client to route to us (a new sub-shard) before we know that we are ready to accept it, I'd rather not pretend we are ready in any way.  Maybe throw an exception like INVALID_STATE that should cause CloudSolrClient to retry.  Our state will update eventually.



##########
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java:
##########
@@ -282,6 +280,19 @@ public static int bucketHash(BytesRef idBytes) {
     return Hash.murmurhash3_x86_32(idBytes.bytes, idBytes.offset, idBytes.length, 0);
   }
 
+  private boolean leaderLogicWithVersionIntegrityCheck(

Review Comment:
   place methods after where they are called, not before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: ShardSplitTest flakiness investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1492285957

   There is another weird error I saw while leaving this to run overnight. there are possibly more races, I will focus my proposal to the most common one, the doc addition (will be posting an update shortly).
   
   ```
      >     java.lang.AssertionError: Errors present while concurrently adding and removing docs [Error from server at http://127.0.0.1:55407/gne/mj/collection1_shard1_replica_n1: Async exception during distributed update: Error from server at http://127.0.0.1:55407/gne/mj/collection1_shard1_0_replica_n9/update?update.distrib=FROMLEADER&distrib.from=http%3A%2F%2F127.0.0.1%3A55407%2Fgne%2Fmj%2Fcollection1_shard1_replica_n1%2F&distrib.from.parent=shard1&wt=javabin&version=2: Request says it is coming from parent shard leader but we are in active state]
      >         at __randomizedtesting.SeedInfo.seed([2BF99011CE0D3FCF:A3ADAFCB60F15237]:0)
      >         at org.junit.Assert.fail(Assert.java:89)
      >         at org.junit.Assert.assertTrue(Assert.java:42)
      >         at org.apache.solr.cloud.api.collections.ShardSplitTest.splitByUniqueKeyTest(ShardSplitTest.java:939)
      >         at org.apache.solr.cloud.api.collections.ShardSplitTest.test(ShardSplitTest.java:110)
      >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      >         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
      >         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      >         at java.base/java.lang.reflect.Method.invoke(Method.java:568)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
      >         at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsFixedStatement.callStatement(BaseDistributedSearchTestCase.java:1164)
      >         at org.apache.solr.BaseDistributedSearchTestCase$ShardsRepeatRule$ShardsStatement.evaluate(BaseDistributedSearchTestCase.java:1135)
      >         at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:80)
      >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
      >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
      >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
      >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
      >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
      >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
      >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
      >         at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
      >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:80)
      >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
      >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
      >         at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
      >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
      >         at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
      >         at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
      >         at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
      >         at org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
      >         at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
      >         at java.base/java.lang.Thread.run(Thread.java:833)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] dsmiley commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1495363648

   I'm glad you are digging into this test failure!  You've left lots of breadcrumbs for me to follow.
   I like adding more checks to insist there is a version if we're forwarded an update from a leader.  If some code paths don't verify it, it was likely forgotten.  It's not clear to me if you've root caused why it can be null.
   
   There may be a relation to #1484 which I have yet to merge, as I'm trying to be real thorough first.  I thought maybe the change there might fix this one but no it doesn't.
   
   > there is a delete that is reported successful, but not applied
   
   Clearly a problem.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1163093972


##########
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java:
##########
@@ -331,7 +342,7 @@ protected boolean versionAdd(AddUpdateCommand cmd) throws IOException {
 
     boolean isReplayOrPeersync =
         (cmd.getFlags() & (UpdateCommand.REPLAY | UpdateCommand.PEER_SYNC)) != 0;
-    boolean leaderLogic = isLeader && !isReplayOrPeersync;
+    boolean leaderLogic = leaderLogicWithVersionIntegrityCheck(isReplayOrPeersync, versionOnUpdate);

Review Comment:
   BaseDistributedSearchTestCase#clientFor uses the `clients` list that is populated via [SolrTestCaseJ4#getHttpSolrClient](https://github.com/apache/solr/blob/3d02801a565c1083ad899c1cb4b7bcb21754fe04/solr/test-framework/src/java/org/apache/solr/SolrTestCaseJ4.java#L2642)
   
   ```
       return new HttpSolrClient.Builder(url).build();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on a diff in pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on code in PR #1504:
URL: https://github.com/apache/solr/pull/1504#discussion_r1162791455


##########
solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java:
##########
@@ -331,7 +342,7 @@ protected boolean versionAdd(AddUpdateCommand cmd) throws IOException {
 
     boolean isReplayOrPeersync =
         (cmd.getFlags() & (UpdateCommand.REPLAY | UpdateCommand.PEER_SYNC)) != 0;
-    boolean leaderLogic = isLeader && !isReplayOrPeersync;
+    boolean leaderLogic = leaderLogicWithVersionIntegrityCheck(isReplayOrPeersync, versionOnUpdate);

Review Comment:
   > f there is some race condition causing the client to route to us (a new sub-shard) before we know that we are ready to accept it, I'd rather not pretend we are ready in any way.
   
   this is what I am struggling with. I'm not convinced this can be called a race condition on the client side. the test currently picks an http client and issues a request. this can land on a recovering shard split. I am ok with only having the exception in place (no subShardLeader funny business), but then the test's request will fail, who needs to be responsible for the retry?
   
   > Maybe throw an exception like INVALID_STATE that should cause CloudSolrClient to retry
   
   I agree with this idea, but I think it also hints at replacing existing test clients with CloudSolrClient, the existing HttpSolrClient does not perform retries.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1503929909

   Leaving this here for future reference. I think we could consider allowing doc updates based on the `isSubShardLeader` but this is tricky to verify, and well beyond my knowledge of this code. so I will remove this change from my current PR (this split shard test was passing with the change over a lot of repetitions for additions, but deletes were not applied correctly).
   
   ```
     private boolean leaderLogicWithVersionIntegrityCheck(
         boolean isReplayOrPeersync, long versionOnUpdate) {
       boolean leaderLogic = isLeader && !isReplayOrPeersync;
       if (!leaderLogic && versionOnUpdate == 0) {
         // refreshing leaderLogic status in case this is a race (see SOLR-7609)
         leaderLogic = isSubShardLeader && !isReplayOrPeersync;
         if (!leaderLogic) {
           throw new SolrException(ErrorCode.BAD_REQUEST, "missing _version_ on update from leader");
         }
       }
       return leaderLogic;
     }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1493173156

   updated the patch with a better fix. I'm now using the `isSubShardLeader` flag which seems to have been introduced for the split shard scenario (not sure). will be running some more tests to determine what other errors I am seeing.
   
   re. deletes. I have tried to enable the delete flow as well but was not successful. I left a few TODOs in the code where this is needed. in short: currently the concurrent deletes will be rejected (which is not ideal), this change could make the check less strict by looking at the `isSubShardLeader` (basically using the same code for all flows). trying this approach makes the test fail because there is a delete that is reported successful, but not applied, so the integrity check at the end fails because there are too many documents found.
   
   I think I would prefer to stabilize the 'add doc' flow and then open another PR for the deletes investigation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] dsmiley commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "dsmiley (via GitHub)" <gi...@apache.org>.
dsmiley commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1517240922

   I appreciate the bulleted summary of your last comment very much.  It will make a good commit message.   I think this PR is in good shape.
   
   Proposed CHANGES.txt _(less sure on this)_:
   In maybe BUG category, if not IMPROVEMENT:
   * SOLR-7609: Internal update requests should fail back to the client (who can retry) in some edge cases for shard splits. Use HTTP status 510.  (your name here)
   
   That is *very* similar to my other PR that does the same thing for a similar case.  I might just list both together, which we do sometimes.
   
   _(disclaimer: I'm on vacation)_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org


[GitHub] [solr] stillalex commented on pull request #1504: SOLR-7609 ShardSplitTest NPE investigation

Posted by "stillalex (via GitHub)" <gi...@apache.org>.
stillalex commented on PR #1504:
URL: https://github.com/apache/solr/pull/1504#issuecomment-1520926805

   @tflobbe CHANGES file updated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org