Posted to issues@lucene.apache.org by "Hoss Man (Jira)" <ji...@apache.org> on 2019/09/19 23:45:00 UTC
[jira] [Created] (SOLR-13781) TestContainerReqHandler.testPackageAPI failures imply race condition between update-package and delete-requesthandler

Hoss Man created SOLR-13781:
-------------------------------

             Summary: TestContainerReqHandler.testPackageAPI failures imply race condition between update-package and delete-requesthandler
                 Key: SOLR-13781
                 URL: https://issues.apache.org/jira/browse/SOLR-13781
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Hoss Man


We're seeing roughly an 8% failure rate from {{TestContainerReqHandler.testPackageAPI}}, with failures occurring on both master and branch_8x, across various Jenkins servers and OSes.

All of the failures occur at the same place: a V2 request to {{/node/ext}} to verify that the {{requestHandler}} List is empty after issuing a {{delete-requesthandler: 'bar'}} payload to the {{/cluster}} API. The logs and failure message indicate that the {{'bar'}} request handler still exists, even though the assertion does a "sleep/retry" of the verification query 10 times.

While I don't fully understand this test, or the underlying code being tested, I spent a little time digging into the logs from some of these Jenkins failures and comparing them to the logs generated by a successful local run. I think what's happening here - and the reason that {{delete-requesthandler}} seems to "fail" frequently in this test method, but not in {{testSetClusterReqHandler}} - is that the prior {{update-package}} command is still in progress.

After the test code runs an {{update-package}} command, it executes requests against {{/node/ext/bar}} to verify that the {{version}} has changed as a result of updating the package. I suspect this only checks the _metadata_ changed by the {{update-package}} command and does not actually ensure that the request handler has fully loaded: the logs from failing runs seem to show that the zkCallback threads kicked off by the {{update-package}} command are still running while the zkCallback threads kicked off by the subsequent {{delete-requesthandler}} command run, and that they finish *after* them, "re-registering" the handler that was just deleted.
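
To make the suspected interleaving concrete, here is a minimal, self-contained sketch of it. Nothing in it is Solr code: the {{handlers}} map, the two threads, and the latch are all hypothetical stand-ins, and the latch simply forces the ordering the logs seem to show (the update callback finishing after the delete callback).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class StaleCallbackRace {
    // Hypothetical stand-in for a node's request handler registry (not a Solr class).
    static final Map<String, String> handlers = new ConcurrentHashMap<>();

    /** Replays the suspected ordering; returns true if 'bar' survives the delete. */
    static boolean simulate() {
        handlers.clear();
        handlers.put("bar", "v1");
        CountDownLatch deleteDone = new CountDownLatch(1);

        // Models the callback triggered by update-package: slow (class loading),
        // so it only registers the new handler after the delete has completed.
        Thread updateCallback = new Thread(() -> {
            try {
                deleteDone.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            handlers.put("bar", "v2"); // "re-registers" the deleted handler
        });

        // Models the callback triggered by delete-requesthandler: fast.
        Thread deleteCallback = new Thread(() -> {
            handlers.remove("bar");
            deleteDone.countDown();
        });

        updateCallback.start();
        deleteCallback.start();
        try {
            updateCallback.join();
            deleteCallback.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return handlers.containsKey("bar");
    }

    public static void main(String[] args) {
        System.out.println("'bar' still registered after delete: " + simulate());
    }
}
```

With this ordering the handler is still present after the delete, no matter how many times a verification query is retried afterwards, which would match the failure mode described above.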
----
It's not 100% clear to me whether this is _just_ a test bug - in which case the test should be monitoring something else to know when the request handler has finished loading - or whether it indicates a broader flaw in the design of how commands like {{add-package}} / {{update-package}} / {{add-requesthandler}} / {{delete-requesthandler}} should interact if/when they occur in close temporal proximity.

(i.e.: if there are zkCallback watchers loading classes and initializing objects as a result of cluster property changes, shouldn't there be some sort of linearization/synchronization logic to ensure that they get executed in the same order on all the nodes in the cluster?)
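
One possible shape for such linearization, purely as an illustrative sketch and not a description of how Solr behaves or should be fixed: each callback carries the version of the cluster state it observed, and the registry refuses to apply anything older than what it has already applied. The class name, the {{apply}} method, and the version numbers 7 and 8 below are all hypothetical, and the sketch assumes each callback holds a full snapshot of the desired handlers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class VersionedHandlerRegistry {
    // Hypothetical registry; handler name -> package version it was loaded from.
    private final Map<String, String> handlers = new HashMap<>();
    private int lastAppliedVersion = -1;

    /**
     * Installs the desired handler set only if it comes from a strictly newer
     * cluster-state version than the last one applied. Returns true if applied.
     */
    public synchronized boolean apply(int version, Map<String, String> desiredHandlers) {
        if (version <= lastAppliedVersion) {
            return false; // stale callback: a newer state has already been applied
        }
        handlers.clear();
        handlers.putAll(desiredHandlers);
        lastAppliedVersion = version;
        return true;
    }

    public synchronized boolean hasHandler(String name) {
        return handlers.containsKey(name);
    }

    public static void main(String[] args) {
        VersionedHandlerRegistry registry = new VersionedHandlerRegistry();
        // The fast callback for the delete (assumed version 8) finishes first...
        registry.apply(8, Collections.<String, String>emptyMap());
        // ...then the slow callback for the earlier update (assumed version 7) finishes.
        boolean applied = registry.apply(7, Collections.singletonMap("bar", "v2"));
        System.out.println("stale applied: " + applied
                + ", bar present: " + registry.hasHandler("bar"));
    }
}
```

Because every node would compare against the same version numbers, a late-finishing callback could not resurrect a handler that a newer state removed, regardless of thread scheduling on that node. Whether version information of this kind is actually available to the callbacks in question is an open assumption.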
----
More detail and log file attachments to follow...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
