You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Hoss Man (Jira)" <ji...@apache.org> on 2019/09/19 23:47:00 UTC

[jira] [Updated] (SOLR-13781) TestContainerReqHandler.testPackageAPI failures imply race condition between update-package and delete-requesthandler

     [ https://issues.apache.org/jira/browse/SOLR-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-13781:
----------------------------
    Attachment: egrep-out.local.log.txt
                egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt
                apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt
                local.log.txt
        Status: Open  (was: Open)

I'm attaching 4 files...
 * apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt
 ** the section of logs from the apache/Lucene-Solr-Tests-8.x/587 jenkins build that includes the FAIL of testPackageAPI
 * local.log.txt
 ** logs from my local PASS run of testPackageAPI using the same git SHA and reproduce line
 * egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt
 * egrep-out.local.log.txt

...the last 2 being the output of running the following grep command on the first 2...
{noformat}
egrep 'Loaded cluster properties|MemClassLoader|PackageManager RequestHandlers being reloaded|PackageManager Registering request handlers|webapp=null path=/node/ext|webapp=null path=/cluster'
{noformat}
...in order to extract the lines that seem the most important to notice for comparison.

The key bit i'd like to point out is this section of egrep-out.local.log.txt, which begins with the last request client request of {{/node/ext/bar}} to verify it exists, and shows the initial log messages resuting from it's subsequent deletion...
{noformat}
6354 INFO  (qtp822543449-83) [n:127.0.0.1:36249_solr     ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/node/ext/bar params={wt=javabin&version=2} status=0 QTime=1
6373 DEBUG (zkCallback-49-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6373 DEBUG (zkCallback-64-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6373 DEBUG (zkCallback-79-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6373 DEBUG (zkCallback-70-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6374 DEBUG (zkCallback-63-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6374 DEBUG (zkCallback-72-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6375 DEBUG (zkCallback-88-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6375 DEBUG (zkCallback-53-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6378 DEBUG (zkCallback-45-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:40725/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
6380 DEBUG (zkCallback-72-thread-1) [     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
6381 DEBUG (zkCallback-72-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [] 
6381 DEBUG (zkCallback-49-thread-1) [     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
6381 DEBUG (zkCallback-49-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [] 
6386 DEBUG (zkCallback-53-thread-1) [     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
6387 DEBUG (zkCallback-53-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [] 
6388 DEBUG (zkCallback-45-thread-1) [     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
6388 DEBUG (zkCallback-45-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [] 
... 
{noformat}
 * There are 9 zkCallback threads that detect the new cluster properties indicating an empty list of requestHandler's
 * 4 of these zkCallback threads (corrisponding to the 4 nodes in the MiniSolrCloudCluster) invoke the PackageManager (the others are evidently unrelated cluster property watchers)
 * These 4 zkCallback threads all agree that no request handlers need reloaded, and no request handlers need registered.

Compare that with the same section of logs from egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt ...
{noformat}
3532248 INFO  (qtp1010395414-41961) [n:127.0.0.1:42699_solr     ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/node/ext/bar params={wt=javabin&version=2} status=0 QTime=0
3532251 DEBUG (zkCallback-14907-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532251 DEBUG (zkCallback-14919-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532251 DEBUG (zkCallback-14941-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532251 DEBUG (zkCallback-14935-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532263 DEBUG (zkCallback-14892-thread-2) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532263 DEBUG (zkCallback-14896-thread-2) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532264 DEBUG (zkCallback-14896-thread-2) [     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
3532264 DEBUG (zkCallback-14896-thread-2) [     ] o.a.s.c.PackageManager Registering request handlers [] 
3532265 DEBUG (zkCallback-14898-thread-2) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532265 DEBUG (zkCallback-14912-thread-1) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532266 DEBUG (qtp1010395414-41962) [n:127.0.0.1:42699_solr     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
3532266 DEBUG (qtp1010395414-41962) [n:127.0.0.1:42699_solr     ] o.a.s.c.PackageManager Registering request handlers [] 
3532269 DEBUG (zkCallback-14927-thread-2) [     ] o.a.s.c.c.ZkStateReader Loaded cluster properties: {package={global={name=global, url=http://localhost:43314/jar3.jar, sha256=20e0bfaec71b2e93c4da9f2ed3745dda04dc3fc915b66cc0275863982e73b2a3}}, requestHandler={}} to version 3
3532321 DEBUG (zkCallback-14892-thread-2) [     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
3532321 DEBUG (zkCallback-14892-thread-2) [     ] o.a.s.c.PackageManager Registering request handlers [] 
3532340 DEBUG (zkCallback-14898-thread-2) [     ] o.a.s.c.PackageManager RequestHandlers being reloaded : []
3532340 DEBUG (zkCallback-14898-thread-2) [     ] o.a.s.c.PackageManager Registering request handlers [] 
3532371 INFO  (zkCallback-14898-thread-1) [     ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 
3532372 INFO  (zkCallback-14896-thread-1) [     ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 
3532372 INFO  (zkCallback-14927-thread-1) [     ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 
3532372 DEBUG (zkCallback-14927-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [bar] 
3532373 DEBUG (zkCallback-14896-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [bar] 
3532389 DEBUG (zkCallback-14898-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [bar] 
3532390 INFO  (zkCallback-14892-thread-1) [     ] o.a.s.c.MemClassLoader Defining_class org.apache.solr.core.RuntimeLibReqHandler from runtime jar global 
3532390 DEBUG (zkCallback-14892-thread-1) [     ] o.a.s.c.PackageManager Registering request handlers [bar] 
{noformat}
 * Again, we see 9 zkCallback threads that detect the new cluster properties indicating an empty list of requestHandler's
 * Again, 4 of these zkCallback threads invoke the PackageManager
 * Again, these 4 zkCallback threads all agree that no request handlers need reloaded, and no request handlers need registered.

... BUT! ...
 * the naming of these 4 zkCallback threads indicate they are the *2nd* thread created from their respective threadpools
 * subsequent to them finishing, a similarly named *1st* thread from each of those 4 zkCallback pools calls MemClassLoader to Defining_class & PackageManager to register the 'bar' request handler
 * if we search backwards in the log for these threads, it seems that these are still left over (and still processing) the cluster property changed events that occured as a result of the {{update-package}} command ... they didn't finish their "Defining_class" & "Registering request handlers" execution by the time the test executed the {{delete-requesthandler}}

(FWIW: i'm also a little perplexed why the logs in the failure case show the jetty request handling thread {{qtp1010395414-41962}} currently processing the {{/cluster}} request *also* calling PackageManager directly in addition to whichever zkCallback thread corrisponds to that node)

> TestContainerReqHandler.testPackageAPI failures imply race condition between update-package and delete-requesthandler
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13781
>                 URL: https://issues.apache.org/jira/browse/SOLR-13781
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt, egrep-out.apache_Lucene-Solr-Tests-8.x_587.testPackageAPI.txt, egrep-out.local.log.txt, local.log.txt
>
>
> We're seeing roughly an 8% failure rate from {{TestContainerReqHandler.testPackageAPI}} with failures occuring on both master and branch_8x, and on various jenkins servers and various OSes.
> All of the failures occur at the same place: A V2 request to {{/node/ext}} to verify that that the {{requestHandler}} List is empty after issuing a {{delete-requesthandler: 'bar'}} payload to the {{/cluster}} API. The logs and failure message indicate that the {{'bar'}} request handler still exists even the assertion does a "sleep/retry" of the verification query 10 times.
> While i don't fully understand this test, or the underlying code being tested, i spent a little time digging into the logs from some of these jenkins failures, and comparing them to the logs i see generated when i get a successful test run locally, and I think what's happening here - and the reason that {{delete-requesthandler}} seems to "fail" frequently in this test method, but not in {{testSetClusterReqHandler}} - is because the prior {{update-package}} command is still in process.
> After the test code runs an {{update-package}} command, the test executes requests against {{/node/ext/bar}} to verify that the {{version}} has changed as a result of updating the package, but i suspect this is only looking at the _metadata_ that has changed as a result of the {{update-package}} command and not actaully ensuring that the request handler has fully loaded - because the logs when this test fails seem to show that the zkCallback threads kicked off by {{update-package}} command are still running when the zkCallback threads kicked off by the subsequent {{delete-requesthandler}} command are running, and finish *after* them, "re-registering" the handler that was just deleted.
> ----
> It's not 100% clear to me if this is _just_ a test bug - and it should be monitoring something else to know when the request handler's a finished loading - or if this indicates a broader flaw in the design of how commands like {{add-package}} / {{update-package}} / {{add-requesthandler}} / {{delete-requesthandler}} should interact if/when they occur in close temporal proximity.
> (ie: if there are zkCallback watchers loading classes and initializing objects as a result of cluster property changes, shouldn't there be some sort of lineraization/synchronization logic to ensure that they get executed in the same order on all the nodes in the cluster?)
> ----
> More detail and log file attachments to follow...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org