You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2019/07/09 22:15:00 UTC
[jira] [Updated] (SOLR-13616) Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)

     [ https://issues.apache.org/jira/browse/SOLR-13616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-13616:
----------------------------
    Attachment: SOLR-13616.test-incomplete.patch
                thetaphi_Lucene-Solr-master-Linux_24358.log.txt
        Status: Open  (was: Open)

Unlike most tests that explicitly waitFor/assert *active* replicas, TestPolicyCloud (currently) has several tests that only assert the quanity and location of a replica – it doesn't wait for them to become active, so when testing an ADDREPLICA or a SPLITSHARD, those new replicas are still in recovery (or PrepRecovery) when the test tries to do cleanup and delete the collection – which frequently fails with timeout problems.

While we can certainly "improve" TestPolicyCloud to wait for recoveries to finish, and all replicas to be active before attempting to delete the collection, a better question is why this is needed?

I'm attaching {{thetaphi_Lucene-Solr-master-Linux_24358.log.txt}} which demonstrates the problem in {{TestPolicyCloud.testCreateCollectionAddReplica}} here are some highlights...
{noformat}
# thetaphi_Lucene-Solr-master-Linux_24358.log.txt
#
# testCreateCollectionAddReplica

# bulk of test logic is finished, test has added a replica and confirmed it's on the expected node
# 
# but meanwhile, recovery is still ongoing...

  [junit4]   2> 959699 INFO  (recoveryExecutor-5888-thread-1-processing-n:127.0.0.1:42097_solr x:testCreateCollectionAddReplica_shard1_replica_n3 c:testCreateCollectionAddReplica s:shard1 r:core_node4) [n:127.0.0.1:42097_solr c:testCreateCollectionAddReplica s:shard1 r:core_node4 x:testCreateCollectionAddReplica_shard1_replica_n3 ] o.a.s.c.RecoveryStrategy Sending prep recovery command to [https://127.0.0.1:42097/solr]; [WaitForState: action=PREPRECOVERY&core=testCreateCollectionAddReplica_shard1_replica_n1&nodeName=127.0.0.1:42097_solr&coreNodeName=core_node4&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
   [junit4]   2> 959701 INFO  (qtp531873617-17025) [n:127.0.0.1:42097_solr    x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp Going to wait for coreNodeName: core_node4, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
   [junit4]   2> 959701 INFO  (qtp531873617-17025) [n:127.0.0.1:42097_solr    x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.h.a.PrepRecoveryOp In WaitForState(recovering): collection=testCreateCollectionAddReplica, shard=shard1, thisCore=testCreateCollectionAddReplica_shard1_replica_n1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=down, localState=active, nodeName=127.0.0.1:42097_solr, coreNodeName=core_node4, onlyIfActiveCheckResult=false, nodeProps: core_node4:{
   [junit4]   2>   "core":"testCreateCollectionAddReplica_shard1_replica_n3",
   [junit4]   2>   "base_url":"https://127.0.0.1:42097/solr",
   [junit4]   2>   "state":"down",
   [junit4]   2>   "node_name":"127.0.0.1:42097_solr",
   [junit4]   2>   "type":"NRT"}
   ...

# the test thread moves on to @After method which calls MiniSolrCloudCluster.deleteAllCollections() ...
   ...
   [junit4]   2> 959703 INFO  (qtp531873617-17021) [n:127.0.0.1:42097_solr     ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params name=testCreateCollectionAddReplica&action=DELETE&wt=javabin&version=2 and sendToOCPQueue=true
   ...
   [junit4]   2> 959709 INFO  (OverseerThreadFactory-5345-thread-5-processing-n:127.0.0.1:44991_solr) [n:127.0.0.1:44991_solr     ] o.a.s.c.a.c.OverseerCollectionMessageHandler Executing Collection Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true, asyncId=null
   ...
   [junit4]   2> 959750 INFO  (qtp531873617-17325) [n:127.0.0.1:42097_solr    x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.c.SolrCore [testCreateCollectionAddReplica_shard1_replica_n1]  CLOSING SolrCore org.apache.solr.core.SolrCore@39444e66
   ...
   [junit4]   2> 959753 INFO  (qtp531873617-17325) [n:127.0.0.1:42097_solr    x:testCreateCollectionAddReplica_shard1_replica_n1 ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&deleteMetricsHistory=true&core=testCreateCollectionAddReplica_shard1_replica_n1&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2} status=0 QTime=17


# but meanwhile, PrepRecoveryOp is currently blocked on a call to ZkStateReader.waitForState
# looking for specific conditions for the leader and (new) replica (that needs to recover)
# ... BUT!... the leader core has already been closed, so the watcher never succeeds,
# ...so PrepRecovery keeps waitForState ...

   [junit4]   2> 959855 WARN  (watches-5915-thread-1) [     ] o.a.s.c.c.ZkStateReader Error on calling watcher
   [junit4]   2>           => org.apache.solr.common.SolrException: core not found:testCreateCollectionAddReplica_shard1_replica_n1
   [junit4]   2>        at org.apache.solr.handler.admin.PrepRecoveryOp.lambda$execute$0(PrepRecoveryOp.java:83)
   [junit4]   2> org.apache.solr.common.SolrException: core not found:testCreateCollectionAddReplica_shard1_replica_n1
   [junit4]   2>        at org.apache.solr.handler.admin.PrepRecoveryOp.lambda$execute$0(PrepRecoveryOp.java:83) ~[java/:?]
   [junit4]   2>        at org.apache.solr.common.cloud.ZkStateReader.lambda$waitForState$13(ZkStateReader.java:1742) ~[java/:?]
   [junit4]   2>        at org.apache.solr.common.cloud.ZkStateReader$DocCollectionAndLiveNodesWatcherWrapper.onStateChanged(ZkStateReader.java:2309) ~[java/:?]
   [junit4]   2>        at org.apache.solr.common.cloud.ZkStateReader$Notification.run(ZkStateReader.java:2041) ~[java/:?]
   [junit4]   2>        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
   [junit4]   2>        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
   [junit4]   2>        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[java/:?]
   [junit4]   2>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
   [junit4]   2>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
   [junit4]   2>        at java.lang.Thread.run(Thread.java:835) [?:?]

#  ... repeats 3 times, same millisecond, diff threads
#       (watches-5915-thread-2 & watches-5915-thread-3

# Other then some intranode /admin/metrics requests, *NOTHING* else happens for ~90 seconds
# Until finally the DELETE collection request times out...

   [junit4]   2> 1049792 INFO  (TEST-TestPolicyCloud.testCreateCollectionAddReplica-seed#[CA8FB8D61B016EFE]) [     ] o.a.s.SolrTestCaseJ4 ###Ending testCreateCollectionAddReplica
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestPolicyCloud -Dtests.method=testCreateCollectionAddReplica -Dtests.seed=CA8FB8D61B016EFE -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=yi-001 -Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   92.6s J0 | TestPolicyCloud.testCreateCollectionAddReplica <<<
   [junit4]    > Throwable #1: org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: https://127.0.0.1:42097/solr
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([CA8FB8D61B016EFE:4AAFDDF80A428658]:0)
   [junit4]    >        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:667)
   [junit4]    >        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
   [junit4]    >        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245)
   [junit4]    >        at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
   [junit4]    >        at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
   [junit4]    >        at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
   [junit4]    >        at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
   [junit4]    >        at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
   [junit4]    >        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
   [junit4]    >        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:228)
   [junit4]    >        at org.apache.solr.cloud.MiniSolrCloudCluster.deleteAllCollections(MiniSolrCloudCluster.java:547)
   [junit4]    >        at org.apache.solr.cloud.autoscaling.TestPolicyCloud.after(TestPolicyCloud.java:87)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:835)
   [junit4]    > Caused by: java.net.SocketTimeoutException: Read timed out
   [junit4]    >        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
   [junit4]    >        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
   [junit4]    >        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
   [junit4]    >        at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:448)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1132)
   [junit4]    >        at java.base/sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:828)
{noformat}
This pattern of "RECOVERY happens concurrently with DELETE, lots of wasted time with nothing logged, DELETE times out or fails" repeats in other jenkins jobs, and in other test methods like {{testCreateCollectionSplitShard}} ... when these failures happen, the effects sometimes bleed over into toehr test methods – aparently because the old collection is "partially" deleted – the cores are unloaded, but the collection still exists in the cluster state causing problems when future invokations of the {{@After}} method calls {{MiniSolrCloudCluster.deleteAllCollections()}} ...
{noformat}
   [junit4]   2> 1081521 INFO  (TEST-TestPolicyCloud.testDataProvider-seed#[CA8FB8D61B016EFE]) [     ] o.a.s.SolrTestCaseJ4 ###Ending testDataProvider
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestPolicyCloud -Dtests.method=testDataProvider -Dtests.seed=CA8FB8D61B016EFE -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=yi-001 -Dtests.timezone=America/Indiana/Indianapolis -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   31.7s J0 | TestPolicyCloud.testDataProvider <<<
   [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:42097/solr: Could not find collection : testCreateCollectionAddReplica
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([CA8FB8D61B016EFE:F22410E3A9C36A27]:0)
   [junit4]    >        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656)
   [junit4]    >        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:262)
   [junit4]    >        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:245)
   [junit4]    >        at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
   [junit4]    >        at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
   [junit4]    >        at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
   [junit4]    >        at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
   [junit4]    >        at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
   [junit4]    >        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
   [junit4]    >        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:228)
   [junit4]    >        at org.apache.solr.cloud.MiniSolrCloudCluster.deleteAllCollections(MiniSolrCloudCluster.java:547)
   [junit4]    >        at org.apache.solr.cloud.autoscaling.TestPolicyCloud.after(TestPolicyCloud.java:87)
{noformat}
FWIW, I should note:
 * I've been unable to reproduce this locally
 * the failures happen on diff jenkins servers, on diff branches, and different OSes
 * so far all of the failures i've seen occur when randomized ssl=true, but AFAICT there isn't any obvious indicatiion that the problem relates to the use of ssl – i think that just that the added CPU contention caused by using SSL seems to slow things down enough on the jenkins machine that the race contidion happens
 ** ie: because of SSL slowdown, the recovery isn't fully completed before the DELETE collection cleanup requests are processed

----
My initial hunch/impression was that Overseer (Queue) was getting deadlocked trying to invoke the PrepRecoveryOp (which was blocked waiting for a leader that no longer exists) preventing the DeleteCollectionCmd from executing (which would remove the collection from the ClusterState – the only state change possible to cause the PrepRecoveryOp's waitForState call to abort.

I tried to write a new test forcing this specific situation (attached), using a {{CloseHook}} on the leader {{SolrCore}} and {{waitForState()}} predicate that would gate eachother using {{CountDownLatches}} to try and coerce the specific sequence of events that seemed to be problematic – but i can't seem to force a this type of failure (or even semi-reliably trigger it given that the ZkStateReader executes watchers in a random order).

If anyone else has any ideas what's going on here i'm all ears ... in the meantime I'm going to try and work up some general improvements to TestPolicyCLoud to make the assertions tighter and insure all the expected replicas are fully active before finishing the test.

> Possible racecondition/deadlock between collection DELETE and PrepRecovery ? (TestPolicyCloud failures)
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13616
>                 URL: https://issues.apache.org/jira/browse/SOLR-13616
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13616.test-incomplete.patch, thetaphi_Lucene-Solr-master-Linux_24358.log.txt
>
>
> Based on some recent jenkins failures in TestPolicyCloud, I suspect there is a possible deadlock condition when attempting to delete a collection while recovery is in progress.
> I haven't been able to identify exactly where/why/how the problem occurs, but it does not appear to be a test specific problem, and seems like it could potentially affect anyone unlucky enough to issue poorly timed DELETE.
> Details to follow in comments...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org