You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Mark Hanson (Jira)" <ji...@apache.org> on 2021/11/03 07:06:00 UTC
[jira] [Commented] (GEODE-9425) AutoConnectionSource thread in client can't query for available locators when it is connected to a locator that was shut down

    [ https://issues.apache.org/jira/browse/GEODE-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17437768#comment-17437768 ] 

Mark Hanson commented on GEODE-9425:
------------------------------------

Are there more logs available?

> AutoConnectionSource thread in client can't query for available locators when it is connected to a locator that was shut down
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-9425
>                 URL: https://issues.apache.org/jira/browse/GEODE-9425
>             Project: Geode
>          Issue Type: Bug
>          Components: client/server
>    Affects Versions: 1.15.0
>            Reporter: Lynn Gallinat
>            Assignee: Mark Hanson
>            Priority: Major
>
> The AutoConnectionSource thread runs in a client and queries the locator that client is connected to so it can update the list of available locators.
>  But if the locator the client is connected to was shut down, the client can't get an updated locator list.
>  In this case the locator was shut down and is not coming back, but there is another available locator.
>  However we can't find out what that available locator is because we can't complete the query.
> To summarize: The AutoConnectionSource thread that runs in a client to update the list of available locators should be able to get a list of available locators even when that client is connected to a locator that was shut down.
> The AutoConnectionSource thread starts and runs every 10 seconds. This is from the client's system log.
>  [info 2021/07/07 19:37:33.723 GMT clientgemfire1_host1_881 <vm_0_thr_0_client1_host1_881> tid=0x2d] AutoConnectionSource UpdateLocatorListTask started with interval=10000 ms.
> After the locator is shut down the AutoConnectionSource thread can't complete its work so we get stuck threads.
> This stuck thread stack shows it is trying to run UpdateLocatorListTask.
> {noformat}
> clientgemfire1_881/system.log: [warn 2021/07/07 19:47:25.784 GMT clientgemfire1_host1_881 <ThreadsMonitor> tid=0x36] Thread <286> (0x11e) that was executed at <07 Jul 2021 19:46:03 GMT> has been stuck for <82.041 seconds> and number of thread monitor iteration <1>
> Thread Name <poolTimer-pool-24> state <RUNNABLE>
> Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
> Monitored metric <ResourceManagerStats.numThreadsStuck>
> Thread stack for "poolTimer-pool-24" (0x11e):
> java.lang.ThreadState: RUNNABLE (in native)
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>     - locked java.net.SocksSocketImpl@3e95a505
>   at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:607)
>   at org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
>   at org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
>   at org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
>   at org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
>   at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
>   at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
>   at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
>   at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
>   at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.access$200(AutoConnectionSourceImpl.java:68)
>   at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl$UpdateLocatorListTask.run2(AutoConnectionSourceImpl.java:458)
>   at org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:285)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>   - java.util.concurrent.ThreadPoolExecutor$Worker@24cd39b5
> {noformat}
> Impact on running cache operations:
>  Any operations in progress by the client connected to a locator that was shut down can take 59 seconds to complete, which is the default socket connect timeout.
>  From org.apache.geode.cache.client.PoolFactory:
>  int DEFAULT_SOCKET_CONNECT_TIMEOUT = 59000;
> If the client's connection pool has multiple locators listed, then we can see a 59 second timeout for each of those that has been shut down so an operation could take minutes to complete.
>  Here is a stack for such a timeout:
> {noformat}
> clientgemfire1_881/system.log: [info 2021/07/07 19:48:57.936 GMT clientgemfire1_host1_881 <queueTimer-pool1> tid=0x85] Communication with locator gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334 failed
> java.net.SocketTimeoutException: connect timed out
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:607)
>         at org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
>         at org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
>         at org.apache.geode.distributed.internal.tcpserver.ClusterSocketCreatorImpl.connect(ClusterSocketCreatorImpl.java:96)
>         at org.apache.geode.distributed.internal.tcpserver.TcpClient.getServerVersion(TcpClient.java:246)
>         at org.apache.geode.distributed.internal.tcpserver.TcpClient.requestToServer(TcpClient.java:151)
>         at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocatorUsingConnection(AutoConnectionSourceImpl.java:217)
>         at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryOneLocator(AutoConnectionSourceImpl.java:207)
>         at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.queryLocators(AutoConnectionSourceImpl.java:254)
>         at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServersForQueue(AutoConnectionSourceImpl.java:188)
>         at org.apache.geode.cache.client.internal.QueueManagerImpl.findQueueServers(QueueManagerImpl.java:823)
>         at org.apache.geode.cache.client.internal.QueueManagerImpl.recoverRedundancy(QueueManagerImpl.java:640)
>         at org.apache.geode.cache.client.internal.QueueManagerImpl.access$700(QueueManagerImpl.java:77)
>         at org.apache.geode.cache.client.internal.QueueManagerImpl$RedundancySatisfierTask.run2(QueueManagerImpl.java:1449)
>         at org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1334)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Then 59 seconds later there can be another timeout while the client tries to talk to a different locator that was shut down.
> Notes on how to reproduce:
>  Need servers and at least 2 locators.
>  Need a client connected to a locator.
> This is the client's pool configuration from my example where there are 4 locators to start with, and all are running.
> {noformat}
> [info 2021/07/07 19:37:33.715 GMT <vm_0_thr_0_client1_host1_881> tid=0x2d] Creating pool named: pool with attributes
>     freeConnectionTimeout: 10000
>     idleTimeout: 5000
>     loadConditioningInterval: 5000
>     locators: [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.17.239:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.22.157:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.10.237:10334, gemfire-cluster-locator-3.gemfire-cluster-locator.namespace-1424574601.svc.cluster.local/10.20.18.108:10334]
>     minConnections: 2
>     maxConnections: -1
>     multiuserAuthentication: false
>     pingInterval: 10000
>     poolName: pool
>     prSingleHopEnabled: true
>     readTimeout: 3600000
>     retryAttempts: -1
>     servers: []
>     serverGroup:
>     socketBufferSize: 32768
>     statisticInterval: -1
>     subscriptionAckInterval: 100
>     subscriptionEnabled: true
>     subscriptionMessageTrackingTimeout: 900000
>     subscriptionRedundancy: 1
> {noformat}
> After a client connects to the locator, that locator is shut down.
> In this example, 3 locators were shutdown so the only one left was gemfire-cluster-locator-0.
>  Note that the view in the system log for gemfire-cluster-locator-0 shows that it knows there is now only one locator.
> {noformat}
> [info 2021/07/07 19:38:04.420 GMT gemfire-cluster-locator-0 <Geode Membership View Creator> tid=0xc8] sending new view View[gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000|16] members: [gemfire-cluster-locator-0(gemfire-cluster-locator-0:1:locator)<ec><v2>:41000, gemfire-cluster-server-2(gemfire-cluster-server-2:1)<v3>:41000{lead}, gemfire-cluster-server-5(gemfire-cluster-server-5:1)<v4>:41000, gemfire-cluster-server-3(gemfire-cluster-server-3:1)<v6>:41000, gemfire-cluster-server-4(gemfire-cluster-server-4:1)<v7>:41000, gemfire-cluster-server-0(gemfire-cluster-server-0:1)<v8>:41000, gemfire-cluster-server-1(gemfire-cluster-server-1:1)<v9>:41000] shutdown: [gemfire-cluster-server-6(gemfire-cluster-server-6:1)<v5>:41000]
> {noformat}
> After an appropriate amount of time the stuck thread stack appears in the client's system.log.
> If operations were in progress in that client, they timeout after 59 seconds and try another locator, possibly another stopped locator and then it waits another 59 seconds.
> This can repeat for each locator that was shut down making an operation take minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)