Posted to commits@cassandra.apache.org by "Dmitry Konstantinov (Jira)" <ji...@apache.org> on 2021/03/22 12:45:00 UTC

[jira] [Created] (CASSANDRA-16531) cqlsh can get stuck randomly during a connect-timeout period in Kubernetes

Dmitry Konstantinov created CASSANDRA-16531:
-----------------------------------------------

             Summary: cqlsh can get stuck randomly during a connect-timeout period in Kubernetes
                 Key: CASSANDRA-16531
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16531
             Project: Cassandra
          Issue Type: Bug
          Components: Tool/cqlsh
            Reporter: Dmitry Konstantinov
         Attachments: cqlsh_debug.txt, pystack-2868.txt

Hello, we are hitting the following issue with cqlsh in a Kubernetes environment.

When cqlsh is executed against a headless Cassandra service (the one configured for the Cassandra pods), it can get stuck randomly for a connect-timeout period.
We reproduced the issue using the following command:
{code:java}
for i in {1..1000}; do cqlsh --debug --connect-timeout="600" -u cassandra -p cassandra cassandra.env-1-cassandra.svc 9042 -k keyspace1_test -e "select cql_version from system.local"; sleep 1; done
 {code}
If a specific Cassandra node hostname or IP is provided instead of the service DNS name, the issue is not reproduced.
An additional observation: the hang happens in cqlsh after the result of the select has been printed, so the select query itself executes normally.

The following logic was added to cqlsh.py to capture debug logs from the Cassandra Python driver:
{code:java}
import logging

logging.basicConfig(filename='/tmp/cqlsh.log', level=logging.DEBUG,
                    format='%(asctime)s | %(levelname)-8s | %(threadName)s | %(module)s:%(name)s | %(lineno)04d | %(message)s')
logging.info('=================================================')
{code}
The log captured for a cqlsh invocation where the issue reproduced: [^cqlsh_debug.txt]
According to the log, cassandra.env-1-cassandra.svc, provided as the contact point to cqlsh, is immediately resolved to 3 node IPs, before any discovery queries are sent to the Cassandra servers:
{code:java}
2021-03-18 21:37:11,566 | DEBUG    | MainThread | pool:cassandra.pool | 0139 | Host 10.129.19.19 is now marked up
2021-03-18 21:37:11,566 | DEBUG    | MainThread | pool:cassandra.pool | 0139 | Host 10.129.205.119 is now marked up
2021-03-18 21:37:11,566 | DEBUG    | MainThread | pool:cassandra.pool | 0139 | Host 10.130.169.162 is now marked up
{code}
The corresponding logic in cassandra/cluster.py from the Cassandra Python driver:
{code:java}
self.contact_points_resolved = [endpoint[4][0] for a in self.contact_points
                                for endpoint in socket.getaddrinfo(a, self.port, socket.AF_UNSPEC, socket.SOCK_STREAM)]
{code}
So the driver resolves all IPs behind the DNS name, e.g.:
{code:java}
nslookup cassandra.env-1-cassandra
Server:         172.30.0.10
Address:        172.30.0.10#53

Name:   cassandra.env-1-cassandra.svc.cluster.local
Address: 10.129.205.119
Name:   cassandra.env-1-cassandra.svc.cluster.local
Address: 10.129.38.98
Name:   cassandra.env-1-cassandra.svc.cluster.local
Address: 10.130.169.162
{code}
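For illustration, the same expansion can be reproduced standalone. A minimal sketch mirroring the driver's list comprehension above (the service name is the one from the repro and only resolves inside such a Kubernetes cluster):
{code:java}
import socket

# Mirrors Cluster.contact_points_resolved from cassandra/cluster.py:
# every A record of the headless service becomes a separate contact point.
contact_points = ["cassandra.env-1-cassandra.svc"]
port = 9042
resolved = [endpoint[4][0]
            for a in contact_points
            for endpoint in socket.getaddrinfo(a, port, socket.AF_UNSPEC, socket.SOCK_STREAM)]
print(resolved)  # e.g. ['10.129.205.119', '10.129.38.98', '10.130.169.162']
{code}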
The driver then uses all of these IPs as contact points.
Several threads are used to work with connections in the driver; Thread-1 and Thread-2 handle events related to connection state changes and execute reconnects.
Once cqlsh has executed a query and received the result, it triggers a graceful shutdown of the driver instance.
The shutdown logic closes the connections:
{code:java}
2021-03-18 21:37:14,140 | DEBUG    | Dummy-3 | asyncorereactor:cassandra.io.asyncorereactor | 0368 | Closing connection (139784430944720) to 10.129.205.119
2021-03-18 21:37:14,140 | DEBUG    | Dummy-3 | asyncorereactor:cassandra.io.asyncorereactor | 0375 | Closed socket to 10.129.205.119
{code}
At this moment, in parallel, the host-event handling logic sometimes decides to reconnect to a host (a race condition):
{code:java}
2021-03-18 21:37:14,145 | DEBUG    | Thread-2 | cluster:cassandra.cluster | 2792 | [control connection] Attempting to reconnect
2021-03-18 21:37:14,145 | DEBUG    | Thread-2 | cluster:cassandra.cluster | 2742 | [control connection] Opening new connection to 10.129.205.119
{code}
The reconnecting thread then blocks for the full connect timeout inside the reconnection attempt.
{code:java}
2021-03-18 21:37:14,194 | DEBUG    | Dummy-3 | cluster:cassandra.cluster | 3245 | Shutting down Cluster Scheduler
2021-03-18 21:37:14,245 | DEBUG    | Dummy-3 | cluster:cassandra.cluster | 2848 | Shutting down control connection

<--- cqlsh is stuck here for the whole connect timeout (600 seconds) -->

2021-03-18 21:47:14,146 | DEBUG    | Thread-2 | connection:cassandra.connection | 0639 | Not sending options message for new connection(139784430946896) to 10.129.205.119 because compression is disabled and a cql version was not specified
2021-03-18 21:47:14,147 | DEBUG    | Thread-2 | connection:cassandra.connection | 0712 | Sending StartupMessage on <AsyncoreConnection(139784430946896) 10.129.205.119:9042>
2021-03-18 21:47:14,147 | DEBUG    | Thread-2 | connection:cassandra.connection | 0720 | Sent StartupMessage on <AsyncoreConnection(139784430946896) 10.129.205.119:9042>
2021-03-18 21:47:14,147 | DEBUG    | Thread-2 | asyncorereactor:cassandra.io.asyncorereactor | 0368 | Closing connection (139784430946896) to 10.129.205.119
2021-03-18 21:47:14,147 | DEBUG    | Thread-2 | asyncorereactor:cassandra.io.asyncorereactor | 0375 | Closed socket to 10.129.205.119
2021-03-18 21:47:14,147 | DEBUG    | Thread-2 | connection:cassandra.connection | 0761 | Connection to 10.129.205.119 was closed during the startup handshake
2021-03-18 21:47:14,147 | WARNING  | Thread-2 | cluster:cassandra.cluster | 2731 | [control connection] Error connecting to 10.129.205.119:
Traceback (most recent call last):
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2724, in _reconnect_internal
    return self._try_connect(host)
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2746, in _try_connect
    connection = self._cluster.connection_factory(host.address, is_control_connection=True)
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 1138, in connection_factory
    return self.connection_class.factory(address, self.connect_timeout, *args, **kwargs)
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/connection.py", line 341, in factory
    raise OperationTimedOut("Timed out creating connection (%s seconds)" % timeout)
OperationTimedOut: errors=Timed out creating connection (600 seconds), last_host=None
{code}
After that, the driver realises that the connection was closed due to shutdown, and the reconnect task finally completes:
{code:java}
2021-03-18 21:47:14,147 | DEBUG    | Thread-2 | cluster:cassandra.cluster | 2814 | [control connection] error reconnecting
Traceback (most recent call last):
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2794, in _reconnect
    self._set_new_connection(self._reconnect_internal())
  File "/usr/share/cassandra/lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2733, in _reconnect_internal
    raise DriverException("[control connection] Reconnection in progress during shutdown")
DriverException: [control connection] Reconnection in progress during shutdown
{code}
Once the task completes, the thread pool containing Thread-1 and Thread-2 is stopped and the cqlsh process terminates.

The same is visible in the thread dumps captured for cqlsh ([^pystack-2868.txt]).
The thread that is trying to reconnect (waiting out the connect timeout):
{code:java}
File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2794, in _reconnect
 self._set_new_connection(self._reconnect_internal())
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2724, in _reconnect_internal
 return self._try_connect(host)
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 2746, in _try_connect
 connection = self._cluster.connection_factory(host.address, is_control_connection=True)
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 1138, in connection_factory
 return self.connection_class.factory(address, self.connect_timeout, *args, **kwargs)
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/connection.py", line 332, in factory
 conn = cls(host, *args, **kwargs)
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/io/asyncorereactor.py", line 355, in __init__
 init_handler.wait(kwargs["connect_timeout"])
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/io/asyncorereactor.py", line 77, in wait
 self.event.wait(timeout)
 File "/usr/local/lib/python2.7/threading.py", line 614, in wait
 self.__cond.wait(timeout)
 File "/usr/local/lib/python2.7/threading.py", line 359, in wait
 _sleep(delay)
{code}
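The bottom frames are a plain threading.Event wait: once the shutdown thread has closed the socket, nothing will ever set the event, so the wait only returns when the full timeout expires. A minimal standalone illustration of that wait (not driver code; the 600-second timeout is shortened here):
{code:java}
import threading
import time

# Stand-in for the connection's init_handler event in the driver;
# the handshake that would have set it was aborted when the socket was closed.
connected = threading.Event()

start = time.time()
connected.wait(3)  # nobody calls connected.set(), so this returns only at the timeout
print("waited %.1f s" % (time.time() - start))  # ~3.0
{code}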
Meanwhile, the main thread is waiting in the atexit handler for the thread pool (the Cluster.executor field) shutdown to complete:
{code:java}
 File "/usr/local/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
 func(*targs, **kargs)
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 181, in _shutdown_clusters
 cluster.shutdown()
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/cassandra-driver-internal-only-3.11.0-bb96859b.zip/cassandra-driver-3.11.0-bb96859b/cassandra/cluster.py", line 1264, in shutdown
 self.executor.shutdown()
 File "/opt/tools/apache-cassandra/3.11.5/bin/../lib/futures-2.1.6-py2.py3-none-any.zip/concurrent/futures/thread.py", line 137, in shutdown
 t.join()
 File "/usr/local/lib/python2.7/threading.py", line 940, in join
 self.__block.wait()
 File "/usr/local/lib/python2.7/threading.py", line 340, in wait
 waiter.acquire()
 File "<string>", line 1, in <module>
 File "<string>", line 1, in <module>
{code}
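Since executor.shutdown() joins the pool threads without any timeout, the whole process has to wait until the reconnecting thread's Event.wait expires. A minimal sketch of this interplay (not driver code; names are stand-ins and the timeout is shortened):
{code:java}
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CONNECT_TIMEOUT = 3  # seconds; 600 in the reported case

def reconnect():
    # Stand-in for the control connection reconnect: the socket is already
    # closed by the shutdown thread, so the event is never set.
    threading.Event().wait(CONNECT_TIMEOUT)

executor = ThreadPoolExecutor(max_workers=2)  # stand-in for Cluster.executor
executor.submit(reconnect)
time.sleep(0.1)  # let the reconnect attempt start first, as in the race

start = time.time()
executor.shutdown(wait=True)  # stand-in for Cluster.shutdown() joining its threads
print("shutdown took %.1f s" % (time.time() - start))  # ~CONNECT_TIMEOUT
{code}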
The difference between specifying a single IP/hostname and specifying a service name is in how many IPs are resolved and, consequently, how many contact points are available for establishing the control connection during startup.

So, it looks like the root cause of the delay is a race condition in the Cassandra Python driver bundled with the cqlsh tool: the race happens on shutdown in the cases where the DNS name given to cqlsh as a contact point resolves to multiple IPs.

The issue was found in Cassandra 3.11.5, but the same applies to 3.11.10 (the bundled Python driver build is the same: cassandra-driver-3.11.0-bb96859b).



