Posted to commits@cassandra.apache.org by "Alaykumar Barochia (Jira)" <ji...@apache.org> on 2022/11/28 06:52:00 UTC
[jira] [Created] (CASSANDRA-18075) Upgraded (C* 4.0.4) node stops communicating with older version (3.11.4) nodes during upgrade
Alaykumar Barochia created CASSANDRA-18075:
----------------------------------------------
Summary: Upgraded (C* 4.0.4) node stops communicating with older version (3.11.4) nodes during upgrade
Key: CASSANDRA-18075
URL: https://issues.apache.org/jira/browse/CASSANDRA-18075
Project: Cassandra
Issue Type: Bug
Components: Feature/Encryption
Reporter: Alaykumar Barochia
Attachments: cassandra-env.sh_3114, cassandra-env.sh_404, cassandra.yaml_3114, cassandra.yaml_404, system.log_10.110.44.207
We are testing an upgrade from Cassandra 3.11.4 to 4.0.4 on our SSL-enabled test cluster and have run into an issue.
Our cluster size is 3x3 (two datacenters with three nodes each).
{noformat}
Datacenter: abssl_dev_tap_ttc
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.109.6.153 94.27 KiB 16 100.0% 130e59d2-2a9a-4039-a42f-deb20afcf288 rack1
UN 10.109.45.8 104.43 KiB 16 100.0% 35274a2c-f915-4308-9981-d207a4e2108f rack1
UN 10.109.66.149 104.23 KiB 16 100.0% ea0151bc-fb6c-425d-af42-75c10e52f941 rack1
Datacenter: abssl_dev_tap_tte
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.110.4.110 104.44 KiB 16 100.0% fd4a9fa8-f2a9-494c-afb8-7cb8a08c7554 rack1
UN 10.110.44.220 99.33 KiB 16 100.0% f1dc35c0-a1c2-45fe-9f65-b1cc3d7f6947 rack1
UN 10.110.49.242 65.57 KiB 16 100.0% 72bc4ae5-876d-4d0a-91ac-6cf8b531b4dd rack1
dbaasprod-ca-abssl-de-393671-v001-yqlvf:~# nodetool describecluster
Cluster Information:
Name: abssl_dev
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
f68fbc0c-c9d6-3709-8075-c5a0d74192f2: [10.110.4.110, 10.110.44.220, 10.109.6.153, 10.109.45.8, 10.109.66.149, 10.110.49.242]
{noformat}
During the upgrade, we re-run the pipeline, which gives us a new server (with a different IP) running the Cassandra 4.0.4 binary.
The '/data' disk (which contains data files, commit logs, etc.) is detached from the old server and attached to the new server.
This process works fine on a non-SSL cluster, but when we perform it on an SSL cluster, the new node stops communicating with the rest of the nodes.
In this example, after the upgrade, node 10.110.4.110 was replaced by a new server with the new IP 10.110.44.207.
*Output from 3.11.4 node:*
{noformat}
dbaasprod-ca-abssl-dc-437097-v001-7mump:~# hostname -i
10.109.6.153
dbaasprod-ca-abssl-dc-437097-v001-7mump:~# java -version
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (Temurin)(build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (Temurin)(build 25.322-b06, mixed mode)
dbaasprod-ca-abssl-dc-437097-v001-7mump:~# nodetool status
Datacenter: abssl_dev_tap_ttc
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.109.6.153 135.24 KiB 16 100.0% 130e59d2-2a9a-4039-a42f-deb20afcf288 rack1
UN 10.109.45.8 135.35 KiB 16 100.0% 35274a2c-f915-4308-9981-d207a4e2108f rack1
UN 10.109.66.149 135.25 KiB 16 100.0% ea0151bc-fb6c-425d-af42-75c10e52f941 rack1
Datacenter: abssl_dev_tap_tte
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN 10.110.4.110 104.44 KiB 16 100.0% fd4a9fa8-f2a9-494c-afb8-7cb8a08c7554 rack1
UN 10.110.44.220 104.44 KiB 16 100.0% f1dc35c0-a1c2-45fe-9f65-b1cc3d7f6947 rack1
UN 10.110.49.242 65.57 KiB 16 100.0% 72bc4ae5-876d-4d0a-91ac-6cf8b531b4dd rack1
dbaasprod-ca-abssl-dc-437097-v001-7mump:~# nodetool describecluster
Cluster Information:
Name: abssl_dev
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
f68fbc0c-c9d6-3709-8075-c5a0d74192f2: [10.110.44.220, 10.109.6.153, 10.109.45.8, 10.109.66.149, 10.110.49.242]
UNREACHABLE: [10.110.4.110]
{noformat}
*Output from 4.0.4 node:*
{noformat}
dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# hostname -i
10.110.44.207
dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# java -version
openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)
dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
DN 10.109.6.153 ? 16 0.0% 130e59d2-2a9a-4039-a42f-deb20afcf288 r1
DN 10.109.45.8 ? 16 0.0% 35274a2c-f915-4308-9981-d207a4e2108f r1
DN 10.109.66.149 ? 16 0.0% ea0151bc-fb6c-425d-af42-75c10e52f941 r1
DN 10.110.44.220 ? 16 0.0% f1dc35c0-a1c2-45fe-9f65-b1cc3d7f6947 r1
DN 10.110.49.242 ? 16 0.0% 72bc4ae5-876d-4d0a-91ac-6cf8b531b4dd r1
Datacenter: abssl_dev_tap_tte
=============================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.110.44.207 146.27 KiB 16 100.0% fd4a9fa8-f2a9-494c-afb8-7cb8a08c7554 rack1
dbaasprod-ca-abssl-de-393671-v003-dxpyv:~# nodetool describecluster
Cluster Information:
Name: abssl_dev
Snitch: org.apache.cassandra.locator.GossipingPropertyFileSnitch
DynamicEndPointSnitch: disabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
1ccaeb62-5816-3599-897f-de59fd56eef2: [10.110.44.207]
UNREACHABLE: [10.109.45.8, 10.109.66.149, 10.110.44.220, 10.109.6.153, 10.110.49.242]
Stats for all nodes:
Live: 1
Joining: 0
Moving: 0
Leaving: 0
Unreachable: 5
Data Centers:
DC1 #Nodes: 5 #Down: 0
abssl_dev_tap_tte #Nodes: 1 #Down: 0
Database versions:
: [10.109.45.8:7000, 10.109.66.149:7000, 10.110.44.220:7000, 10.109.6.153:7000, 10.110.49.242:7000]
4.0.4: [10.110.44.207:7000]
Keyspaces:
system_schema -> Replication class: LocalStrategy {}
system -> Replication class: LocalStrategy {}
system_auth -> Replication class: NetworkTopologyStrategy {abssl_dev_tap_tte=3, abssl_dev_tap_ttc=3}
system_distributed -> Replication class: NetworkTopologyStrategy {abssl_dev_tap_tte=3, abssl_dev_tap_ttc=3}
system_traces -> Replication class: NetworkTopologyStrategy {abssl_dev_tap_tte=3, abssl_dev_tap_ttc=3}
{noformat}
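The two views above disagree: the 3.11.4 node sees one peer down, while the 4.0.4 node sees all five peers down. As a rough aid for comparing such outputs across nodes, here is a minimal Python sketch that lists the down nodes per datacenter from captured `nodetool status` text. The function name `down_nodes_by_datacenter` is hypothetical, and the parsing assumes the column layout shown in the outputs above.

```python
import re

# Matches a node row such as:
#   "DN 10.109.6.153 ? 16 0.0% 130e59d2-... r1"
# First letter: U (up) or D (down); second letter: N/L/J/M state.
NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)\s")


def down_nodes_by_datacenter(status_text):
    """Map each datacenter name to the addresses reported as down (D*).

    Assumes the text is the captured output of `nodetool status`
    in the layout shown in this report.
    """
    result = {}
    datacenter = None
    for raw_line in status_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Datacenter:"):
            # Start collecting rows for a new datacenter section.
            datacenter = line.split(":", 1)[1].strip()
            result.setdefault(datacenter, [])
            continue
        match = NODE_ROW.match(line)
        if match and datacenter is not None and match.group(1).startswith("D"):
            result[datacenter].append(match.group(2))
    return result
```

Run against the 4.0.4 output above, this would report the five older nodes under `DC1` and nothing under `abssl_dev_tap_tte`.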
The system.log file of the new node 10.110.44.207, which runs Cassandra 4.0.4, shows the following errors:
{noformat}
WARN [Messaging-EventLoop-3-6] 2022-11-28 06:20:49,577 NoSpamLogger.java:95 - /10.110.44.207:7000->/10.109.45.8:7000-URGENT_MESSAGES-[no-channel] dropping message of type GOSSIP_DIGEST_SYN whose timeout expired before reaching the network
INFO [Messaging-EventLoop-3-6] 2022-11-28 06:21:17,921 NoSpamLogger.java:92 - /10.110.44.207:7000->/10.110.49.242:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.110.49.242:7000
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
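The log above shows the 4.0.4 node being refused on port 7000 of its peers. One way to narrow this down is to check, from the new node, which internode ports the older nodes actually accept connections on. The following is a minimal Python sketch of such a check, not a diagnosis. It assumes the 3.11.4 nodes may be listening for encrypted internode traffic on the `ssl_storage_port` (7001 by default in 3.11) rather than on 7000; the attached cassandra.yaml files would confirm the real values. The function names `port_accepts_connections` and `probe_peers` are hypothetical.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a plain TCP connection to host:port succeeds.

    This only tests that something is listening; it does not attempt
    a TLS handshake or speak the Cassandra messaging protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def probe_peers(peers, ports, timeout=2.0):
    """Map (host, port) -> True/False for every peer and port combination."""
    return {
        (host, port): port_accepts_connections(host, port, timeout)
        for host in peers
        for port in ports
    }
```

For example, calling `probe_peers(["10.109.45.8", "10.110.49.242"], [7000, 7001])` from 10.110.44.207 would show whether those peers refuse 7000 while accepting 7001, which would be consistent with the "Connection refused" errors in the log.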
I am attaching the cassandra.yaml and cassandra-env.sh files from both versions (3.11.4 and 4.0.4).
I am also attaching the system.log file from the upgraded node 10.110.44.207.
This looks like a bug, so I am raising this Jira. Could you please take a look?
Let me know if you need any more details.
Thanks,
Alaykumar Barochia