You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Kamilla Aslami (Jira)" <ji...@apache.org> on 2022/03/01 17:32:00 UTC

[jira] [Commented] (GEODE-9512) java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte being 1

    [ https://issues.apache.org/jira/browse/GEODE-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499672#comment-17499672 ] 

Kamilla Aslami commented on GEODE-9512:
---------------------------------------

server-1 is in the process of shutting down:

 
{noformat}
[info 2021/08/16 20:40:23.559 GMT system-test-gemfire-server-1 <ReconnectThread> tid=0x8e] Now closing distribution for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000
[info 2021/08/16 20:40:23.559 GMT system-test-gemfire-server-1 <ReconnectThread> tid=0x8e] DistributionManager stopped in 103ms.
{noformat}
 

The reader (locator-0) receives handshake message sent by server-1. It reads the message from the buffer and then checks if the sender of the message is in the view. server-1 wasn't in the view anymore at that time, so the reader logs:
{noformat}
[warn 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P message reader@6673b475 timed out during a membership check.{noformat}
The log message comes from this code block in readHandshakeForReceiver():
{noformat}
  final String authInit = System.getProperty(SECURITY_SYSTEM_PREFIX + SECURITY_PEER_AUTH_INIT);
  final boolean isSecure = authInit != null && !authInit.isEmpty();
  if (isSecure) {
    if (owner.getConduit().waitForMembershipCheck(remoteMember)) {
      sendOKHandshakeReply();
      notifyHandshakeWaiter(true);
    } else {
      // check if we need notifyHandshakeWaiter() call.
      notifyHandshakeWaiter(false);
      logger.warn("{} timed out during a membership check.",
          p2pReaderName());
      return true;
    }
  } else {{noformat}
If peer authentication is enabled, the receiver checks if the member is in the view by calling waitForMembershipCheck(). I suspect the idea behind this is that a node needs to authenticate to join the cluster, and if a sender is in the view, we can assume that it has been successfully authenticated.If the membership check failed (=returned false), the receiver calls notifyHandshakeWaiter(false):
{noformat}
  void notifyHandshakeWaiter(boolean success) {
    synchronized (handshakeSync) {
      if (handshakeRead || handshakeCancelled) {
        return;
      }
      clearSSLInputBuffer();
      if (success) {
        handshakeRead = true;
      } else {
        handshakeCancelled = true;
      }
      handshakeSync.notifyAll();
    }
  }{noformat}
This method sets either handshakeRead or handshakeCancelled to true and notifies the handshake waiter about the change. I assume that in our case, the handshake waiter is system-test-gemfire-server-1 that is in the process of shutting down and might not receive the notification. In our scenario, notifyHandshakeWaiter method sets handshakeCancelled = true.

After that, readHandshakeForReceiver returns to the caller - processInputBuffer. This method has a loop that keeps reading messaged from buffer until it's empty. In this loop, there's this code block that checks handshakeRead:
{noformat}
  if (handshakeRead) {
    try {
      readMessage(peerDataBuffer, threadMonitorExecutor);
    } catch (SerializationException e) {
      logger.info("input buffer startPos {} oldLimit {}", startPos, oldLimit);
      throw e;
    }
  } else {
    try (ByteBufferInputStream bbis = new ByteBufferInputStream(peerDataBuffer);
        DataInputStream dis = new DataInputStream(bbis)) {
      if (!isReceiver) {
        readHandshakeForSender(dis, peerDataBuffer);
        return;
      }
      if (readHandshakeForReceiver(dis)) {
        ioFilter.doneReading(peerDataBuffer);
        return;
      }
    }
  }{noformat}
If handshakeRead is set to true, the method reads a regular message otherwise, it expects to read the handshake.

So in our case, we already read the handshake by calling readHandshakeForReceiver once. But readHandshakeForReceiver never set handshakeRead to true because waitForMembershipCheck failed. So in the next iteration, processInputBuffer will call readHandshakeForReceiver again (it doesn't check handshakeCancelled) even if the next message in the buffer isn't the handshake.Note that if there's nothing to read from the socket after waitForMembershipCheck failed, we don't get the failure - the connection will just get closed. But if there was another message after the handshake message, the receiver will readHandshakeForReceiver again and throw the IllegalStateException.

To fix this issue, we should close the connection from the receiving side if waitForMembershipCheck failed, i.e:

 
{noformat}
  if (owner.getConduit().waitForMembershipCheck(remoteMember)) {
    sendOKHandshakeReply();
    notifyHandshakeWaiter(true);
  } else {
    // check if we need notifyHandshakeWaiter() call.
    notifyHandshakeWaiter(false);
==>    requestClose(p2pReaderName() + "timed out during a membership check.");
    return true;
  }
 
{noformat}
 

> java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte being 1
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-9512
>                 URL: https://issues.apache.org/jira/browse/GEODE-9512
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>    Affects Versions: 1.15.0
>            Reporter: Juan Ramos
>            Assignee: Kamilla Aslami
>            Priority: Major
>
> I've hit this issue while executing some chaos testing over a GemFire cluster using 2 locators and 3 servers; {{SSL}} is enabled, and a dummy {{SecurityManager}} is configured which authenticates any user and always returns {{true}} within the {{authorize}} method. There are 3 {{PARTITION_REDUNDANT}} regions configured, one per client, each with 1 redundant copy. Once the cluster is up and running, 3 clients continuously execute {{Region.put}} operations on a well known set of keys for its own {{Region}} (created with {{PROXY}} type), and another process executes the following logic in parallel (pseudocode):
> {noformat}
> for server in ${servers}
> do
> 	# Pause the JVM for 30 seconds to simulate a stop the world GC
> 	kill -STOP server 
> 	sleep 30
> 	# Unpause the JVM, wait for member to reconnect and regions to recover redundancy configured
> 	kill -CONT "${SERVER_PID}"
> 	waitForReconnectcompletedInServerLog
> 	waitForNumBucketsWithoutRedundancyToBeZeroInGfshShowRegionMetrics
> done
> {noformat}
> The test works fine most of the time, but randomly fails due to an unexpected exception logged within the logs for at least one locator. The exception is always reported from a {{P2P message reader}} thread on the locator for a server member that has just returned to life, as an example:
> {noformat}
> #### LOCATOR-0
> [warn 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f] P2P message reader@354fac47 timed out during a membership check.
> [fatal 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> tid=0x6f] Error deserializing P2P handshake message
> java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte being 1
> 	at org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
> 	at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:829)
> #### SERVER-2
> [warn 2021/08/17 05:20:44.012 GMT system-test-gemfire-server-2 <StatSampler> tid=0x35] Statistics sampling thread detected a wakeup delay of 29070 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> [warn 2021/08/17 05:20:44.015 GMT system-test-gemfire-server-2 <Geode Heartbeat Sender> tid=0x23] Failure detection heartbeat-generation thread overslept by more than a full period. Asleep time: 31,175,291,931 nanoseconds. Period: 2,500,000,000 nanoseconds.
> [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast receiver,system-test-gemfire-server-2-25596> tid=0x1e] saving cache server configuration for use with the cluster-configuration service on reconnect
> [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast receiver,system-test-gemfire-server-2-25596> tid=0x1e] cache server configuration saved
> [info 2021/08/17 05:20:44.233 GMT system-test-gemfire-server-2 <DisconnectThread> tid=0xe5] Stopping membership services
> [info 2021/08/17 05:20:44.455 GMT system-test-gemfire-server-2 <ReconnectThread> tid=0xe5] Disconnecting old DistributedSystem to prepare for a reconnect attempt
> [info 2021/08/17 05:20:44.463 GMT system-test-gemfire-server-2 <ReconnectThread> tid=0xe5] GemFireCache[id = 252990056; isClosing = true; isShutDownAll = false; created = Tue Aug 17 05:11:50 GMT 2021; server = true; copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
> [info 2021/08/17 05:20:44.544 GMT system-test-gemfire-server-2 <ReconnectThread> tid=0xe5] Cache server on port 40404 is shutting down.
> [info 2021/08/17 05:20:44.565 GMT system-test-gemfire-server-2 <Queue Removal Thread> tid=0x5e] The QueueRemovalThread is done.
> {noformat}
>  
> The full set of logs and statistics from all member of the cluster can be found [here|https://drive.google.com/file/d/1jU_LIut9DVlZNniAdb53IglM3iWbX49C/view].
> Contrary to what the exception message states, it's worth noticing that *_all members within the cluster (including clients) are using the same Geode version._*
> —
> Below are some extra logs from when I was able to reproduce the issue with {{log-level=fine}} configured on all members:
> {noformat}
> #### LOCATOR-0
> [debug 2021/08/16 20:40:22.858 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P handshake remoteAddr is system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 (GEODE 1.15.0)
> [warn 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] P2P message reader@6673b475 timed out during a membership check.
> [fatal 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] Error deserializing P2P handshake message
> java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire or non-GemFire during handshake due to initial byte being 1
> 	at org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
> 	at org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
> 	at org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
> 	at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> 	at java.base/java.lang.Thread.run(Thread.java:829)
> [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] readMessages terminated id=/10.12.0.9:58373 from system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 isHandshakeReader=false
> [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message reader for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> tid=0x87] Stopping P2P message reader@6673b475 for system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000
> #### SERVER-1
> [warn 2021/08/16 20:40:22.698 GMT system-test-gemfire-server-1 <StatSampler> tid=0x35] Statistics sampling thread detected a wakeup delay of 29219 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
> ...
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] Starting thread for P2P handshake reader@6161c139-47
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake reader@6161c139-47> tid=0x33] Starting P2P handshake reader@6161c139-47 on Socket[addr=/10.12.0.9,port=58373,localport=57818]
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake reader@6161c139-47> tid=0x33] readMessages terminated id=/10.12.1.6:44219 from system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000 isHandshakeReader=false
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P handshake reader@6161c139-47> tid=0x33] Stopping P2P handshake reader@6161c139-47 for system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] ConnectionTable: created an ordered connection: system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)
> [debug 2021/08/16 20:40:23.059 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] Sending on these 1 connections: [system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)]
> [debug 2021/08/16 20:40:23.065 GMT system-test-gemfire-server-1 <ClientMembership Event Invoker1> tid=0x58] distribution of message aborted by shutdown: UpdateOperation(EntryEventImpl[op=CREATE;region=/_notificationRegion_10.12.1.6<v3>41000;key=org.apache.geode.management.internal.NotificationKey@9b5d432b;callbackArg=null;originRemote=false;originMember=system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000;callbacksInvoked;id=EventID[system-test-gemfire-server-1(system-test-gemfire-server-1)<v3>:41000;threadID=2;sequenceID=5]])
> {noformat}
> The full logs and statistics for all member of the cluster can be found [here|https://drive.google.com/file/d/1gAxLzxAYs70hBtrnXq6pI68K1IkAWpRt/view].
> I'm not familiar with this part of the code but, from the above logs, it looks like the same reader ({{reader@6161c139-47}}) is being used by two different threads at the same time?, is this expected?, might that be the cause of the exception we're seeing?.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)