Posted to dev@geode.apache.org by Mario Kevo <ma...@est.tech> on 2020/04/28 13:25:41 UTC

Handling packet drop between sites

Hi geode-dev,

I have a question about how Geode handles the case where some packets from a batch are dropped.
I created a Geode WAN setup with two sites and established replication between them. I also modified iptables to drop all packets that come to the receiver port.
In that case, some threads get stuck. It seems the gateway sender never receives any response back.
[warn 2020/04/27 13:19:04.667 CEST <ThreadsMonitor> tid=0x11] Thread 128 (0x80) is stuck

[warn 2020/04/27 13:19:04.669 CEST <ThreadsMonitor> tid=0x11] Thread <128> (0x80) that was executed at <27 Apr 2020 13:18:13 CEST> has been stuck for <50.997 seconds> and number of thread monitor iteration <1>
Thread Name <poolTimer-ny-27> state <RUNNABLE>
Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
java.net.PlainSocketImpl.socketConnect(Native Method)
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
java.net.Socket.connect(Socket.java:607)
org.apache.geode.distributed.internal.tcpserver.AdvancedSocketCreatorImpl.connect(AdvancedSocketCreatorImpl.java:102)
org.apache.geode.internal.net.SCAdvancedSocketCreator.connect(SCAdvancedSocketCreator.java:51)
org.apache.geode.distributed.internal.tcpserver.TcpSocketCreatorImpl.connect(TcpSocketCreatorImpl.java:59)
org.apache.geode.distributed.internal.tcpserver.ClientSocketCreatorImpl.connect(ClientSocketCreatorImpl.java:54)
org.apache.geode.cache.client.internal.ConnectionImpl.connect(ConnectionImpl.java:94)
org.apache.geode.cache.client.internal.ConnectionConnector.connectClientToServer(ConnectionConnector.java:75)
org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:118)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:206)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:216)
org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.borrowConnection(ConnectionManagerImpl.java:326)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:329)
org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:303)
org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:839)
org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1329)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:276)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

I also ran the same test with 200K entries while dropping 70% of packets. The same stuck-thread warning shows up again, and it takes approx. 40 min to transmit all entries to the other site.

How does Geode handle the loss of some packets from a batch? Has anyone tested this behavior?

Thanks,
Mario


Re: Handling packet drop between sites

Posted by Anthony Baker <ba...@vmware.com>.
TCP behaves really poorly in the face of significant packet loss.  You can look into tcp_retries1 and tcp_retries2 [1] for some explanations and tuning.  Eventually, TCP will give up attempting to deliver a packet but this may take up to 30min depending on configuration.  IIRC, it’s only at that point that the socket signals an error to the JVM.  On top of TCP, you can layer application protocols for liveness including timeouts, request/reply semantics, and periodic messaging.
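
For a rough sense of where a delay of many minutes can come from, here is a minimal sketch of the backoff arithmetic behind tcp_retries2. It assumes the Linux defaults of a 200 ms minimum and a 120 s maximum retransmission timeout, and a plain doubling of the timeout after every unanswered retransmission. The real timeout is derived from the measured round-trip time, so the result is only a hypothetical lower bound. The class and method names are made up for illustration.

```java
public class RetransmitBudget {
    // Assumed Linux defaults (TCP_RTO_MIN / TCP_RTO_MAX); not taken from this thread.
    static final long RTO_MIN_MS = 200;
    static final long RTO_MAX_MS = 120_000;

    // Sum of the waits for the initial transmission plus `retries`
    // retransmissions, doubling the timeout each time up to the cap.
    public static long totalMillis(int retries) {
        long rto = RTO_MIN_MS;
        long total = 0;
        for (int i = 0; i <= retries; i++) {
            total += rto;
            rto = Math.min(rto * 2, RTO_MAX_MS);
        }
        return total;
    }

    public static void main(String[] args) {
        // tcp_retries2 commonly defaults to 15
        System.out.println(totalMillis(15) / 1000.0 + " s");
    }
}
```

Under these assumptions the default of 15 retries adds up to 924.6 seconds, i.e. about 15 minutes before the socket reports an error; a higher starting timeout or a larger tcp_retries2 pushes that further out.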

I would expect that Geode should:

1) Not lose any batch events even if packets get dropped
2) Recover quickly when the network becomes stable again

When a batch is sent to a remote site, it is not dequeued from the sender until the destination site sends a response that the batch was delivered without error [2].
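
The "peek, send, remove on acknowledgement" idea can be sketched with a toy queue. This is not Geode's gateway sender code; AckedBatchQueue and its methods are hypothetical names, and the events are plain strings to keep the example self-contained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class AckedBatchQueue {
    private final Deque<String> events = new ArrayDeque<>();

    public void enqueue(String event) {
        events.addLast(event);
    }

    // Copy up to batchSize events from the head WITHOUT removing them,
    // so a lost batch can simply be peeked and sent again.
    public List<String> peekBatch(int batchSize) {
        List<String> batch = new ArrayList<>();
        for (String e : events) {
            if (batch.size() == batchSize) break;
            batch.add(e);
        }
        return batch;
    }

    // Remove the batch only once the remote side has acknowledged it.
    public void acknowledge(List<String> batch) {
        for (int i = 0; i < batch.size(); i++) {
            events.removeFirst();
        }
    }

    public int size() {
        return events.size();
    }

    public static void main(String[] args) {
        AckedBatchQueue queue = new AckedBatchQueue();
        queue.enqueue("a");
        queue.enqueue("b");
        queue.enqueue("c");
        List<String> batch = queue.peekBatch(2);
        System.out.println("in flight: " + batch + ", still queued: " + queue.size());
        queue.acknowledge(batch); // only now do the events leave the queue
        System.out.println("after ack, still queued: " + queue.size());
    }
}
```

If no acknowledgement ever arrives, acknowledge is never called and the next peekBatch returns the same events, which is why dropped packets should delay delivery rather than lose events.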

Note also that the log message below does not strictly indicate a hang; the thread could just be making progress slowly.

HTH and looking forward to the results of your investigations.

Anthony


[1] https://linux.die.net/man/7/tcp
[2] There is a corner case if the destination is over the critical threshold

