Posted to dev@avro.apache.org by "Gareth Davis (JIRA)" <ji...@apache.org> on 2014/09/19 17:43:36 UTC

[jira] [Commented] (AVRO-1407) NettyTransceiver can cause a infinite loop when slow to connect

    [ https://issues.apache.org/jira/browse/AVRO-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140754#comment-14140754 ] 

Gareth Davis commented on AVRO-1407:
------------------------------------

10 months to respond doesn't seem too bad.... sorry.

The channel needs to be closed only on an exception, hence the catch Throwable. The core problem is that the constructor allocates resources that aren't reachable by the caller if the constructor fails.
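
To illustrate the pattern, here is a minimal sketch in plain Java. It is not the attached patch and uses no Netty or Avro classes: FakeChannel, FakeTransceiver and the lastAllocated hook are hypothetical names made up for this example. The constructor allocates a resource, and if anything goes wrong before construction completes, it closes that resource and rethrows, because the caller never receives a reference it could use to clean up.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical stand-in for a network channel; not a Netty class. */
final class FakeChannel implements Closeable {
    private boolean open = true;

    boolean isOpen() { return open; }

    @Override
    public void close() { open = false; }
}

/** Hypothetical stand-in for a transceiver that connects in its constructor. */
final class FakeTransceiver implements Closeable {
    // Demo-only hook so the example can inspect the channel after a failed constructor.
    static FakeChannel lastAllocated;

    private final FakeChannel channel;

    FakeTransceiver(boolean connectSucceeds) throws IOException {
        FakeChannel ch = new FakeChannel();
        lastAllocated = ch;
        try {
            connect(connectSucceeds);
        } catch (Throwable t) {
            // The caller never sees this object, so release the channel here
            // and rethrow the original failure unchanged.
            ch.close();
            throw t;
        }
        this.channel = ch;
    }

    // Simulates a connect attempt that does not complete within the timeout.
    private static void connect(boolean succeeds) throws IOException {
        if (!succeeds) {
            throw new IOException("connect timed out");
        }
    }

    @Override
    public void close() { channel.close(); }
}

final class ConstructorCleanupDemo {
    public static void main(String[] args) {
        try {
            new FakeTransceiver(false);
        } catch (IOException e) {
            System.out.println("constructor failed: " + e.getMessage());
        }
        System.out.println("channel still open: " + FakeTransceiver.lastAllocated.isOpen());
    }
}
```

On the success path the channel is left open and owned by the transceiver, so the cleanup only runs when the constructor is about to fail.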

> NettyTransceiver can cause a infinite loop when slow to connect
> ---------------------------------------------------------------
>
>                 Key: AVRO-1407
>                 URL: https://issues.apache.org/jira/browse/AVRO-1407
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.5, 1.7.6
>            Reporter: Gareth Davis
>         Attachments: AVRO-1407-1.patch
>
>
> When a new {{NettyTransceiver}} is created, it forces the channel to be allocated and connected to the remote host. It waits for up to connectTimeout ms on the [connect channel future|https://github.com/apache/avro/blob/1579ab1ac95731630af58fc303a07c9bf28541d6/lang/java/ipc/src/main/java/org/apache/avro/ipc/NettyTransceiver.java#L271]. This is obviously a good thing; the issue is that when the connect is unsuccessful, i.e. {{!channelFuture.isSuccess()}}, an exception is thrown and the call to the constructor fails with an {{IOException}}, but this can leave an active channel associated with the {{ChannelFactory}}.
> The problem is that a Netty {{NioClientSocketChannelFactory}} will not shut down while there are still active channels, so if you have supplied the {{ChannelFactory}} to the {{NettyTransceiver}} you will not be able to shut it down by calling {{ChannelFactory.releaseExternalResources()}}, as the [Flume Avro RPC client does|https://github.com/apache/flume/blob/b8cf789b8509b1e5be05dd0b0b16c5d9af9698ae/flume-ng-sdk/src/main/java/org/apache/flume/api/NettyAvroRpcClient.java#L158]. To recreate this you need a very laggy network, where the connect attempt takes longer than the connect timeout but does eventually succeed. This is very hard to organise in a test case, although I do have a test setup using Vagrant VMs that recreates it every time, using the Flume RPC client and server.
> The following stack is from a production system. It won't ever recover until the channel is disconnected (by forcing a disconnect at the remote host) or the JVM is restarted.
> {noformat:title=Production stack trace}
> "TLOG-0" daemon prio=10 tid=0x00007f581c7be800 nid=0x39a1 waiting on condition [0x00007f57ef9f2000]
>   java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   parking to wait for <0x00000007218b16e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
>   at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
>   at org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:103)
>   at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.releaseExternalResources(AbstractNioWorkerPool.java:80)
>   at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:181)
>   at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:142)
>   at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:101)
>   at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:564)
>   locked <0x00000006c30ae7b0> (a org.apache.flume.api.NettyAvroRpcClient)
>   at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
>   at org.apache.flume.api.LoadBalancingRpcClient.createClient(LoadBalancingRpcClient.java:214)
>   at org.apache.flume.api.LoadBalancingRpcClient.getClient(LoadBalancingRpcClient.java:205)
>   locked <0x00000006a97b18e8> (a org.apache.flume.api.LoadBalancingRpcClient)
>   at org.apache.flume.api.LoadBalancingRpcClient.appendBatch(LoadBalancingRpcClient.java:95)
>   at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:45)
>   at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:43)
> {noformat}
> The solution is very simple, and a patch should be along in a moment.


