You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@dubbo.apache.org by "beiwei30 (GitHub)" <gi...@apache.org> on 2019/01/17 06:39:07 UTC

[GitHub] [incubator-dubbo] beiwei30 commented on issue #3262: "No provider available in XXXX" RpcException after server IO Busy

## Version

dubbo-2.5.9。我们修改了Transporter的默认SPI,使用netty4:

```java
@SPI("netty4")
public interface Transporter{

}
```

## Issue

服务注册中心有服务的提供者信息,且提供者配置都正常,但服务消费者调用服务时,报"No provider available in XXXX"的RpcException。

## Analysis

### Step 1

通过gcore命令dump了jvm的内存,使用mat分析发现大量异常状态的连接,即com.alibaba.dubbo.remoting.transport.netty4.NettyClient中,closed=false但channel.ch.state=4。这就造成DubboInvoker在判断服务提供方是否可用时,都不可用。其判断顺序如下:

```java
/*
  com.alibaba.dubbo.rpc.cluster.support.AvailableCluster
*/
    public  Invoker join(Directory directory) throws RpcException {
        return new AbstractClusterInvoker(directory) {
            public Result doInvoke(Invocation invocation, List> invokers, LoadBalance loadbalance) throws RpcException {
                for (Invoker invoker : invokers) {
                    if (invoker.isAvailable()) {//ㄟ(▔▽▔)ㄏ
                        return invoker.invoke(invocation);
                    }
                }
                throw new RpcException("No provider available in " + invokers);
            }
        };
    }
```

```java
/*
  com.alibaba.dubbo.registry.integration.RegistryDirectory
*/
    public boolean isAvailable() {
        if (isDestroyed()) {
            return false;
        }
        Map> localUrlInvokerMap = urlInvokerMap;
        if (localUrlInvokerMap != null && localUrlInvokerMap.size() > 0) {
            for (Invoker invoker : new ArrayList>(localUrlInvokerMap.values())) {
                if (invoker.isAvailable()) {//ㄟ(▔▽▔)ㄏ
                    return true;
                }
            }
        }
        return false;
    }
```

```java
/*
  com.alibaba.dubbo.rpc.protocol.dubbo.DubboInvoker
*/
    public boolean isAvailable() {
        if (!super.isAvailable())
            return false;
        for (ExchangeClient client : clients) {
            if (client.isConnected() && !client.hasAttribute(Constants.CHANNEL_ATTRIBUTE_READONLY_KEY)) {//ㄟ(▔▽▔)ㄏ
                return true;
            }
        }
        return false;
    }
```

```java
/*
  com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
    public boolean isConnected() {
        return channel.isActive();//ㄟ(▔▽▔)ㄏ
    }
```

```java
/*
  io.netty.channel.socket.nio.NioSocketChannel
*/
    public boolean isActive() {
        java.nio.channels.SocketChannel ch = this.javaChannel();
        return ch.isOpen() && ch.isConnected();//ㄟ(▔▽▔)ㄏ
    }
```

```java
/*
  sun.nio.ch.SocketChannelImpl
*/
    public boolean isConnected() {
        synchronized(this.stateLock) {
            return this.state == 2;//ㄟ(▔▽▔)ㄏ
        }
    }
```

同时,这种异常连接因为closed=false也造成HeartBeatTask无法处理重连

```java
/*
  com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
    public void run() {
        try {
            long now = System.currentTimeMillis();
            for (Channel channel : channelProvider.getChannels()) {
                if (channel.isClosed()) {//ㄟ(▔▽▔)ㄏ
                    continue;
                }
                //SKIPPPP
        } catch (Throwable t) {
            logger.warn("Unhandled exception when heartbeat, cause: " + t.getMessage(), t);
        }
    }
```

### Step 2

分析服务器的nmon文件,发现出现问题前时刻DISKBUSY=100%,CPU=100%,NETWORK≈0。

### Step 3

进一步分析Dubbo的心跳重连机制:

> 如果channel上读取时间超过3个heartbeat(默认1分钟)未更新,消费者会重新建立NettyClient底层连接,重建连接会先吊用disconnect()关闭已有连接,并通过connect()建立新连接。

```java
/*
  com.alibaba.dubbo.remoting.exchange.support.header.HeartBeatTask
*/
    if (lastRead != null && now - lastRead > heartbeatTimeout) {
        logger.warn("Close channel " + channel
                + ", because heartbeat read idle time out: " + heartbeatTimeout + "ms");
        if (channel instanceof Client) {
            try {
                ((Client) channel).reconnect();//ㄟ(▔▽▔)ㄏ
            } catch (Exception e) {
                //do nothing
            }
        } else {
            channel.close();
        }
    }
```

```java
/*
  com.alibaba.dubbo.remoting.transport.AbstractClient
*/
    public void reconnect() throws RemotingException {
        disconnect();//ㄟ(▔▽▔)ㄏ
        connect();
    }

    public void disconnect() {
        connectLock.lock();
        try {
            destroyConnectStatusCheckCommand();
            try {
                Channel channel = getChannel();
                if (channel != null) {
                    channel.close();//ㄟ(▔▽▔)ㄏ
                }
            } catch (Throwable e) {
                logger.warn(e.getMessage(), e);
            }
            try {
                doDisConnect();
            } catch (Throwable e) {
                logger.warn(e.getMessage(), e);
            }
        } finally {
            connectLock.unlock();
        }
    }
```

```java
/*
  com.alibaba.dubbo.remoting.transport.netty4.NettyChannel
*/
    public void close() {
        try {
            super.close();
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            removeChannelIfDisconnected(channel);
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            attributes.clear();
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
        try {
            if (logger.isInfoEnabled()) {
                logger.info("Close netty channel " + channel);
            }
            channel.close();//ㄟ(▔▽▔)ㄏ
        } catch (Exception e) {
            logger.warn(e.getMessage(), e);
        }
    }
```

查看Netty4的disconnect()处理,与Netty3有较大的区别:

> Netty3中close()返回Futrue对象前发送OUT事件,OUT事件将channel的state状态直接置为ST_CLOSED。

```java
/*
  org.jboss.netty.channel.Channels
*/
    public static ChannelFuture close(Channel channel) {
        ChannelFuture future = channel.getCloseFuture();
        channel.getPipeline().sendDownstream(new DownstreamChannelStateEvent(channel, future, ChannelState.OPEN, Boolean.FALSE));
        return future;
    }
```

> Netty4中将OUT事件放到IO线程处理,并立即返回Future对象,此时判断channnel状态依然为可用,即channel.isActive()返回true。

```java
/*
  io.netty.channel.AbstractChannelHandlerContext
*/
    public ChannelFuture close(final ChannelPromise promise) {
        if (!this.validatePromise(promise, false)) {
            return promise;
        } else {
            final AbstractChannelHandlerContext next = this.findContextOutbound();
            EventExecutor executor = next.executor();
            if (executor.inEventLoop()) {
                next.invokeClose(promise);
            } else {
                safeExecute(executor, new OneTimeTask() {
                    public void run() {
                        next.invokeClose(promise);
                    }
                }, promise, (Object)null);
            }
            return promise;
        }
    }
```

## 初步猜测

`AbstractClient.reconnect()`期望的是断开连接后立刻重建连接,这种模式在Netty3下,不会出现逻辑问题。但在Netty4下,采用异步处理的模式,很容易出现`AbstractClient.disconnect()`-->`AbstractClient.connect()`-->`run(){next.invokeClose(promise)}`的执行顺序,特别是在服务器IO出现短暂异常时。

我们已经通过简单代码来模拟出现closed=false但channel.ch.state=4的现象,并且可以在测试环境通过限制服务器IO的方式重现场景。

[ Full content available at: https://github.com/apache/incubator-dubbo/issues/3262 ]
This message was relayed via gitbox.apache.org for notifications@dubbo.apache.org