You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ratis.apache.org by "Glen Geng (Jira)" <ji...@apache.org> on 2020/09/24 02:26:00 UTC

[jira] [Commented] (RATIS-1072) Should not shutdown and re-create channel/stub in GrpcServerProtocolClient when StreamObserver::onError() is called.

    [ https://issues.apache.org/jira/browse/RATIS-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201218#comment-17201218 ] 

Glen Geng commented on RATIS-1072:
----------------------------------

Please check reply from grpc community:

[grpc/grpc-java#7449 (comment)|https://github.com/grpc/grpc-java/issues/7449#issuecomment-697766246]
{quote}A channel can be used throughout the lifetime of the application. There shouldn't be a case that the application needs to shut down and recreate it.
{quote}

> Should not shutdown and re-create channel/stub in GrpcServerProtocolClient when StreamObserver::onError() is called.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: RATIS-1072
>                 URL: https://issues.apache.org/jira/browse/RATIS-1072
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major
>              Labels: MultiRaft, dense-storage
>             Fix For: 1.1.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> [link title|http://example.com/] 
> When InstallSnapshotResponseHandler::onError() or AppendLogResponseHandler::onError() is called, it will:
>  # call rpcService.getProxies().resetProxy(getFollowerId()).
>  # set appendLogRequestObserver to be null.
> The recovery steps for this BiDi streaming call is actually: 
>  # call ManagedChannel.shutdown()
>  # re-create the channel from NettyChannelBuilder
>  # re-create the async stub from the channel.
>  # re-create a StreamObserver<OurRequest> from the async stub.
>  
> *Above steps is un-necessary and dangerous.*
>  
> *It is un-necessary:*
> according to the [reply from grpc|[https://github.com/grpc/grpc-java/issues/7442]]]
> {quote}Yes, you don't need to recreate the channel (and stub) for starting a new RPC call. The overhead for creating a channel is big.
> Receiving {{StreamObserver.onError()}} only indicates that specific RPC has been terminated. The connection should be still there. So you can definitely start new RPCs on the existing connection.
> {quote}
>  
> *It is dangerous:*
> In multi-raft configuraiton, if two leaders sit in DN1, their followers sit in DN2, their steaming call will share the managed channel in the same GrpcServerProtocolClient that targeting from DN1 to DN2.
> If leader1 calls StreamObserver::onError(), it will shut down and re-create the underlying channel. In the same time, leader2's streaming call will be affected, and call StreamObserver::onError() to shut down and re-create the channel as well. The consequences is, the channel will be continually shutdown and re-create.
>  
> *trace from DN consisting of the leaders:*
> We can see that, the channel targetting 172.16.0.69 is continually shutdown and re-create in every 3s.
> {quote}2020-09-18 19:47:04,025 [grpc-default-executor-93] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134495, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:07,025 [grpc-default-executor-115] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134499, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:10,026 [grpc-default-executor-106] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134503, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:13,026 [grpc-default-executor-115] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134507, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:16,027 [grpc-default-executor-106] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134511, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:19,027 [grpc-default-executor-114] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134532, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:22,027 [grpc-default-executor-105] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134540, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:25,028 [grpc-default-executor-104] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134544, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:28,028 [grpc-default-executor-117] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134548, target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:31,028 [grpc-default-executor-114] WARN org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down connection: ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134552, target=172.16.90.69:9858}}. 
>   
> {quote}
> *trace from DN consisting of the followers:*
> Streams of different raft-group is continually affected. The (term, index) indicate that they belong to different raft-group, (t:9, i:8713), (t:7, i:12312),(t:3, i:213868),(t:10, i:12504), meanwhile raft peer id df173506-6978-47e5-8aef-034cafd221a6 means they are from leaders on the same DN.
> {quote}2020-09-18 19:51:49,498 [grpc-default-executor-113] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#320-t9, previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:52,499 [grpc-default-executor-104] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#608-t7, previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:52,499 [grpc-default-executor-109] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#1250591-t3, previous=(t:3, i:213868), leaderCommit=213869, initializing? false, entries: size=1, first=(t:3, i:213869), METADATAENTRY(c:213865): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:52,499 [grpc-default-executor-112] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#601-t10, previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:55,499 [grpc-default-executor-109] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#785-t10, previous=(t:10, i:6660), leaderCommit=6661, initializing? false, entries: size=1, first=(t:10, i:6661), METADATAENTRY(c:6660): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:55,499 [grpc-default-executor-104] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#701-t6, previous=(t:6, i:9305), leaderCommit=9306, initializing? false, entries: size=1, first=(t:6, i:9306), METADATAENTRY(c:9305): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:58,499 [grpc-default-executor-114] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#611-t7, previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:58,499 [grpc-default-executor-112] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#604-t10, previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>  2020-09-18 19:51:58,499 [grpc-default-executor-113] WARN org.apache.ratis.grpc.server.GrpcServerProtocolService: 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#324-t9, previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: cancelled before receiving half close
>   
> {quote}
> *The fix is simple* 
> We can reuse the the channel and stub, just need {{step 4}} to re-create the {{StreamObserver.}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)