Posted to issues@nifi.apache.org by "Jon Shoemaker (Jira)" <ji...@apache.org> on 2022/08/23 15:58:00 UTC

[jira] [Commented] (NIFI-9878) DistributedCacheMap Handshake failure, processor hang indefinitely.

    [ https://issues.apache.org/jira/browse/NIFI-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17583714#comment-17583714 ] 

Jon Shoemaker commented on NIFI-9878:
-------------------------------------

Experiencing the same issue. In our scenario it works correctly most of the time, but occasionally the handshake response is never received and the processor thread hangs until the processor is terminated. Some of these stuck threads appear when the DistributedCacheServer is restarted.

> DistributedCacheMap Handshake failure, processor hang indefinitely.
> -------------------------------------------------------------------
>
>                 Key: NIFI-9878
>                 URL: https://issues.apache.org/jira/browse/NIFI-9878
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.15.3
>            Reporter: Aaron Rich
>            Priority: Major
>              Labels: Handshake, distributed_cache
>         Attachments: image-2022-04-05-21-54-31-002.png, image-2022-04-05-21-55-16-221.png
>
>
> When a DistributedCacheMapClient connects to a DistributedCacheMapServer but never receives the handshake response, the PutDistributedCacheMap processor will hang indefinitely. The handshake never times out.
> A situation like this can arise if a proxy allows the TCP connection to be established between client and server but fails to deliver handshake data to/from the DistributedCacheMapServer (for example, an unstable Istio service mesh between the two). It could also happen if a client were accidentally misconfigured to point to the wrong TCP endpoint (one that isn't hosting a DistributedCacheMapServer).
> Steps to recreate:
> 1) Set up a PutDistributedCacheMap processor with a DistributedMapCacheClientService
> 2) Configure the DistributedMapCacheClientService to point to a TCP server that is not a DistributedCacheMapServer (nc -lk 127.0.0.1 4457). This simulates a situation where the socket connection can be made but the server sends no handshake response (for example, the server is in a bad state and unable to respond, a proxy is misbehaving, etc).
> 3) Use GenerateFlowFile to trigger the PutDistributedCacheMap processor.
> 4) The processor will hang with no failure or success. It will have to be force terminated.
> !image-2022-04-05-21-54-31-002.png!
> !image-2022-04-05-21-55-16-221.png!
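
The reproduction steps above can be approximated outside NiFi. The following is a minimal sketch, not NiFi code: the class name `SilentServerDemo` and the method `receivesReply` are hypothetical, and a plain `ServerSocket` that never writes stands in for the `nc -lk` listener. It shows that the TCP connection succeeds while no application-level reply ever arrives, and that only an explicit read timeout keeps the client from blocking forever.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; nothing here is part of NiFi.
class SilentServerDemo {

    /**
     * Connects to host:port, then waits up to readTimeoutMillis for the first
     * byte of a reply. Returns true if a byte arrived, false if the read timed
     * out or the peer closed the connection without sending anything.
     */
    static boolean receivesReply(String host, int port,
                                 int connectTimeoutMillis, int readTimeoutMillis) throws IOException {
        try (Socket socket = new Socket()) {
            // This timeout only bounds TCP connection establishment.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            // Without this, the read below would block indefinitely.
            socket.setSoTimeout(readTimeoutMillis);
            InputStream in = socket.getInputStream();
            try {
                return in.read() != -1;
            } catch (SocketTimeoutException e) {
                return false;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Listening socket that never accepts or writes: the OS still completes
        // the TCP handshake, so the client's connect() succeeds.
        try (ServerSocket silent = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            boolean replied = receivesReply("127.0.0.1", silent.getLocalPort(), 1000, 500);
            System.out.println("reply received: " + replied);
        }
    }
}
```

The connect timeout and the read timeout are set separately on purpose: the first passes even against the silent server, which mirrors the behavior described in the report.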
> The hang occurs at:
> CacheClientRequestHandler.java:92: handshakeHandler.waitHandshakeComplete();
>  
> Currently, the "connection timeout" parameter is only used to time out the establishment of the TCP socket connection, not the full application-layer connection.
> Suggestion:
> The handshake should have a timeout too, so that the client is robust to a network outage in which the TCP connection can be created but the handshake data can't be exchanged. Because the processor hangs, there is currently no way to handle this error in a dataflow.
>  
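
One way the suggestion above could look is a bounded wait on handshake completion. This is a minimal sketch under stated assumptions, not the actual NiFi implementation: the class `TimedHandshake` and the method `markComplete` are hypothetical, and only the method name `waitHandshakeComplete` is taken from the report (where it is called without arguments). A `CountDownLatch` stands in for whatever signalling the real handler uses.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a handshake handler; not NiFi's class.
class TimedHandshake {
    private final CountDownLatch completed = new CountDownLatch(1);

    /** Called by the I/O thread once the server's handshake response has been processed. */
    void markComplete() {
        completed.countDown();
    }

    /**
     * Blocks until the handshake completes, or throws TimeoutException once
     * the given timeout elapses, so the caller can route the failure.
     */
    void waitHandshakeComplete(long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        if (!completed.await(timeout, unit)) {
            throw new TimeoutException("Handshake not completed within " + timeout + " " + unit);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedHandshake handshake = new TimedHandshake();
        try {
            // Nothing ever calls markComplete(), as with a silent server.
            handshake.waitHandshakeComplete(200, TimeUnit.MILLISECONDS);
            System.out.println("handshake complete");
        } catch (TimeoutException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

With a bounded wait, a processor could catch the exception and send the flow file to a failure relationship instead of requiring a forced termination.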



--
This message was sent by Atlassian Jira
(v8.20.10#820010)