Posted to dev@nifi.apache.org by Mark Bean <ma...@gmail.com> on 2021/07/08 18:04:14 UTC

Odd behavior for site-to-site

We're seeing some odd behavior using site-to-site. The input port on a
3-node cluster will eventually stop receiving new data. In the log, I see
the following:

2021-07-08 13:13:14,010 ERROR [NiFi Web Server-43017]
o.a.nifi.web.api.ApplicationResource Exception detail:
org.apache.nifi.processor.exception.ProcessException:
java.lang.InterruptedException
        at
org.apache.nifi.remote.StandardPublicPort.receiveFlowFiles(StandardPublicPort.java:588)
        at
org.apache.nifi.web.api.DataTransferResource.receiveFlowFiles(DataTransferResource.java:277)
...

Then many more similar messages:
2021-07-08 13:13:14,015 ERROR [NiFi Web Server-47691]
o.a.nifi.web.api.ApplicationResource Exception detail:
org.apache.nifi.processor.exception.ProcessException:
org.apache.nifi.processor.exception.ProcessException: Interrupted while
waiting for site-to-site request to be serviced
        at
org.apache.nifi.remote.StandardPublicPort.receiveFlowFiles(StandardPublicPort.java:588)
        at
org.apache.nifi.web.api.DataTransferResource.receiveFlowFiles(DataTransferResource.java:277)
...

It's unclear what is causing the exception (possibly some network
instability), but the only way we have been able to get data flowing again
is to restart the NiFi node. Even more concerning, when NiFi is restarted
there are many thousands of messages like this:

2021-07-08 15:29:12,097 INFO [main] o.a.n.c.repository.FileSystemRepository
Found unknown file /cont_repo/content/336/1625700387433-161104 (1333153
bytes) in File System Repository; removing file

I suspect the failed site-to-site transfer finished writing data (content)
to disk, but was interrupted before creating a FlowFile and committing the
Process Session. If this is true, the content repository could fill with
orphaned data that is not cleaned up until NiFi is restarted.
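
To get a feel for how much content is being orphaned, the "Found unknown
file" messages can be tallied from the application log after a restart. The
following is only a rough sketch: the class name OrphanedContentTally and its
methods are made up for illustration, and it assumes each log entry sits on a
single line in the real log file (the line above is wrapped by the mail
client).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NiFi.
public class OrphanedContentTally {

    // Matches the message text quoted in this thread. The first group is the
    // file path, the second is the size in bytes.
    private static final Pattern UNKNOWN_FILE = Pattern.compile(
        "Found unknown file (\\S+) \\((\\d+) bytes\\) in File System Repository; removing file");

    // Returns {number of matching messages, total bytes reported}.
    public static long[] tally(List<String> logLines) {
        long count = 0;
        long bytes = 0;
        for (String line : logLines) {
            Matcher m = UNKNOWN_FILE.matcher(line);
            if (m.find()) {
                count++;
                bytes += Long.parseLong(m.group(2));
            }
        }
        return new long[] {count, bytes};
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: OrphanedContentTally <path-to-log-file>");
            return;
        }
        // Reads the whole file into memory; fine for a quick check, but a
        // very large log would be better streamed line by line.
        long[] result = tally(Files.readAllLines(Paths.get(args[0])));
        System.out.println(result[0] + " orphaned files, " + result[1] + " bytes");
    }
}
```

Running this against the log from each restart would show whether the amount
of orphaned content grows with the time the port has been stuck.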

I'm looking for someone with detailed knowledge of the internals of
site-to-site to comment on this issue - either the hard stop on receiving
additional data via site-to-site, or the orphaned content.

NiFi Version: 1.12.1

Thanks,
Mark

Re: Odd behavior for site-to-site

Posted by Mark Bean <ma...@gmail.com>.

I started looking at the Remote Process Group side of the connection. It is reporting "Pipe closed", and the site-to-site connection seems to stay in this state permanently. I think this type of error should trigger a reconnect to the remote port, but I'm not sure where the best place to do that is.

Would this be in the StandardRemoteGroupPort? Or maybe within the transaction itself (AbstractTransaction)? I want to avoid recreating the connection when it isn't required, since that could hurt performance.
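
A minimal sketch of the "reconnect only after a failure" idea, to make the trade-off concrete. Nothing here is NiFi code: ReconnectOnFailure and its Connection interface are hypothetical stand-ins, and where this logic would actually belong (StandardRemoteGroupPort, AbstractTransaction, or elsewhere) is still the open question. The sketch is also not thread-safe.

```java
import java.io.IOException;
import java.util.function.Supplier;

// Hypothetical sketch; none of these types are NiFi classes.
public class ReconnectOnFailure {

    // Stand-in for whatever object represents the site-to-site connection.
    public interface Connection {
        void send(byte[] data) throws IOException;
        void close();
    }

    private final Supplier<Connection> factory;
    private Connection current;
    private int connectionsOpened;

    public ReconnectOnFailure(Supplier<Connection> factory) {
        this.factory = factory;
    }

    public void send(byte[] data) throws IOException {
        // Reuse the existing connection while it is healthy, so the normal
        // path never pays the cost of reconnecting.
        if (current == null) {
            current = factory.get();
            connectionsOpened++;
        }
        try {
            current.send(data);
        } catch (IOException e) {
            // An I/O failure such as "Pipe closed" marks the connection as
            // unusable. Drop it so the next call opens a fresh one, and let
            // the caller decide whether to retry this transfer.
            current.close();
            current = null;
            throw e;
        }
    }

    public int connectionsOpened() {
        return connectionsOpened;
    }
}
```

With this shape a new connection is created only after an I/O error, so steady-state throughput should not be affected. A real change would probably also need some backoff so that a remote port that stays down is not hammered with reconnect attempts.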

Thanks,
Mark
