Posted to issues@nifi.apache.org by "B O (JIRA)" <ji...@apache.org> on 2018/05/06 00:20:00 UTC

[jira] [Created] (NIFI-5157) ListSFTP for Massive Folders (without freezing)

B O created NIFI-5157:
-------------------------

             Summary: ListSFTP for Massive Folders (without freezing)
                 Key: NIFI-5157
                 URL: https://issues.apache.org/jira/browse/NIFI-5157
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework
    Affects Versions: 1.3.0
            Reporter: B O


Currently, if ListSFTP is used on a folder containing tens of millions of files and the Primary Node has only 32GB of RAM, creating that many flowfiles (say, more than 40 million) can freeze the ListSFTP threads, and the Primary Node then has to be restarted.

This happens when, say, another system sends files to your system and a backlog of tens of millions of files eventually builds up. Recursion won't help either, even if you separate the files by folder. Otherwise you'd need some sort of ControlRate-like processor that passes flowfiles into ListSFTP, so that ListSFTP knows when to get files instead of triggering on its own.

Also, NiFi seems to assume a stable environment. In unstable ones, where memory hardware failures, SFTP transmission problems, and internet outages happen, it becomes difficult to recover an ingest or to know where you left off (which would be useful for ListSFTP).

Batch processing usually requires a system to split the work into X files or folders at a time, sized to fit into the RAM of the Primary Node. We may need a feature like SQL's transactional "commit" and "rollback in case of error". There needs to be an efficient way for small systems to take in large volumes of data without crashing. If crashes are inevitable, there should be some sort of batch transaction that can tell you where it left off, so that you don't have to pull the same folder again but only, say, the files after File Age = some-number. I should be able to log in tomorrow and say "oh, my ingest totally collapsed, but at least I know roughly where it left off." This matters especially when WAL recovery is impossible because socket connection issues between nodes (or active site-to-site connections) cause some NiFi nodes to refuse to load or recover their state.
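
To make the "commit" idea concrete, here is a minimal Python sketch of a checkpointed batch listing. It is not NiFi code and does not talk to SFTP. It assumes the remote listing can be streamed as (modification time, file name) pairs, and every name in it (load_checkpoint, save_checkpoint, next_batch, run_one_batch) is hypothetical. Each run keeps at most batch_size entries in memory and only advances the checkpoint after the batch has been handled successfully.

```python
import heapq
import json
import os
import tempfile
from typing import Callable, Iterable, List, Optional, Tuple

# Assumed shape of one listing entry: (modification time in epoch seconds, file name).
Entry = Tuple[float, str]


def load_checkpoint(path: str) -> Optional[Entry]:
    """Return the last committed entry, or None if nothing was committed yet."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as fh:
        mtime, name = json.load(fh)
    return (mtime, name)


def save_checkpoint(path: str, entry: Entry) -> None:
    """Write to a temp file and rename it, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(list(entry), fh)
    os.replace(tmp_path, path)


def next_batch(entries: Iterable[Entry],
               checkpoint: Optional[Entry],
               batch_size: int) -> List[Entry]:
    """Stream the listing and keep only the batch_size oldest uncommitted entries."""
    remaining = (e for e in entries if checkpoint is None or e > checkpoint)
    return heapq.nsmallest(batch_size, remaining)


def run_one_batch(list_entries: Callable[[], Iterable[Entry]],
                  checkpoint_path: str,
                  batch_size: int,
                  handle_batch: Callable[[List[Entry]], None]) -> int:
    """Process one batch; the checkpoint advances only if handle_batch succeeds."""
    checkpoint = load_checkpoint(checkpoint_path)
    batch = next_batch(list_entries(), checkpoint, batch_size)
    if not batch:
        return 0
    handle_batch(batch)  # if this raises, the checkpoint stays where it was ("rollback")
    save_checkpoint(checkpoint_path, batch[-1])  # "commit"
    return len(batch)
```

Calling run_one_batch repeatedly walks through the backlog in order of (modification time, name), and after a crash the checkpoint file shows where the ingest left off. One limitation of this sketch: a file that arrives later with a modification time older than the checkpoint would be skipped.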

I would like to be able to customize ListSFTP so that it tracks its progress better, even when the NiFi cluster is recovering from a disaster. Perhaps ListSFTP could accept inputs that use the Expression Language for folder timestamps.

I always have to place a ControlRate after ListSFTP, but I can never control the rate within ListSFTP itself.
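
As a rough illustration of what rate control inside the listing step could look like, here is a small Python sketch (again, not NiFi code). The function name throttled and its parameters are hypothetical. It wraps any stream of listing entries and emits at most max_per_interval of them per interval_seconds, so the consumer never receives the whole backlog at once.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(entries: Iterable[T],
              max_per_interval: int,
              interval_seconds: float,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[T]:
    """Yield entries lazily, at most max_per_interval per interval_seconds."""
    if max_per_interval <= 0:
        raise ValueError("max_per_interval must be positive")
    window_start = clock()
    emitted = 0
    for entry in entries:
        if emitted >= max_per_interval:
            # The quota for this window is used up: wait out the rest of it.
            elapsed = clock() - window_start
            if elapsed < interval_seconds:
                sleep(interval_seconds - elapsed)
            window_start = clock()
            emitted = 0
        yield entry
        emitted += 1
```

Because this is a generator, the source listing is consumed only as fast as entries are released. The clock and sleep parameters exist only so the pacing can be exercised without real waiting.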



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)