Posted to commits@nifi.apache.org by "Aldrin Piri (JIRA)" <ji...@apache.org> on 2015/04/11 00:29:12 UTC

[jira] [Commented] (NIFI-512) Allow GetFile to pull in data without deleting the local file

    [ https://issues.apache.org/jira/browse/NIFI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490472#comment-14490472 ] 

Aldrin Piri commented on NIFI-512:
----------------------------------

The hidden challenge here is the state this introduces in a clustered environment, and what happens to that state if the primary node goes offline and another node takes over. The approach sounds good overall, but I'm not sure whether that scenario is an edge case or a significant consideration.

> Allow GetFile to pull in data without deleting the local file
> -------------------------------------------------------------
>
>                 Key: NIFI-512
>                 URL: https://issues.apache.org/jira/browse/NIFI-512
>             Project: Apache NiFi
>          Issue Type: Task
>          Components: Extensions
>            Reporter: Mark Payne
>
> There have been several people asking for this capability. Currently, when we do a file listing, it's placed into a HashSet, so there is no ordering for how we pull the files in. My proposal is that we instead order the files such that we pull the oldest file first and keep track of the latest timestamp that we've pulled in. This way on restart we can resume where we left off.
> I would create a FileOutputStream and keep it open. Write out the timestamp each time we pull data in. Then periodically flush the data to disk. Perhaps every second or so - maybe this should be configurable. We need a tradeoff between how much possible duplication we get and how much time we spend persisting the timestamp.
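
For illustration only, the proposal quoted above could be sketched roughly as follows. This is not the GetFile implementation: the class name OldestFirstTracker, its methods, and the flushIntervalMillis parameter are all hypothetical, and a RandomAccessFile that overwrites a single 8-byte value stands in for the FileOutputStream mentioned in the description.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: pull files oldest-first and persist the newest timestamp seen.
public class OldestFirstTracker implements AutoCloseable {
    private final RandomAccessFile state;   // kept open for the life of the tracker
    private final long flushIntervalMillis; // assumed configurable sync interval
    private long latestTimestamp;           // newest lastModified pulled so far
    private long lastSyncMillis;

    public OldestFirstTracker(File stateFile, long flushIntervalMillis) throws IOException {
        this.state = new RandomAccessFile(stateFile, "rw");
        this.flushIntervalMillis = flushIntervalMillis;
        // On restart, resume from the persisted timestamp if one exists.
        this.latestTimestamp = state.length() >= Long.BYTES ? state.readLong() : Long.MIN_VALUE;
        this.lastSyncMillis = System.currentTimeMillis();
    }

    // Returns files strictly newer than the last pulled timestamp, oldest first.
    public List<File> listNewFiles(File directory) {
        File[] found = directory.listFiles(File::isFile);
        List<File> result = new ArrayList<>();
        if (found != null) {
            for (File f : found) {
                if (f.lastModified() > latestTimestamp) {
                    result.add(f);
                }
            }
        }
        result.sort(Comparator.comparingLong(File::lastModified));
        return result;
    }

    // Records that a file was pulled; forces the value to disk only periodically.
    public void markPulled(File file) throws IOException {
        latestTimestamp = Math.max(latestTimestamp, file.lastModified());
        state.seek(0);
        state.writeLong(latestTimestamp);
        long now = System.currentTimeMillis();
        if (now - lastSyncMillis >= flushIntervalMillis) {
            state.getFD().sync();
            lastSyncMillis = now;
        }
    }

    public long latestTimestamp() {
        return latestTimestamp;
    }

    @Override
    public void close() throws IOException {
        state.getFD().sync();
        state.close();
    }
}
```

A longer sync interval means fewer forced writes but more files that may be pulled again after a crash, which is the tradeoff the description mentions. The sketch also skips any file whose timestamp equals the last one recorded, and it keeps its state on a single node's local disk, so it does not address the clustered case raised in the comment.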



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)