Posted to issues@nifi.apache.org by "Sivaprasanna Sethuraman (JIRA)" <ji...@apache.org> on 2018/03/11 13:16:00 UTC

[jira] [Commented] (NIFI-2853) Improve ListHDFS state tracking

    [ https://issues.apache.org/jira/browse/NIFI-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394497#comment-16394497 ] 

Sivaprasanna Sethuraman commented on NIFI-2853:
-----------------------------------------------

Although this is not critical, I believe it is an important feature. I also think the state keys should include not just the root-level directory name appended to "listing.timestamp" and "emitted.timestamp" but also the subdirectories, for example "listing.timestamp.dir1.subdir2" and "listing.timestamp.dir1.subdir3.subdir3_1", to avoid edge cases. Otherwise, files might not get picked up in some scenarios. For example:
 # Create a directory "/tmp/sub-dir1"
 # Create a file "file1.txt" under "/tmp/sub-dir1"
 # Create a couple of files under "/tmp"
 # Create another file "file2.txt" under "/tmp/sub-dir1"

Now set the ListHDFS "Directory" property to /tmp/sub-dir1 and run the flow. The processor will set the stored timestamp to that of the most recently modified file, which is "/tmp/sub-dir1/file2.txt". Now change the ListHDFS directory to "/tmp". It won't pull in the files that were created in step 3, because their modification times are earlier than the timestamp stored as part of the processor's state. This would not happen with the approach described above. Thoughts?
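
The scenario above can be reproduced with a small simulation. This is only a sketch, not NiFi code: the file names under "/tmp", the integer modification times, and the helper functions (under, state_key, list_with_single_key, list_with_directory_keys) are all made up for illustration, and the real ListHDFS logic (which also tracks "emitted.timestamp") is more involved. It compares a single "listing.timestamp" key with one key per directory, where the key is derived from each file's parent directory.

```python
import posixpath
from typing import Dict, List, Tuple

# Hypothetical (path, modification time) pairs mirroring steps 1-4 above.
FILES: List[Tuple[str, int]] = [
    ("/tmp/sub-dir1/file1.txt", 100),  # step 2
    ("/tmp/a.txt", 200),               # step 3
    ("/tmp/b.txt", 201),               # step 3
    ("/tmp/sub-dir1/file2.txt", 300),  # step 4
]


def under(directory: str, path: str) -> bool:
    """True if path lies inside directory (recursive listing)."""
    return path.startswith(directory.rstrip("/") + "/")


def list_with_single_key(state: Dict[str, int], directory: str) -> List[str]:
    """Simplified model of the current behaviour: one global timestamp."""
    last = state.get("listing.timestamp", -1)
    found = [p for p, m in FILES if under(directory, p) and m > last]
    newest = max((m for p, m in FILES if under(directory, p)), default=last)
    state["listing.timestamp"] = max(last, newest)
    return found


def state_key(directory: str) -> str:
    """Assumed key scheme: '/tmp/sub-dir1' -> 'listing.timestamp.tmp.sub-dir1'."""
    parts = [part for part in directory.split("/") if part]
    return ".".join(["listing.timestamp"] + parts)


def list_with_directory_keys(state: Dict[str, int], directory: str) -> List[str]:
    """Proposed behaviour: one timestamp per directory that contains files."""
    found: List[str] = []
    newest: Dict[str, int] = {}
    for path, mtime in FILES:
        if not under(directory, path):
            continue
        key = state_key(posixpath.dirname(path))
        # Compare against the state from before this run, not the running maximum.
        if mtime > state.get(key, -1):
            found.append(path)
        newest[key] = max(newest.get(key, -1), mtime)
    for key, mtime in newest.items():
        state[key] = max(state.get(key, -1), mtime)
    return found


if __name__ == "__main__":
    single: Dict[str, int] = {}
    list_with_single_key(single, "/tmp/sub-dir1")
    print("single key, after switching to /tmp:", list_with_single_key(single, "/tmp"))

    per_dir: Dict[str, int] = {}
    list_with_directory_keys(per_dir, "/tmp/sub-dir1")
    print("per-directory keys, after switching to /tmp:",
          list_with_directory_keys(per_dir, "/tmp"))
```

With the single key, the second listing of "/tmp" returns nothing, because every file there is older than the stored timestamp. With per-directory keys, the two files directly under "/tmp" are picked up and the files in "/tmp/sub-dir1" are not listed again. One caveat of this assumed key scheme is that directory names containing "." would produce ambiguous keys, so a real implementation would probably need a different separator or some escaping.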

> Improve ListHDFS state tracking
> -------------------------------
>
>                 Key: NIFI-2853
>                 URL: https://issues.apache.org/jira/browse/NIFI-2853
>             Project: Apache NiFi
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Bryan Bende
>            Priority: Minor
>
> Currently ListHDFS tracks two properties in state management, "listing.timestamp" and "emitted.timestamp". In the 1.0.0 release, the directory property now supports expression language which means the directory being listed could dynamically change on any execution of the processor. 
> The processor should be changed to store state specific to the directory that was listed, for example "listing.timestamp.dir1" and "emitted.timestamp.dir1".
> This would also help in a clustered scenario... currently ListHDFS has to be run on the primary node only, otherwise each node will overwrite the others' state and produce unexpected results. With the above improvement, if the directory evaluated to a unique path for each node, it would store the state of each of those paths separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)