Posted to issues@nifi.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/11/29 09:32:00 UTC

[jira] [Commented] (NIFI-5406) Add new listing strategy by tracking listed entities to ListXXXX processors

    [ https://issues.apache.org/jira/browse/NIFI-5406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702923#comment-16702923 ] 

ASF subversion and git services commented on NIFI-5406:
-------------------------------------------------------

Commit 30f2f4205121113c26bb00ac5a8697dffaeb8206 in nifi's branch refs/heads/master from [~ijokarumawak]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=30f2f42 ]

NIFI-5849: ListXXX can lose cluster state on processor restart

NIFI-5406 introduced the issue by trying to use the resetState variable for
different purposes. AbstractListProcessor should have had a different variable
to control whether to clear state for tracking entity strategy.

Signed-off-by: Pierre Villard <pi...@gmail.com>

This closes #3189.


> Add new listing strategy by tracking listed entities to ListXXXX processors
> ---------------------------------------------------------------------------
>
>                 Key: NIFI-5406
>                 URL: https://issues.apache.org/jira/browse/NIFI-5406
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>            Priority: Major
>             Fix For: 1.8.0
>
>
> The current implementation of the List processors (ListFile, ListFTP, ListSFTP, etc.) relies on a file's last-modified timestamp to pick up new or updated files. This approach is efficient and lightweight in terms of state management, because it only tracks the latest modified timestamp and the last executed timestamp. However, timestamps do not work as expected on some file systems, causing List processors to miss files periodically. See the NIFI-3332 comments for details.
> In order to pick up every entity that has not been seen before, or that has been updated since it was last seen, we need another set of processors that uses a different approach, namely tracking listed entities:
>  * Add a new abstract processor, AbstractWatchEntries, similar to AbstractListProcessor but using a different approach
>  * Target entities have: name (path), size and last-modified-timestamp
>  * Implementation processors have the following properties:
>  ** 'Watch Time Window' to limit the maximum time period for which already-listed entities are held. E.g. if set to '30min', the processor keeps entities listed in the last 30 minutes.
>  ** 'Minimum File Age' to defer listing entities that are potentially still being written
>  * Any entity that was added but never listed, and whose last-modified-timestamp is older than the configured 'Watch Time Window', will not be listed. If users need to pick up these items, they have to make 'Watch Time Window' longer, which also increases the size of the data the processor has to persist in the K/V store. This is an efficiency vs. reliability trade-off.
>  * The already-listed entities are persisted into one of the supported K/V stores through the DistributedMapCacheClient service. Users can choose which KVS to use from HBase, Redis, Couchbase and File (DistributedMapCacheServer with a persistence file).
>  * The reason to use a KVS instead of ManagedState is to avoid hammering ZooKeeper by frequently updating a ZK node with a large amount of data. The number of already-listed entities can be huge depending on the use case. Also, we can compress entities with DistributedMapCacheClient because it supports putting byte arrays, while ManagedState only supports Map<String, String>.
>  * On each onTrigger:
>  ** The processor performs a listing. Listed entities meeting any of the following conditions will be written to the 'success' output FlowFile:
>  *** Does not exist in the already-listed entities
>  *** Has a newer last-modified-timestamp
>  *** Has a different size
>  ** Already-listed entities that are older than the 'Watch Time Window' are discarded from the already-listed entities.
>  * The initial supported targets are the local file system, FTP and SFTP
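
The per-onTrigger behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code that was committed to NiFi: the EntityTracker and Entity names are hypothetical, an in-memory HashMap stands in for the K/V store behind DistributedMapCacheClient, and the 'Watch Time Window' is assumed to be measured against each entity's last-modified timestamp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the entity-tracking listing step; not NiFi's actual API. */
final class EntityTracker {

    /** A listed entity: name (path), size and last-modified timestamp in millis. */
    static final class Entity {
        final String name;
        final long size;
        final long lastModified;

        Entity(String name, long size, long lastModified) {
            this.name = name;
            this.size = size;
            this.lastModified = lastModified;
        }
    }

    // Assumption: an in-memory map stands in for the external K/V store.
    private final Map<String, Entity> alreadyListed = new HashMap<>();
    private final long watchWindowMillis;

    EntityTracker(long watchWindowMillis) {
        this.watchWindowMillis = watchWindowMillis;
    }

    /** Returns the entities that would be written to 'success' for this listing. */
    List<Entity> onTrigger(List<Entity> listing, long nowMillis) {
        // Assumption: the window is compared against last-modified timestamps.
        long oldestTracked = nowMillis - watchWindowMillis;
        List<Entity> toEmit = new ArrayList<>();
        for (Entity e : listing) {
            // Entities older than the watch window are never picked up.
            if (e.lastModified < oldestTracked) {
                continue;
            }
            Entity seen = alreadyListed.get(e.name);
            boolean isNew = seen == null;
            boolean isUpdated = !isNew
                    && (e.lastModified > seen.lastModified || e.size != seen.size);
            if (isNew || isUpdated) {
                toEmit.add(e);
                alreadyListed.put(e.name, e);
            }
        }
        // Discard tracked entities that have fallen out of the watch window.
        alreadyListed.values().removeIf(t -> t.lastModified < oldestTracked);
        return toEmit;
    }

    /** Number of entities currently held in the tracking store. */
    int trackedCount() {
        return alreadyListed.size();
    }
}
```

In this sketch a longer window means more entries stay in the map, which mirrors the efficiency vs. reliability trade-off noted in the description.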



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)