Posted to users@nifi.apache.org by Tomislav Novosel <to...@clearpeaks.com> on 2021/02/02 15:56:38 UTC

Monitoring big directory tree

Hi guys,

I have the following situation:

There is an SMB-mounted folder on one NiFi worker, and it has many subfolders with further subfolders (the depth of nesting is not known in advance).
Whenever a new file appears in that directory tree, or an existing or old file is copied or moved, its modification timestamp changes. This is achieved
with some other tools, configs, etc.

What is the best way to list new/updated/new-old files with NiFi, taking into account that there is an enormous number of subfolders and files in them (let's say millions)?

In case of ListFile and using 'Tracking Entities' I am concerned about the following:


  *   How big is the I/O load if ListFile constantly checks the directory tree and all files?
It can be CRON-driven so that it does not run all the time, but if it is not, how does ListFile do that in the background?
  *   How big is the cache of listed entities, and what is actually stored in the cache: just metadata, or something more?
  *   What if the cache is not persisted and a NiFi restart occurs? Will the cache be inconsistent?
If it is persisted, what if a restart occurs at the moment when ListFile is checking new/old entities and their size, name, etc.?
What is the interval for persisting the cache? Is it related to the snapshots NiFi takes at configured intervals?

What is the best and most efficient way to do this? Maybe use some extra tools or engines to find the differences and to
persist the last known state, like Elastic or maybe some DB?
Or construct the list of paths that need to be fetched using some Python scripts?
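
To make the scripted option concrete, here is a rough sketch of what I have in mind, not a tested solution. It walks the tree, compares each file's modification time and size against the state saved by the previous run, and returns only the paths that changed. The function name scan_changed, the SQLite state file and the table name seen are just placeholders I made up for illustration. Note that every run still has to stat every file over the SMB mount, so this does not remove the I/O cost; it only moves the last known state out of NiFi.

```python
import os
import sqlite3


def scan_changed(root, db_path):
    """Return sorted paths under root whose mtime or size differs from the saved state.

    Hypothetical helper: the state lives in a local SQLite file at db_path.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER)"
    )
    changed = []
    stack = [root]  # explicit stack, because the nesting depth is unknown
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                    continue
                if not entry.is_file(follow_symlinks=False):
                    continue
                st = entry.stat(follow_symlinks=False)
                row = conn.execute(
                    "SELECT mtime_ns, size FROM seen WHERE path = ?",
                    (entry.path,),
                ).fetchone()
                # row is None for a file that has never been seen before
                if row != (st.st_mtime_ns, st.st_size):
                    changed.append(entry.path)
                    conn.execute(
                        "INSERT OR REPLACE INTO seen VALUES (?, ?, ?)",
                        (entry.path, st.st_mtime_ns, st.st_size),
                    )
    conn.commit()
    conn.close()
    return sorted(changed)
```

The returned paths could then be handed to the flow for fetching (for example to FetchFile), but that part is only an idea at this point. Deleted files are not handled in this sketch.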

The constraint here is the shared (mounted) folders. Even if the modification date changes for every new/updated file,
how can I efficiently monitor a big directory tree, or efficiently trigger ListFile (the NiFi flow) to fetch new/old-new files?

In the case of 'Tracking Entities', maybe having a separate standalone NiFi instance on a separate server, with a configured CacheServer
to serve as the cache, is not a bad idea?

Thanks in advance,

Tom