Posted to users@nifi.apache.org by Ed B <bd...@gmail.com> on 2018/08/05 04:21:27 UTC

Re: Question on ingesting HDFS batches

Hi Sudhindra,

I think the current implementation of ListHDFS already gives you the
functionality you need.
I'll assume for the moment that your "success" markers are just ordinary
files with the same (or part of the same) name as a data file, plus an
extension such as "*.fin", "*.done" or "*.success".
ListHDFS has a regex filter on the file name, so setting it to
"^.+\.success$" will return only the new files (since the last listing)
with the extension "*.success" (e.g. 201808050012.success).
If you schedule the ListHDFS processor to run daily (using a timer of 1 day,
or a cron expression for a specific hour), it will wake up only once a day
and find all the success files for that day. Your flow can then locate the
data file for each success marker and upload it to S3 using the PutS3Object
processor.
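
To illustrate the file-name filter outside NiFi, here is a small Python
sketch (not NiFi code). It applies the same regex to a made-up listing and
derives the data-file name for each marker. The sample names and the helpers
data_file_for and ready_batches are hypothetical, and it assumes a marker
shares its base name with its data file.

```python
import re

# Same pattern as suggested for the ListHDFS file filter.
SUCCESS_MARKER = re.compile(r"^.+\.success$")


def data_file_for(marker_name):
    """Return the data-file name for a marker, assuming a shared base name."""
    return marker_name[: -len(".success")]


def ready_batches(listing):
    """Return data-file names whose success marker appears in the listing."""
    return [data_file_for(name) for name in listing if SUCCESS_MARKER.match(name)]


# Hypothetical directory listing: only one batch has a marker.
listing = ["201808050012", "201808050012.success", "201808050013", "notes.txt"]
print(ready_batches(listing))
```

For the scheduling part, NiFi's "CRON driven" strategy takes a Quartz-style
expression; for example, 0 0 1 * * ? would fire at 01:00 every day (the hour
here is only an example).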

Directories are another story.
If you need a listing of directories, you could use GetHDFSFileInfo (it can
work recursively and has separate filters for directories and for files). But
this processor doesn't maintain state, so you will need to maintain it
yourself (in ZooKeeper, HBase, or even a distributed map cache).
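
As a rough idea of what "maintaining state yourself" means, the Python
sketch below (again, not NiFi code) remembers which directories were already
handled and returns only the new ones on each run. The in-memory set is just
a stand-in for whatever external store you pick, and the function name and
paths are hypothetical.

```python
def new_directories(listed, seen):
    """Return directories not handled before and record them in `seen`."""
    fresh = sorted(set(listed) - seen)
    seen.update(fresh)
    return fresh


# Stand-in for ZooKeeper, HBase, or a distributed map cache.
seen = set()

# First run: everything is new.
print(new_directories(["/data/201807210617", "/data/201807210618"], seen))
# Second run: only the directory that appeared since the last run is returned.
print(new_directories(["/data/201807210618", "/data/201807210619"], seen))
```

In a real flow the lookup and update would go against the external store, so
that the state survives restarts and is shared across nodes.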

Regards,
Ed.

On Mon, Jul 30, 2018 at 6:34 PM Sudhindra Tirupati Nagaraj <
sutirupa@tetrationanalytics.com> wrote:

> Hi,
>
>
>
> We just came across NIFI as a possible option for backing up our data lake
> periodically into S3. We have our pipelines that dump batches of data at
> some granularity. For example, our one-minute dumps are of the form
> “201807210617”, “201807210618”, “201807210619” etc. We are looking for a
> simple configuration based solution that reads these incoming batches
> periodically and creates a workflow for backing these up. Also, these
> batches have a “success” marker inside them that indicates that the batches
> are full and ready to be backed up. We came across the ListHDFS processor
> that can do this, without duplication, but we are not sure if it has the
> ability to only copy batches that have a particular state (that is, like
> having a success marker in them). We are not sure if it also works on
> “folders” and not files directly.
>
>
>
> Can I get some recommendations on whether NIFI can be used for such an
> ingestion use-case/alternative? Thank you.
>
>
>
> Kind Regards,
>
> Sudhindra.
>