Posted to users@nifi.apache.org by Sudhindra Tirupati Nagaraj <su...@tetrationanalytics.com> on 2018/07/30 22:34:20 UTC

Question on ingesting HDFS batches

Hi,

 

We just came across NiFi as a possible option for backing up our data lake periodically into S3. Our pipelines dump batches of data at some granularity; for example, our one-minute dumps are of the form “201807210617”, “201807210618”, “201807210619”, etc. We are looking for a simple configuration-based solution that reads these incoming batches periodically and creates a workflow for backing them up. These batches also contain a “success” marker indicating that they are complete and ready to be backed up. We came across the ListHDFS processor, which can do this without duplication, but we are not sure whether it can copy only batches in a particular state (that is, batches containing a success marker), or whether it works on “folders” rather than only on files.
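
For concreteness, the layout we have in mind looks roughly like this (the paths and the marker file name here are illustrative, not our exact names):

    /datalake/dumps/201807210617/part-00000
    /datalake/dumps/201807210617/part-00001
    /datalake/dumps/201807210617/_SUCCESS      <- marker: batch complete
    /datalake/dumps/201807210618/part-00000    <- no marker yet: in progress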

 

Can I get some recommendations on whether NiFi can be used for such an ingestion use case, or suggestions for alternatives? Thank you.

 

Kind Regards,

Sudhindra. 


Re: Question on ingesting HDFS batches

Posted by Sudhindra Tirupati Nagaraj <su...@tetrationanalytics.com>.
Thanks a lot, Joe, for answering my query!

Sudhindra.



Re: Question on ingesting HDFS batches

Posted by Joe Witt <jo...@gmail.com>.
Sudhindra

The current ListFile processor scans the configured directory,
including any subdirectories, looking for files.  It does this by
generating a listing, comparing it to what it has already seen
(largely based on modification time), and then sending out the
resulting listings.  These can be sent to a FetchFile processor,
which pulls the files.
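
For HDFS the analogous processors are ListHDFS and FetchHDFS.  A
minimal sketch of that list/fetch pattern feeding S3 (the directory
and bucket names are placeholders for illustration) could look like:

    ListHDFS
      Directory: /datalake/dumps           <- placeholder path
      Recurse Subdirectories: true
          |
          v   one flowfile per newly listed file
    FetchHDFS
      HDFS Filename: ${path}/${filename}
          |
          v   flowfile now carries the file content
    PutS3Object
      Bucket: my-backup-bucket             <- placeholder bucket
      Object Key: ${path}/${filename}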

We do not offer a facility to look for the presence of a given
special 'success' file.  We could, and at this point probably should
since it is a common ask, open a JIRA to add a filter that selects
files in a folder only when a file with a certain name, such as
'success', is present.

Thanks
Joe


Re: Question on ingesting HDFS batches

Posted by Ed B <bd...@gmail.com>.
Hi Sudhindra,

I think the current implementation of ListHDFS already gives you the
required functionality.
I'll assume for a moment that your "success" markers are just additional
files that share the same (or a partial) name with the data file, plus
some extension like "*.fin", "*.done", or "*.success".
You could use ListHDFS. It has a regex filter on the file name, so
setting it to "^.+\.success$" will always bring in the new files (since
the last listing) that have the ".success" extension (e.g.
201808050012.success).
If you schedule the ListHDFS processor to run daily (using a one-day
timer or a cron expression for a specific hour), it will wake up only
once a day and find all the success files for that day; your flow can
then locate the data files for those markers and upload them to S3 with
the PutS3Object processor.
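
Putting that together, a sketch of such a flow, assuming the data file
sits right next to its marker and the marker name is just the data file
name plus ".success" (the directory, bucket, and derived-attribute
names below are placeholders for illustration):

    ListHDFS
      Directory: /datalake/dumps                 <- placeholder
      File Filter: ^.+\.success$
      Scheduling: CRON driven, "0 0 1 * * ?"     <- daily at 01:00
          |
          v   one flowfile per new .success marker
    UpdateAttribute
      data.file: ${filename:replaceAll('\.success$','')}
          |
          v   derive the data file name from the marker name
    FetchHDFS
      HDFS Filename: ${path}/${data.file}
          |
          v
    PutS3Object
      Bucket: my-backup-bucket                   <- placeholder
      Object Key: ${path}/${data.file}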

Directories are another story.
If you need a listing of directories, you could use GetHDFSFileInfo (it
can work recursively, with separate filters for dirs and for files). But
this processor doesn't maintain state, so you will need to maintain it
yourself (ZooKeeper or HBase, or even a distributed map cache); one way
to approximate that is sketched below.
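
For example, a sketch using DetectDuplicate backed by a
DistributedMapCacheClientService, so that directories already processed
get filtered out (the path and the cache identifier are assumptions for
illustration):

    GetHDFSFileInfo
      Full path: /datalake/dumps                <- placeholder
      Recurse Subdirectories: true
          |
          v   one flowfile per directory/file found
    DetectDuplicate
      Cache Entry Identifier: ${hdfs.path}/${hdfs.objectName}
      Distributed Cache Service: (a DistributedMapCacheClientService)
          |
          +-- non-duplicate --> fetch the batch files and upload to S3
          +-- duplicate     --> already seen; auto-terminate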

Regards,
Ed.
