Posted to users@nifi.apache.org by Pierre Villard <pi...@gmail.com> on 2021/03/01 13:20:01 UTC

Re: Questions about the GetFile processor

Using the Record Writer will also be much better, as you won't output one
flow file per listed file. You'll have one flow file with one record per
listed file, and you can then chain multiple SplitRecord processors to make
sure the number of flow files at any one point always remains manageable.
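
To make the cascading idea concrete, here is a rough arithmetic sketch in plain Python (it does not call NiFi). It starts from one flow file holding one record per listed file and applies successive split sizes, which would correspond to SplitRecord's "Records Per Split" property. The file count is the one quoted later in this thread; the split sizes 10000, 100 and 1 are purely illustrative assumptions, not recommended values.

```python
from collections import Counter


def cascade_split(total_records, records_per_split):
    """Model a chain of split steps applied to one big record-oriented flow file.

    Returns one tuple per stage:
    (records per split, flow files after the stage, max flow files emitted
    from any single input flow file in that stage).
    """
    # Map "records in a flow file" -> "how many flow files have that size".
    sizes = Counter({total_records: 1})
    stages = []
    for limit in records_per_split:
        next_sizes = Counter()
        largest_fanout = 0
        for size, count in sizes.items():
            full, rest = divmod(size, limit)
            if full:
                next_sizes[limit] += full * count
            if rest:
                next_sizes[rest] += count
            largest_fanout = max(largest_fanout, full + (1 if rest else 0))
        sizes = next_sizes
        stages.append((limit, sum(sizes.values()), largest_fanout))
    return stages


# 3,906,135 listed files, split in three illustrative steps.
for limit, flow_files, fanout in cascade_split(3906135, [10000, 100, 1]):
    print(f"split at {limit:>5}: {flow_files:>7} flow files, "
          f"at most {fanout} produced per input flow file")
```

With these assumed sizes, no single split step turns one input into more than 391 flow files, whereas a single split straight down to one record would emit all 3,906,135 at once.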

On Sat, Feb 27, 2021 at 07:19, Jean-Sebastien Vachon <js...@brizodata.com>
wrote:

> Thanks for the hint
>
> ------------------------------
> *From:* Joe Witt <jo...@gmail.com>
> *Sent:* Friday, February 26, 2021 10:13:20 PM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Questions about the GetFile processor
>
> Hello
>
> Yeah, when there are a ton of files (50k or more) in a directory,
> performance is *horrible*. If you can put them into some subdirectories
> to divide them up, it will go a lot faster.
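
One possible way to do the subdirectory split described above is sketched below in Python. It is only an illustration under stated assumptions: the function name `shard_directory`, the bucket count of 256 and the CRC32-based bucket choice are all made up for this example and are not something NiFi provides or requires. It moves files in place, so try it on a copy first.

```python
import os
import zlib


def shard_directory(source_dir, bucket_count=256):
    """Move every regular file in source_dir into one of bucket_count subdirectories.

    The bucket is derived from a CRC32 of the file name, so the same name
    always lands in the same bucket. Returns the number of files moved.
    """
    # Snapshot the file names first so the directory is not modified
    # while it is still being iterated.
    with os.scandir(source_dir) as entries:
        names = [e.name for e in entries if e.is_file(follow_symlinks=False)]

    moved = 0
    for name in names:
        bucket = zlib.crc32(name.encode("utf-8")) % bucket_count
        bucket_dir = os.path.join(source_dir, f"{bucket:03d}")
        os.makedirs(bucket_dir, exist_ok=True)
        # Same filesystem, so this is a rename rather than a copy.
        os.rename(os.path.join(source_dir, name), os.path.join(bucket_dir, name))
        moved += 1
    return moved
```

After a split like this, the listing would presumably either need "Recurse Subdirectories" enabled or one listing per subdirectory, since the files no longer sit in the top-level folder.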
>
> Thanks
>
> On Fri, Feb 26, 2021 at 7:30 PM Jean-Sebastien Vachon <
> jsvachon@brizodata.com> wrote:
>
> Hi again,
>
> I need to reprocess all my files after we discovered a problem. My folder
> contains 3,906,135 JSON files (590GB total size).
> I tried the ListFile strategy, and it works fine on a small subset, but on
> the whole dataset not a single flow file was queued after many hours of
> waiting.
>
> Is it normal for the listing to take this long?
>
> I am using the following settings:
>
>   Tracking Timestamps,
>   no recurse,
>   file filter is set to the default ([^\.].*),
>   the minimal size is 0b and the min age is 0s,
>   track performance is off,
>   max number of files is set to 5,000,000
>   max disk op time is 10 s
>   max directory listing time is 3 hours
>
> Am I doing something wrong? My server is quite capable, with 512 GB of RAM
> and 128 cores.
>
> Thanks
>
>
> *Jean-Sébastien Vachon *
> Co-Founder & Architect
>
>
> *Brizo Data, Inc. www.brizodata.com
> <https://outlook.office365.com/mail/options/mail/messageContent/www.brizodata.com>
> *
> ------------------------------
> *From:* Jean-Sebastien Vachon <js...@brizodata.com>
> *Sent:* Thursday, February 18, 2021 8:59 AM
>
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Questions about the GetFile processor
>
> OK thanks
>
> I missed that part of the documentation. Stupid me
>
> ------------------------------
> *From:* Arpad Boda <ab...@apache.org>
> *Sent:* Thursday, February 18, 2021 8:46 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Questions about the GetFile processor
>
> GetFile has no persistence.
> Actually it has, but it's called your hard drive. :)
>
> If you take a look at the documentation:
> *Keep Source File - *"If true, the file is not deleted after it has been
> copied to the Content Repository; this causes the file to be picked up
> continually and is useful for testing purposes. If not keeping original
> NiFi will need write permissions on the directory it is pulling from
> otherwise it will ignore the file."
>
> You can see that it's going to get the same files over and over again
> unless you configure it to delete the already processed ones.
>
> The reason I suggested the combination above is that ListFile can be
> triggered once, the metadata (filenames) is stored in your queue, and
> FetchFile can process the files later.
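
As a loose analogy for the list-then-fetch pattern described above, the plain-Python sketch below lists a directory once, keeps only the paths in a queue, and reads file contents later, one at a time. It is not NiFi code: `list_files` and `fetch_next` are hypothetical helper names for this illustration only. The point it shows is that the queue length is the number of files left to process.

```python
import os
from collections import deque


def list_files(directory):
    """List step: capture only metadata (the paths), not the file contents."""
    with os.scandir(directory) as entries:
        return deque(sorted(e.path for e in entries if e.is_file()))


def fetch_next(queue):
    """Fetch step: take one listed path off the queue and read its content."""
    path = queue.popleft()
    with open(path, "rb") as handle:
        return path, handle.read()
```

Calling `len(queue)` between fetches gives the remaining count, much like the number shown on the connection between the two processors.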
>
> On Thu, Feb 18, 2021 at 2:39 PM Jean-Sebastien Vachon <
> jsvachon@brizodata.com> wrote:
>
> OK I understand your point.. sorry (early morning) 😉
>
> I am kind of stuck with the GetFile processor for now. Is there a way to
> know how many files are left to process?
>
> Will it go on forever, or will it stop streaming once all files have been
> processed? (There are no new files in the folder; everything was there at
> the beginning.)
>
> Thanks
>
> ------------------------------
> *From:* Jean-Sebastien Vachon <js...@brizodata.com>
> *Sent:* Thursday, February 18, 2021 8:34 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Questions about the GetFile processor
>
> Thanks for your comment. However, I can't queue everything as the total
> size of the data is around 560GB.
> Right now, I am using a GetFile processor and it has been running for a
> few days. Judging by my end point, it should be done pretty soon, but data
> is still streaming in at the same rate. So I was wondering whether the
> processor remembers every single file it has already processed, or whether
> it simply goes through all the files alphabetically or in whatever order
> it decides.
>
> Thanks
>
> ------------------------------
> *From:* Arpad Boda <ab...@apache.org>
> *Sent:* Thursday, February 18, 2021 8:29 AM
> *To:* users@nifi.apache.org <us...@nifi.apache.org>
> *Subject:* Re: Questions about the GetFile processor
>
> You can use the combination of ListFile and FetchFile.
> In the queue between the two you are going to see the number of
> flow files left to be processed.
>
> On Thu, Feb 18, 2021 at 2:14 PM Jean-Sebastien Vachon <
> jsvachon@brizodata.com> wrote:
>
> Hi all,
>
> If I configure a GetFile processor to list all JSON files under a given
> folder, will it stop sending flow files once it has processed all files?
> My folder contains thousands of files and the processor reads them in
> small batches (10) every 30s.
>
> Is there a way to know how many files are left to process?
>
> Thanks
>
>
>