Posted to users@nifi.apache.org by Jeremy Pemberton-Pigott <fu...@gmail.com> on 2020/03/09 06:34:25 UTC

Listing a folder with millions of files

Hi,

I need to list a subset (a few hundred thousand) of the files in a folder
that contains millions of files, to do some historical processing.  What's
the best way to do that?  ListFile is taking far too long and seems to try
to dump the entire listing into the flow at once when I test it on a
smaller folder.  It would be better if the listing emitted files in smaller
chunks so that the flow could start working on them sooner.

Regards,

Jeremy
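
Below is a minimal illustration, outside NiFi, of the kind of chunked listing the post above asks for: lazily scan a large directory and hand out matching paths a batch at a time, so downstream work can start before the full scan finishes. It is only a sketch in Python; the directory, filename pattern, and function name are hypothetical.

import fnmatch
import os
from typing import Iterator, List

def list_in_chunks(directory: str, pattern: str, chunk_size: int = 10000) -> Iterator[List[str]]:
    """Yield lists of at most chunk_size matching paths without building the full listing."""
    chunk: List[str] = []
    # os.scandir streams directory entries instead of materialising the whole listing
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and fnmatch.fnmatch(entry.name, pattern):
                chunk.append(entry.path)
                if len(chunk) == chunk_size:
                    yield chunk
                    chunk = []
    if chunk:
        yield chunk

if __name__ == "__main__":
    # Hypothetical folder and pattern; processing can begin as soon as the first chunk arrives.
    for batch in list_in_chunks("/data/archive", "2019-*.log"):
        print(f"batch of {len(batch)} files, first: {batch[0]}")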

Re: Listing a folder with millions of files

Posted by Jeremy Pemberton-Pigott <fu...@gmail.com>.
Thanks for the suggestions, guys. The pre-filtered list is possibly an approach I can use.

Regards,

Jeremy



Re: Listing a folder with millions of files

Posted by Shawn Weeks <sw...@weeksconsulting.us>.
When I've had to do this I just skip ListFile and instead create a text file containing a list of all the files I want, which can then be used with the SplitFile and FetchFile processors to pull them in a batch at a time. Even with filtering, ListFile will iterate through a lot of files.

Thanks
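
A minimal sketch of the pre-generated list approach described above, assuming Python and hypothetical paths and pattern: scan the folder once, keep only the files of interest, and write their absolute paths one per line to a text file the flow can then split and fetch in batches.

import fnmatch
import os

SOURCE_DIR = "/data/archive"            # hypothetical folder with millions of files
PATTERN = "2019-12-*.log"               # hypothetical subset to process
LIST_FILE = "/tmp/files-to-fetch.txt"   # output: one absolute path per line

count = 0
with os.scandir(SOURCE_DIR) as entries, open(LIST_FILE, "w") as out:
    for entry in entries:
        if entry.is_file() and fnmatch.fnmatch(entry.name, PATTERN):
            out.write(entry.path + "\n")
            count += 1
print(f"wrote {count} paths to {LIST_FILE}")

One assumed wiring for the NiFi side, worth verifying rather than taking as given: pick up the list file (e.g. GetFile), split it into line-count batches with SplitText, copy each line into an attribute with ExtractText, and point FetchFile at that attribute for the path to fetch.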


Re: Listing a folder with millions of files

Posted by Edward Armes <ed...@gmail.com>.
Hi Jeremy,

I don't think there is an easy answer here.

You may have some luck adjusting the processor's maximum run time, but
without checking the processor's implementation I can't say for certain
whether that would have any effect.

Edward
