Posted to users@nifi.apache.org by Jeremy Pemberton-Pigott <fu...@gmail.com> on 2020/03/09 06:34:25 UTC
Listing a folder with millions of files
Hi,
I need to list a subset (a few hundred thousand) of the files in a folder
that holds millions of files (to do some historical processing). What's the
best way to do that? ListFile is taking far too long, and when I test it on
a smaller folder it seems to dump the entire listing into the flow at once.
It would be good if the listing emitted files in smaller chunks so that the
flow can start working on them.
Regards,
Jeremy
Re: Listing a folder with millions of files
Posted by Jeremy Pemberton-Pigott <fu...@gmail.com>.
Thanks for the suggestions guys. The pre-filtered list is possibly one I can use.
Regards,
Jeremy
> On 9 Mar 2020, at 20:16, Shawn Weeks <sw...@weeksconsulting.us> wrote:
>
> When I’ve had to do this, I skip ListFile altogether. Instead, I create a text file listing all the files to process and use it with the SplitText and FetchFile processors to pull the files in batch by batch. Even with filtering, ListFile will iterate through a lot of files.
>
> Thanks
Re: Listing a folder with millions of files
Posted by Shawn Weeks <sw...@weeksconsulting.us>.
When I’ve had to do this, I skip ListFile altogether. Instead, I create a text file listing all the files to process and use it with the SplitText and FetchFile processors to pull the files in batch by batch. Even with filtering, ListFile will iterate through a lot of files.
Thanks
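For anyone wanting to try the pre-built list approach, here is a minimal sketch of one way to produce such a text file outside NiFi. It is not from the thread: the function names, the glob-style name filter, and the one-path-per-line output format are all assumptions made for illustration. It walks the folder lazily with os.scandir, so the full listing is never held in memory, and writes each matching path as it is found.

```python
import fnmatch
import os
from typing import Iterator


def iter_matching_files(folder: str, pattern: str) -> Iterator[str]:
    """Lazily yield absolute paths of regular files in `folder` whose
    names match the glob-style `pattern` (hypothetical filter rule)."""
    # os.scandir streams directory entries instead of building a full list,
    # which matters when the folder holds millions of files.
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False) and fnmatch.fnmatchcase(
                entry.name, pattern
            ):
                yield os.path.abspath(entry.path)


def write_file_list(folder: str, pattern: str, list_path: str) -> int:
    """Write one matching path per line to `list_path`; return the count.

    Assumption: `list_path` lives outside `folder`, so the list file
    is never picked up by its own scan.
    """
    count = 0
    with open(list_path, "w", encoding="utf-8") as out:
        for path in iter_matching_files(folder, pattern):
            out.write(path + "\n")
            count += 1
    return count
```

Calling write_file_list with the source folder, a pattern, and an output path gives a file that can be split line by line, with each line handed to a fetch step. How that is wired up in the flow is not described in this thread, so treat that part as something to work out against your own setup.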
Re: Listing a folder with millions of files
Posted by Edward Armes <ed...@gmail.com>.
Hi Jeremy,
I don't think there is an easy answer in this case.
You may have some luck with adjusting the max runtime of the processor, but
without checking the processor's implementation I couldn't say for certain
whether that would have any effect.
Edward