Posted to users@nifi.apache.org by Mike Sofen <MS...@ansunbiopharma.com> on 2021/03/19 16:57:22 UTC

speeding up ListFile

I've built a document processing solution in NiFi, using the ListFile/FetchFile model hitting a large document repository on our Windows file server.  It's nearly a million files ranging in size from 100 KB to 300 MB, with file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff, and some specialized binary files.  The million files are distributed across tens of thousands of folders.

The challenge is, for an example subfolder that has 25k files in 11k folders totaling 17 GB, it took upwards of 30 minutes for a single ListFile to generate a list and send it downstream to the next processor.  It's running on a PC with a latest-gen Core i7, 32 GB RAM, and a 1 TB SSD - plenty of horsepower and speed.  My bootstrap.conf has java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I'm sending these for processing by Tika and Tika generates an error when it receives an encrypted file (we have just a few of those, but enough to be annoying).
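[As an editorial aside: the thread never answers this, but one common stdlib-only heuristic is sketched below. The assumptions are mine, not from the thread: PDFs with standard security carry an /Encrypt entry in the trailer, and password-protected .docx/.xlsx/.pptx files are stored as OLE compound files (magic bytes D0 CF 11 E0) rather than as ZIP archives. It is a rough pre-filter, not a reliable detector.]

```python
# Rough heuristic for flagging likely-encrypted files before they reach Tika.
# Assumptions (not from this thread): encrypted PDFs have an /Encrypt entry
# near the trailer; password-protected modern Office files are OLE compound
# files (magic D0 CF 11 E0) instead of ZIP archives ("PK...").
import os

OLE_MAGIC = b"\xd0\xcf\x11\xe0"

def looks_encrypted(path: str) -> bool:
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as f:
        head = f.read(8)
        if ext == ".pdf":
            # The /Encrypt reference normally sits in the trailer, near EOF.
            f.seek(0, os.SEEK_END)
            f.seek(max(0, f.tell() - 4096))
            return b"/Encrypt" in f.read()
        if ext in (".docx", ".xlsx", ".pptx"):
            # Plain modern Office files are ZIP; password protection wraps
            # the payload in an OLE compound file instead.
            return head.startswith(OLE_MAGIC)
    return False
```

Routing on the result of such a check (e.g. via ExecuteScript or a pre-pass) would keep the encrypted stragglers away from Tika, at the cost of occasional false positives.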

Mike Sofen

RE: [EXTERNAL] speeding up ListFile

Posted by Mike Sofen <MS...@ansunbiopharma.com>.
Mark,

Fantastic info.  Since I'm a fan of MiNiFi, that was a great suggestion.  I've built a MiNiFi flow before to do something quite similar, so all I need now is access to that file server, and I'll give it a go.  It IS an on-prem server we own.  And this is a one-time operation to process 20 years of research, regulatory, and FDA documents, so I can chip away at it over time.

Mike

From: Mark Payne <ma...@hotmail.com>
Sent: Saturday, March 20, 2021 7:54 AM
To: users@nifi.apache.org
Subject: Re: [EXTERNAL] speeding up ListFile

Mike,

Good to know. In short: yes, absolutely, the Windows file server will slow it down that much. You did mention a 100 mb network. Generally, what will be more important is the network latency (because performing the listing and gathering filenames, sizes, etc. can require many tiny requests) and the performance of the server itself (if it’s busy handling tons of other clients, it may be slow to respond).

The reason we added the performance metrics in the first place is because we had a user who was upset by the poor performance on a network mounted drive (I think a Windows file server but I’m not sure). Every time they used ‘ls’ or equivalent it was blazing fast. But after instrumenting all of the metrics we were able to find that after doing 50,000 disk operations, even though the typical request was perhaps < 1 ms, some would block for many seconds, even minutes. Not sure if it was a network glitch or the file server itself. That then led us to adding the ability to turn off fetching file attributes, as that made a massive difference for them.

I don’t know anything about configuring a Windows file server, so I won’t be of help there. But if you own the Windows file server, perhaps this is a situation where it would make sense to run MiNiFi on the file server and have it ship the data to NiFi instead of having NiFi polling. That way, MiNiFi would have local disk access and could push the data to NiFi more quickly. (If this seems like something that would be doable for you, I would recommend you ask for details from someone with more experience in the MiNiFi part of the code base to ensure that all necessary functionality is there; I haven’t looked at MiNiFi in a while, but I think it is.)

Thanks
-Mark




On Mar 20, 2021, at 10:41 AM, Mike Sofen <MS...@ansunbiopharma.com> wrote:

It’s NOT ListFile that is slow, at least not for local file systems.

I re-ran a test against a folder tree local to the PC running NiFi (with an SSD).  It had 667 files in 129 folders, from which it found 117 matching file types to list (but it still had to read every folder and file).  Very, VERY fast:

248 ms  ListFile (0.37 ms per file)
 23 ms  UpdateAttribute (add 8 attributes)
 12 ms  RouteOnAttribute (3 paths)

Is it possible that a Windows file server on a 100 Mb network can slow it down so much? Has anyone found a way to speed up remote Windows file access?

Mike

From: Mike Sofen <ms...@runbox.com>
Sent: Friday, March 19, 2021 6:54 PM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

Someone help me here: the 157-file listing averaged 46 ms per operation, so the total duration SHOULD have been about 7.2 seconds, not nearly 4 minutes (227 seconds).  What could be going on for the other 220 seconds?  Something is amiss.
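[Editorial note: the arithmetic behind that complaint, spelled out with the figures reported in this thread, shows how little of the wall-clock time the per-operation metric accounts for.]

```python
# Spelling out the arithmetic in the paragraph above
# (figures come from the ListFile debug output quoted in this thread).
ops = 157            # RETRIEVE_NEXT_FILE_FROM_OS operations
avg_ms = 46.229      # average time per operation, in milliseconds
wall_s = 227         # total wall-clock time reported, in seconds

measured_s = ops * avg_ms / 1000     # time actually spent inside the calls
missing_s = wall_s - measured_s      # time the metric does not account for

print(round(measured_s, 1))          # 7.3
print(round(missing_s, 1))           # 219.7
print(round(wall_s / ops, 2))        # 1.45 seconds of wall clock per file
```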

Mike

From: Mike Sofen <MS...@ansunbiopharma.com>
Sent: Friday, March 19, 2021 3:47 PM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

Hopes dashed on the rocks of reality...dang.  I just retested my folder with 25k files and 11k subfolders (many nesting levels deep – perhaps 15 levels), after clearing state, with the Include File Attributes set to false and it took the same amount of time to produce the listing – about 30 minutes.

For some reason my debug setting isn’t writing to the log file (I set debug from within the ListFile processor).  But it did pop up that red error square on the processor.  So to save time, I re-ran it again for just a deep child folder that had 2 subfolders with a total of 157 files.  Here’s my transcription of the debug:

“Over the past 227 seconds, For Operation ‘RETRIEVE_NEXT_FILE_FROM_OS’ there were 157 operations performed with an average time of 46.229 milliseconds; STD Deviation = 34ms; Min Time = 0ms; Max Time = 170ms; 12 significant outliers.”

To state the obvious, this tiny listing of 157 files averaged more than 1 second per file.  That mirrors the speed from my 25k trial which averaged a bit over 1 second per file – that is really slow.  What might be going on with the “significant outliers”?

Mike

From: Olson, Eric <Er...@adm.com>
Sent: Friday, March 19, 2021 11:45 AM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

I’ve observed the same thing. I’m also monitoring directories of large numbers of files and noticed this morning that ListFile took about 30 min to process one directory of about 800,000 files. This is under Linux, but the folder in question is a shared Windows network folder that has been mounted to the Linux machine. (I don’t know how that was done; it’s something my Linux admin set up for me.)

I just ran a quick test on a folder with about 75,000 files. ListFile with Include File Attributes set to false took about 10 s to emit the 75,000 FlowFiles. ListFile including file attributes took about 70 s. At the OS level, “ls -lR | wc” takes 2 seconds.

Interestingly, in the downstream queue, the two sets of files have the same lineage duration. I guess that’s measured starting at when the ListFile processor was started.


From: Mark Payne <ma...@hotmail.com>
Sent: Friday, March 19, 2021 12:08 PM
To: users@nifi.apache.org
Subject: [EXTERNAL] Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling the directory structure that takes forever? If so, there’s not a lot that can be done, as accessing tons of files just tends to be slow. One way to verify this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command would be on Windows.
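[Editorial note: a minimal sketch of that measurement, timing the pure traversal outside NiFi; the directory is whatever path ListFile points at, shown here as a variable rather than a real path from the thread.]

```shell
# Time a recursive listing to isolate raw filesystem-walk cost from NiFi.
# DIR is whatever directory ListFile is pointed at (defaults to the cwd here).
DIR=${DIR:-.}

time ls -laR "$DIR" > /dev/null   # wall-clock for the pure traversal

ls -laR "$DIR" | wc -l            # rough entry count for the same tree
```

On Windows, `dir /s` from cmd (or `Get-ChildItem -Recurse` in PowerShell) is a rough analogue of the recursive listing.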

The “Track Performance” property of the processor can be used to understand more about the performance characteristics of the disk access. Set that to true and enable DEBUG logging for the processor.

If there are heap concerns from generating a million FlowFiles, then you can set a Record Writer on the processor so that only a single FlowFile gets created. That can then be split up using a tiered approach (SplitRecord to split into 10,000-Record chunks, and then another SplitRecord to split each 10,000-Record chunk into 1-Record chunks, and then EvaluateJsonPath, for instance, to pull the actual filename into an attribute). I suspect this is not the issue, with that max heap and given that it’s approximately 1 million files. But it may be a factor.
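[Editorial note: the tiered split can be illustrated in plain Python (this is not NiFi code, and the numbers are scaled down from the ~1M-file case): the point is that no single split step ever produces more than 10,000 outputs from one input.]

```python
# Illustration of the tiered-split idea in plain Python (not NiFi code).
# Scaled down: 100,000 records instead of ~1,000,000.
def chunks(records, size):
    """Yield successive fixed-size slices of a record list."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

listing = [f"file_{i}" for i in range(100_000)]   # stand-in for one big listing

stage1 = list(chunks(listing, 10_000))            # 10 chunks of 10,000 records
stage2 = [single for chunk in stage1              # 100,000 one-record chunks,
          for single in chunks(chunk, 1)]         # but never more than 10,000
                                                  # produced from any one input

print(len(stage1), len(stage2))   # 10 100000
```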

Also, setting the “Include File Attributes” property to false can significantly improve performance, especially on a remote network drive or some specific types of drives/OSes.

Would recommend you play around with the above options to better understand the performance characteristics of your particular environment.

Thanks
-Mark




Confidentiality Notice:
This message may contain confidential or privileged information, or information that is otherwise exempt from disclosure. If you are not the intended recipient, you should promptly delete it and should not disclose, copy or distribute it to others.







RE: [EXTERNAL] Re: speeding up ListFile

Posted by Mike Sofen <MS...@ansunbiopharma.com>.
Hopes dashed on the rocks of reality...dang.  I just retested my folder with 25k files and 11k subfolders (many nesting levels deep – perhaps 15 levels), after clearing state, with the Include File Attributes set to false and it took the same amount of time to produce the listing – about 30 minutes.

For some reason my debug setting isn’t writing to the log file (I set debug from within the ListFile processor).  But it did pop up that red error square on the processor.  So to save time, I re-ran it again for just a deep child folder that had 2 subfolders with a total of 157 files.  Here’s my transcription of the debug:

“Over the past 227 seconds, For Operation ‘RETRIEVE_NEXT_FILE_FROM_OS’ there were 157 operations performed with an average time of 46.229 milliseconds; STD Deviation = 34ms; Min Time = 0ms; Max Time = 170ms; 12 significant outliers.”

To state the obvious, this tiny listing of 157 files averaged more than 1 second per file.  That mirrors the speed from my 25k trial which averaged a bit over 1 second per file – that is really slow.  What might be going on with the “significant outliers”?

Mike

From: Olson, Eric <Er...@adm.com>
Sent: Friday, March 19, 2021 11:45 AM
To: users@nifi.apache.org
Subject: RE: [EXTERNAL] Re: speeding up ListFile

I’ve observed the same thing. I’m also monitoring directories of large numbers of files and noticed this morning that ListFile took about 30 min to process one directory of about 800,000 files. This is under Linux, but the folder in question is a shared Windows network folder that has been mounted to the Linux machine. (I don’t know how that was done; it’s something my Linux admin set up for me.)

I just ran a quick test on a folder with about 75,000 files. ListFile with Include File Attributes set to false took about 10 s to emit the 75,000 FlowFiles. ListFile including file attributes took about 70 s. At the OS level, “ls -lR | wc” takes 2 seconds.

Interestingly, in the downstream queue, the two sets of files have the same lineage duration. I guess that’s measured starting at when the ListFile processor was started.


From: Mark Payne <ma...@hotmail.com>
Sent: Friday, March 19, 2021 12:08 PM
To: users@nifi.apache.org
Subject: [EXTERNAL] Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling the directory structure that takes forever? If so, there’s not a lot that can be done, as accessing tons of files just tends to be slow. One way to verify this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command would be on Windows.

The “Track Performance” property of the processor can be used to understand more about the performance characteristics of the disk access. Set that to true and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a Record Writer on the processor so that only a single FlowFile gets created. That can then be split up using a tiered approach (SplitRecord to split into 10,000 Record chunks, and then another SplitRecord to split each 10,000 Record chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull the actual filename into an attribute). I suspect this is not the issue, with that much heap and given that it’s approximately 1 million files. But it may be a factor.

Also, setting the “Include File Attributes” to false can significantly improve performance, especially on a remote network drive, or some specific types of drives/OS’s.

Would recommend you play around with the above options to better understand the performance characteristics of your particular environment.

Thanks
-Mark

On Mar 19, 2021, at 12:57 PM, Mike Sofen <MS...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in Nifi, using the ListFile/FetchFile model hitting a large document repository on our Windows file server.  It’s nearly a million files ranging in size from 100kb to 300mb, and file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized binary files.  The million files are distributed across tens of thousands of folders.

The challenge is, for an example subfolder that has 25k files in 11k folders totalling 17gb, it took upwards of 30 minutes for a single ListFile to generate a list and send it downstream to the next processor.  It’s running on a PC with the latest gen core i7 with 32gb ram and a 1TB SSD – plenty of horsepower and speed.  My bootstrap.conf has the java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these for processing by Tika and Tika generates an error when it receives an encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen




RE: [EXTERNAL] Re: speeding up ListFile

Posted by "Olson, Eric" <Er...@adm.com>.
I’ve observed the same thing. I’m also monitoring directories of large numbers of files and noticed this morning that ListFile took about 30 min to process one directory of about 800,000 files. This is under Linux, but the folder in question is a shared Windows network folder that has been mounted to the Linux machine. (I don’t know how that was done; it’s something my Linux admin set up for me.)

I just ran a quick test on a folder with about 75,000 files. ListFile with Include File Attributes set to false took about 10 s to emit the 75,000 FlowFiles. ListFile including file attributes took about 70 s. At the OS level, “ls -lR | wc” takes 2 seconds.

Interestingly, in the downstream queue, the two sets of files have the same lineage duration. I guess that’s measured starting at when the ListFile processor was started.


From: Mark Payne <ma...@hotmail.com>
Sent: Friday, March 19, 2021 12:08 PM
To: users@nifi.apache.org
Subject: [EXTERNAL] Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling the directory structure that takes forever? If so, there’s not a lot that can be done, as accessing tons of files just tends to be slow. One way to verify this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command would be on Windows.

The “Track Performance” property of the processor can be used to understand more about the performance characteristics of the disk access. Set that to true and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a Record Writer on the processor so that only a single FlowFile gets created. That can then be split up using a tiered approach (SplitRecord to split into 10,000 Record chunks, and then another SplitRecord to split each 10,000 Record chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull the actual filename into an attribute). I suspect this is not the issue, with that much heap and given that it’s approximately 1 million files. But it may be a factor.

Also, setting the “Include File Attributes” to false can significantly improve performance, especially on a remote network drive, or some specific types of drives/OS’s.

Would recommend you play around with the above options to better understand the performance characteristics of your particular environment.

Thanks
-Mark


On Mar 19, 2021, at 12:57 PM, Mike Sofen <MS...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in Nifi, using the ListFile/FetchFile model hitting a large document repository on our Windows file server.  It’s nearly a million files ranging in size from 100kb to 300mb, and file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized binary files.  The million files are distributed across tens of thousands of folders.

The challenge is, for an example subfolder that has 25k files in 11k folders totalling 17gb, it took upwards of 30 minutes for a single ListFile to generate a list and send it downstream to the next processor.  It’s running on a PC with the latest gen core i7 with 32gb ram and a 1TB SSD – plenty of horsepower and speed.  My bootstrap.conf has the java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these for processing by Tika and Tika generates an error when it receives an encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen





RE: speeding up ListFile

Posted by Mike Sofen <MS...@ansunbiopharma.com>.
I’m pretty sure it’s the directory crawling that is the issue.  And I’m not trying to process the whole thing at once but instead taking small slices like the 25k files in 11k folders and testing that.  On Windows, it also takes a long time (many minutes) just to generate a file and folder count.

I will use the Track Performance setting as you mentioned and try to get some additional data points.

Re Include File Attributes:  I need the filename and path for downstream processing, I will have to test that those are still included somewhere if I turn off that flag.

Thanks for the tips, will update with more data.

Mike

From: Mark Payne <ma...@hotmail.com>
Sent: Friday, March 19, 2021 10:08 AM
To: users@nifi.apache.org
Subject: Re: speeding up ListFile

It’s hard to say without knowing what’s taking so long. Is it simply crawling the directory structure that takes forever? If so, there’s not a lot that can be done, as accessing tons of files just tends to be slow. One way to verify this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command would be on Windows.

The “Track Performance” property of the processor can be used to understand more about the performance characteristics of the disk access. Set that to true and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a Record Writer on the processor so that only a single FlowFile gets created. That can then be split up using a tiered approach (SplitRecord to split into 10,000 Record chunks, and then another SplitRecord to split each 10,000 Record chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull the actual filename into an attribute). I suspect this is not the issue, with that much heap and given that it’s approximately 1 million files. But it may be a factor.

Also, setting the “Include File Attributes” to false can significantly improve performance, especially on a remote network drive, or some specific types of drives/OS’s.

Would recommend you play around with the above options to better understand the performance characteristics of your particular environment.

Thanks
-Mark


On Mar 19, 2021, at 12:57 PM, Mike Sofen <MS...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in Nifi, using the ListFile/FetchFile model hitting a large document repository on our Windows file server.  It’s nearly a million files ranging in size from 100kb to 300mb, and file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized binary files.  The million files are distributed across tens of thousands of folders.

The challenge is, for an example subfolder that has 25k files in 11k folders totalling 17gb, it took upwards of 30 minutes for a single ListFile to generate a list and send it downstream to the next processor.  It’s running on a PC with the latest gen core i7 with 32gb ram and a 1TB SSD – plenty of horsepower and speed.  My bootstrap.conf has the java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these for processing by Tika and Tika generates an error when it receives an encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen


Re: speeding up ListFile

Posted by James McMahon <js...@gmail.com>.
Thank you. Excluding those other attributes seems like a good choice to
improve performance unless we absolutely need those other attributes.

On Mon, Mar 22, 2021 at 9:00 AM Mike Sofen <MS...@ansunbiopharma.com>
wrote:

> I just retested, to be sure, and there is no impact from setting “include
> file attributes” to False – stopping a flow pointed at a folder tree that
> had already processed the files, adding one file, then restarting it, the
> flow only picked up the new file.  And it still includes the critical
> attributes of filename, path and creation date.  For my use case, this is
> an appropriate and valuable setting.  Mike
>
>
>
> *From:* James McMahon <js...@gmail.com>
> *Sent:* Saturday, March 20, 2021 5:26 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: speeding up ListFile
>
>
>
> When we set “include file attributes” to False, does that in any way
> impact ListFile’s ability to track and retrieve new files by state?
>
>
>
> On Fri, Mar 19, 2021 at 1:08 PM Mark Payne <ma...@hotmail.com> wrote:
>
> It’s hard to say without knowing what’s taking so long. Is it simply
> crawling the directory structure that takes forever? If so, there’s not a
> lot that can be done, as accessing tons of files just tends to be slow. One
> way to verify this, on Linux, would be to run:
>
>
>
> ls -laR
>
>
>
> I.e., a recursive listing of all files. Not sure what the analogous
> command would be on Windows.
>
>
>
> The “Track Performance” property of the processor can be used to
> understand more about the performance characteristics of the disk access.
> Set that to true and enable DEBUG logging for the processor.
>
>
>
> If there are heap concerns, generating a million FlowFiles, then you can
> set a Record Writer on the processor so that only a single FlowFile gets
> created. That can then be split up using a tiered approach (SplitRecord to
> split into 10,000 Record chunks, and then another SplitRecord to split each
> 10,000 Record chunk into a 1-Record chunk, and then EvaluateJsonPath, for
> instance, to pull the actual filename into an attribute). I suspect this is
> not the issue, with that much heap and given that it’s approximately 1
> million files. But it may be a factor.
>
>
>
> Also, setting the “Include File Attributes” to false can significantly
> improve performance, especially on a remote network drive, or some specific
> types of drives/OS’s.
>
>
>
> Would recommend you play around with the above options to better
> understand the performance characteristics of your particular environment.
>
>
>
> Thanks
>
> -Mark
>
>
>
> On Mar 19, 2021, at 12:57 PM, Mike Sofen <MS...@ansunbiopharma.com>
> wrote:
>
>
>
> I’ve built a document processing solution in Nifi, using the
> ListFile/FetchFile model hitting a large document repository on our Windows
> file server.  It’s nearly a million files ranging in size from 100kb to
> 300mb, and file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf,
> png, tiff and some specialized binary files.  The million files are
> distributed across tens of thousands of folders.
>
>
>
> The challenge is, for an example subfolder that has 25k files in 11k
> folders totalling 17gb, it took upwards of 30 minutes for a single ListFile
> to generate a list and send it downstream to the next processor.  It’s
> running on a PC with the latest gen core i7 with 32gb ram and a 1TB SSD –
> plenty of horsepower and speed.  My bootstrap.conf has the java.arg.2=-Xms4g
> and java.arg.3=-Xmx16g.
>
>
>
> Is there any way to speed up ListFile?
>
>
>
> Also, is there any way to detect that a file is encrypted?  I’m sending
> these for processing by Tika and Tika generates an error when it receives
> an encrypted file (we have just a few of those, but enough to be annoying).
>
>
>
> Mike Sofen
>
>
>
>

RE: speeding up ListFile

Posted by Mike Sofen <MS...@ansunbiopharma.com>.
I just retested, to be sure, and there is no impact from setting “include file attributes” to False – stopping a flow pointed at a folder tree that had already processed the files, adding one file, then restarting it, the flow only picked up the new file.  And it still includes the critical attributes of filename, path and creation date.  For my use case, this is an appropriate and valuable setting.  Mike

From: James McMahon <js...@gmail.com>
Sent: Saturday, March 20, 2021 5:26 PM
To: users@nifi.apache.org
Subject: Re: speeding up ListFile

When we set “include file attributes” to False, does that in any way impact ListFile’s ability to track and retrieve new files by state?

On Fri, Mar 19, 2021 at 1:08 PM Mark Payne <ma...@hotmail.com> wrote:
It’s hard to say without knowing what’s taking so long. Is it simply crawling the directory structure that takes forever? If so, there’s not a lot that can be done, as accessing tons of files just tends to be slow. One way to verify this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command would be on Windows.

The “Track Performance” property of the processor can be used to understand more about the performance characteristics of the disk access. Set that to true and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a Record Writer on the processor so that only a single FlowFile gets created. That can then be split up using a tiered approach (SplitRecord to split into 10,000 Record chunks, and then another SplitRecord to split each 10,000 Record chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull the actual filename into an attribute). I suspect this is not the issue, with that much heap and given that it’s approximately 1 million files. But it may be a factor.

Also, setting the “Include File Attributes” to false can significantly improve performance, especially on a remote network drive, or some specific types of drives/OS’s.

Would recommend you play around with the above options to better understand the performance characteristics of your particular environment.

Thanks
-Mark


On Mar 19, 2021, at 12:57 PM, Mike Sofen <MS...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in Nifi, using the ListFile/FetchFile model hitting a large document repository on our Windows file server.  It’s nearly a million files ranging in size from 100kb to 300mb, and file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized binary files.  The million files are distributed across tens of thousands of folders.

The challenge is, for an example subfolder that has 25k files in 11k folders totalling 17gb, it took upwards of 30 minutes for a single ListFile to generate a list and send it downstream to the next processor.  It’s running on a PC with the latest gen core i7 with 32gb ram and a 1TB SSD – plenty of horsepower and speed.  My bootstrap.conf has the java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these for processing by Tika and Tika generates an error when it receives an encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen


Re: speeding up ListFile

Posted by James McMahon <js...@gmail.com>.
When we set “include file attributes” to False, does that in any way impact
ListFile’s ability to track and retrieve new files by state?

On Fri, Mar 19, 2021 at 1:08 PM Mark Payne <ma...@hotmail.com> wrote:

> It’s hard to say without knowing what’s taking so long. Is it simply
> crawling the directory structure that takes forever? If so, there’s not a
> lot that can be done, as accessing tons of files just tends to be slow. One
> way to verify this, on Linux, would be to run:
>
> ls -laR
>
> I.e., a recursive listing of all files. Not sure what the analogous
> command would be on Windows.
>
> The “Track Performance” property of the processor can be used to
> understand more about the performance characteristics of the disk access.
> Set that to true and enable DEBUG logging for the processor.
>
> If there are heap concerns, generating a million FlowFiles, then you can
> set a Record Writer on the processor so that only a single FlowFile gets
> created. That can then be split up using a tiered approach (SplitRecord to
> split into 10,000 Record chunks, and then another SplitRecord to split each
> 10,000 Record chunk into a 1-Record chunk, and then EvaluateJsonPath, for
> instance, to pull the actual filename into an attribute). I suspect this is
> not the issue, with that much heap and given that it’s approximately 1
> million files. But it may be a factor.
>
> Also, setting the “Include File Attributes” to false can significantly
> improve performance, especially on a remote network drive, or some specific
> types of drives/OS’s.
>
> Would recommend you play around with the above options to better
> understand the performance characteristics of your particular environment.
>
> Thanks
> -Mark
>
> On Mar 19, 2021, at 12:57 PM, Mike Sofen <MS...@ansunbiopharma.com>
> wrote:
>
> I’ve built a document processing solution in Nifi, using the
> ListFile/FetchFile model hitting a large document repository on our Windows
> file server.  It’s nearly a million files ranging in size from 100kb to
> 300mb, and file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf,
> png, tiff and some specialized binary files.  The million files are
> distributed across tens of thousands of folders.
>
> The challenge is, for an example subfolder that has 25k files in 11k
> folders totalling 17gb, it took upwards of 30 minutes for a single ListFile
> to generate a list and send it downstream to the next processor.  It’s
> running on a PC with the latest gen core i7 with 32gb ram and a 1TB SSD –
> plenty of horsepower and speed.  My bootstrap.conf has the java.arg.2=-Xms4g
> and java.arg.3=-Xmx16g.
>
> Is there any way to speed up ListFile?
>
> Also, is there any way to detect that a file is encrypted?  I’m sending
> these for processing by Tika and Tika generates an error when it receives
> an encrypted file (we have just a few of those, but enough to be annoying).
>
> Mike Sofen
>
>
>

Re: speeding up ListFile

Posted by Mark Payne <ma...@hotmail.com>.
It’s hard to say without knowing what’s taking so long. Is it simply crawling the directory structure that takes forever? If so, there’s not a lot that can be done, as accessing tons of files just tends to be slow. One way to verify this, on Linux, would be to run:

ls -laR

I.e., a recursive listing of all files. Not sure what the analogous command would be on Windows.
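[Editor's example, not part of the original thread: the raw crawl cost can also be timed outside NiFi with a short Python sketch. The path is a placeholder, and the per-file stat() only approximates what "Include File Attributes" adds, but it separates directory enumeration from attribute fetching.]

```python
import os
import time

def crawl(root, include_attributes=True):
    """Walk a directory tree, optionally stat-ing every file.

    The stat() call is roughly what "Include File Attributes" costs:
    one extra metadata round trip per file, which is where a remote
    share typically loses most of its time.
    """
    count = 0
    start = time.monotonic()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            count += 1
            if include_attributes:
                os.stat(os.path.join(dirpath, name))
    return count, time.monotonic() - start

# Example with a hypothetical path:
# files, seconds = crawl(r"\\fileserver\docs\research")
# print(files, "files in", round(seconds, 1), "s")
```

Running it twice, the second time with include_attributes=False, shows how much of the wall-clock time is the attribute fetch rather than the tree walk itself.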

The “Track Performance” property of the processor can be used to understand more about the performance characteristics of the disk access. Set that to true and enable DEBUG logging for the processor.

If there are heap concerns, generating a million FlowFiles, then you can set a Record Writer on the processor so that only a single FlowFile gets created. That can then be split up using a tiered approach (SplitRecord to split into 10,000 Record chunks, and then another SplitRecord to split each 10,000 Record chunk into a 1-Record chunk, and then EvaluateJsonPath, for instance, to pull the actual filename into an attribute). I suspect this is not the issue, with that much heap and given that it’s approximately 1 million files. But it may be a factor.
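[Editor's example, not part of the original thread: the tiered split above can be pictured in a few lines of Python. This is a sketch only; the record fields are hypothetical, and in a real flow SplitRecord and EvaluateJsonPath do this work on the Record Writer's output.]

```python
def chunked(records, size):
    """Yield successive fixed-size chunks, like SplitRecord's fan-out."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# One FlowFile from the Record Writer holds every listed file as a record.
listing = [{"filename": f"doc{i}.pdf", "path": "/archive"} for i in range(25_000)]

filenames = []
for big in chunked(listing, 10_000):        # first SplitRecord: 10,000-record chunks
    for record in chunked(big, 1):          # second SplitRecord: 1-record chunks
        filenames.append(record[0]["filename"])   # EvaluateJsonPath: $.filename
```

The point of the two tiers is that no single split ever fans one FlowFile out into a million children at once, which keeps heap pressure bounded.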

Also, setting the “Include File Attributes” to false can significantly improve performance, especially on a remote network drive, or some specific types of drives/OS’s.

Would recommend you play around with the above options to better understand the performance characteristics of your particular environment.

Thanks
-Mark

On Mar 19, 2021, at 12:57 PM, Mike Sofen <MS...@ansunbiopharma.com> wrote:

I’ve built a document processing solution in Nifi, using the ListFile/FetchFile model hitting a large document repository on our Windows file server.  It’s nearly a million files ranging in size from 100kb to 300mb, and file types of pdf, doc/docx, xls/xlsx, pptx, text, xml, rtf, png, tiff and some specialized binary files.  The million files are distributed across tens of thousands of folders.

The challenge is, for an example subfolder that has 25k files in 11k folders totalling 17gb, it took upwards of 30 minutes for a single ListFile to generate a list and send it downstream to the next processor.  It’s running on a PC with the latest gen core i7 with 32gb ram and a 1TB SSD – plenty of horsepower and speed.  My bootstrap.conf has the java.arg.2=-Xms4g and java.arg.3=-Xmx16g.

Is there any way to speed up ListFile?

Also, is there any way to detect that a file is encrypted?  I’m sending these for processing by Tika and Tika generates an error when it receives an encrypted file (we have just a few of those, but enough to be annoying).

Mike Sofen
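[Editor's example, not part of the original thread: on the encryption question, PDFs at least allow a cheap pre-filter before Tika sees the file, because an encrypted PDF references an /Encrypt dictionary, normally in the trailer near the end. This is a heuristic sketch only; incrementally updated PDFs can place /Encrypt earlier than the tail scan reaches, and Office formats need different checks.]

```python
import os

def looks_encrypted_pdf(path):
    """Heuristic pre-filter for encrypted PDFs.

    An encrypted PDF's trailer carries an /Encrypt entry, so scanning
    the last few KB catches most cases.  Best effort only: Tika's
    EncryptedDocumentException remains the authoritative signal.
    """
    with open(path, "rb") as f:
        if f.read(5)[:4] != b"%PDF":
            return False
        # Jump to the tail of the file, where the trailer normally lives.
        f.seek(max(0, os.path.getsize(path) - 8192))
        return b"/Encrypt" in f.read()
```

In a flow, a check like this could run in an ExecuteScript step to route the handful of encrypted files away from the Tika processor instead of letting them error out.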