Posted to issues@commons.apache.org by "Stephen Kestle (JIRA)" <ji...@apache.org> on 2011/05/05 13:44:03 UTC

[jira] [Created] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

FileUtils.copyDirectory should be able to handle arbitrary number of files
--------------------------------------------------------------------------

                 Key: IO-271
                 URL: https://issues.apache.org/jira/browse/IO-271
             Project: Commons IO
          Issue Type: Improvement
          Components: Utilities
    Affects Versions: 2.0.1
            Reporter: Stephen Kestle


File.listFiles() uses up to a bit over 2 times as much memory as File.list().  The latter should be used in doCopyDirectory where there is no filter specified.

This memory usage is a problem when copying directories with hundreds of thousands of files.
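A rough sketch of the unfiltered case, for illustration only. {{NameBasedCopy}} and {{copyByName}} are hypothetical names, not Commons IO code, and {{java.nio.file.Files.copy}} (Java 7+) stands in for whatever per-file copy is actually used. It handles a flat directory only and keeps just the {{String[]}} of names alive, creating one short-lived {{File}} per entry:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of Commons IO. */
final class NameBasedCopy {

    /** Copies the regular files directly inside srcDir; returns how many were copied. */
    static int copyByName(File srcDir, File destDir) throws IOException {
        // list() returns only the entry names, not File objects.
        String[] names = srcDir.list();
        if (names == null) {
            throw new IOException("Failed to list contents of " + srcDir);
        }
        if (!destDir.isDirectory() && !destDir.mkdirs()) {
            throw new IOException("Cannot create " + destDir);
        }
        int copied = 0;
        for (String name : names) {
            // One short-lived File per entry; it becomes garbage after this iteration.
            File source = new File(srcDir, name);
            if (source.isFile()) {
                Files.copy(source.toPath(), new File(destDir, name).toPath(),
                        StandardCopyOption.REPLACE_EXISTING,
                        StandardCopyOption.COPY_ATTRIBUTES);
                copied++;
            }
        }
        return copied;
    }
}
```

Subdirectories are skipped here; a real implementation would recurse into them.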

I was also thinking of implementing a file filter (one that could be composed with the supplied filter) that would batch the file copy operation: copy the first 10000 files (that match), then the next 10000, and so on.

Because of the lack of ordering consistency (between runs) of File.listFiles(), there would need to be a final file filter that would accept files that have not successfully been copied.
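One way the batching idea might look, as a sketch under stated assumptions rather than a proposal for the actual API. {{BatchingFileFilter}} is a hypothetical class. It assumes a flat source directory and treats "a file of the same name already exists in the destination" as the marker for "already copied", which only holds when copying into an initially empty directory:

```java
import java.io.File;
import java.io.FileFilter;

/** Hypothetical helper for illustration; not part of Commons IO. */
final class BatchingFileFilter implements FileFilter {
    private final FileFilter delegate; // caller-supplied filter; null means "accept all"
    private final File destDir;
    private final int batchSize;
    private int acceptedInBatch;

    BatchingFileFilter(FileFilter delegate, File destDir, int batchSize) {
        this.delegate = delegate;
        this.destDir = destDir;
        this.batchSize = batchSize;
    }

    /** Call before each listing pass to start a new batch. */
    void startBatch() {
        acceptedInBatch = 0;
    }

    /** Number of files accepted since the last startBatch() call. */
    int acceptedInBatch() {
        return acceptedInBatch;
    }

    @Override
    public boolean accept(File source) {
        if (acceptedInBatch >= batchSize) {
            return false; // this batch is full
        }
        if (delegate != null && !delegate.accept(source)) {
            return false;
        }
        // Assumption: presence in destDir means the file was copied in an earlier
        // batch, so listing order does not need to be stable between passes.
        if (new File(destDir, source.getName()).exists()) {
            return false;
        }
        acceptedInBatch++;
        return true;
    }
}
```

A caller would loop: call {{startBatch()}}, run the copy with this filter, and stop once {{acceptedInBatch()}} stays at zero. Each pass still lists the whole source directory, so this bounds the size of the resulting {{File[]}}, not the cost of the listing itself.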

I'm primarily concerned about copying into an empty directory (I validate this beforehand). For general operation where it's a merge, though, the modification date re-writing should only be done in the final run of copies, so that during batching (and indeed the final "missed" filtering) files do not get copied if they have been modified after the start time. (I presume that I'm reading FileUtils correctly in that it overrides files...)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Sebb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030476#comment-13030476 ] 

Sebb commented on IO-271:
-------------------------

I'm not sure the memory usage checking strategy is appropriate. If you are near the limits of memory, creating the original list may well tip you over the limit anyway.

Further, for very large directories, even a String[] array may be too much.

As I wrote earlier, the only sure way to fix this is to process the file entries one by one, but Java does not seem to provide this.
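For reference, Java 7 and later do offer entry-by-entry iteration through {{java.nio.file.Files.newDirectoryStream}}, which avoids building an array of all entries. A minimal sketch, assuming Java 7+ and a flat directory; {{StreamingCopy}} and {{copyFlat}} are hypothetical names, not Commons IO methods:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of Commons IO. */
final class StreamingCopy {

    /** Copies the regular files directly inside srcDir one at a time. */
    static long copyFlat(Path srcDir, Path destDir) throws IOException {
        Files.createDirectories(destDir);
        long copied = 0;
        // The stream hands out entries lazily; no full listing is held in memory.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(srcDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    Files.copy(entry, destDir.resolve(entry.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING,
                            StandardCopyOption.COPY_ATTRIBUTES);
                    copied++;
                }
            }
        }
        return copied;
    }
}
```

The try-with-resources block matters: a {{DirectoryStream}} holds an open directory handle until it is closed.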

As already explained, listFiles() is more efficient at creating the File entries than list() plus new File(), so I don't think the general case should be changed even in the non-filter case.

AFAICT, your use case is very unusual. Given the difficulties that such large directories are likely to cause other applications, and the fact that it is not possible to support arbitrarily large numbers of files, I would look to see if I could reduce the directory size, e.g. by splitting into subdirectories. That would probably improve file system performance too.

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Sebb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030354#comment-13030354 ] 

Sebb commented on IO-271:
-------------------------

If using String[] list() instead of File[] listFiles():
* when using a filter, each String has to be turned into a File.
* the copy stage also requires the String to be turned into a File.

Using String[] does reduce the maximum memory requirements as the File lifetime is very short.
However, in the filtered case it can double the number of File instances that need to be created.

Also, the listFiles() methods are more efficient, because they take advantage of the fact that the list() entries have already been normalised.

I'm not sure these trade-offs are worth it for the general case.

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Sebb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030828#comment-13030828 ] 

Sebb commented on IO-271:
-------------------------

As already noted, listFiles() uses a private File constructor to create the File instances.
This is able to bypass the normalisation which the public ctors have to perform, so list() + new File() is less efficient than listFiles().

By the way, most OSes will have a backup tool which is likely to be considerably more efficient than Ant or Commons IO.

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Sebb (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147959#comment-13147959 ] 

Sebb commented on IO-271:
-------------------------

Yes, I think WontFix is appropriate.

[jira] [Resolved] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Sebb (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb resolved IO-271.
---------------------

    Resolution: Won't Fix

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Henri Yandell (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147904#comment-13147904 ] 

Henri Yandell commented on IO-271:
----------------------------------

Should this be resolved as WontFix?

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Sebb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029669#comment-13029669 ] 

Sebb commented on IO-271:
-------------------------

Using list() instead of listFiles() would be possible, but would only double the size of a directory that could be processed.
The only way to truly fix the problem would be to use a method that provided access to the file names one by one, but there does not appear to be a method to do this.

AFAICT FileUtils does not override anything - anyway, why would it be necessary to delay updating the mod. date on the target file?

Personally, I don't think this is worth implementing. Users can always implement their own filtering to split the transfer into chunks. Or just make sure that directories don't contain so many files - this is likely to cause problems elsewhere as well.

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Stephen Kestle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030470#comment-13030470 ] 

Stephen Kestle commented on IO-271:
-----------------------------------

Yeah, I tend to agree in the general case. Perhaps it's a case of providing an override switch to be low memory instead of fast(er). Although I think if this were to be done, I'd check memory usage every 100k files and evaluate whether reversion to names is necessary.

Of course, the chances of hitting this sort of issue when using a filter are even lower, so why not just use an Object array for {{list()}} and {{listFiles(filter)}}? Resolving {{File originFile = filter == null ? new File(srcDir, (String) files[i]) : (File) files[i];}} isn't so bad, is it?

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Stephen Kestle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029703#comment-13029703 ] 

Stephen Kestle commented on IO-271:
-----------------------------------

I created a main method which will create a Thread that will either {{list()}} or {{listFiles()}} for 500,000 files under the following conditions:
* the full canonical directory
* the relative "log" directory when running main from my application dir
* the "." directory when running main from my application's log dir

{{list()}} required a constant {{-Xmx41m}} for all invocations
{{listFiles()}} required:
* 91MB for "."
* 94MB for "log" (which is 3 chars * 2 bytes * 2 copies * 500000 = 3MB difference)
* A whopping 181MB for the full canonical Program Files path (which is the most likely path we'd be using)

_Note that the JVM needs somewhere between 1000k and 1500k to launch._

So the memory usage is something like 4.5 times, which I think is significant enough to fix.
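A harness along these lines could reproduce the measurement. This is a sketch, not the code that was actually run; {{ListingProbe}} and its argument handling are assumptions made for illustration:

```java
import java.io.File;

/** Hypothetical measurement harness for illustration. */
final class ListingProbe {

    /** Lists dir either as File objects or as names and returns the entry count. */
    static int countEntries(File dir, boolean asFiles) {
        Object[] entries;
        if (asFiles) {
            entries = dir.listFiles(); // File[]: one File (plus path String) per entry
        } else {
            entries = dir.list();      // String[]: names only
        }
        if (entries == null) {
            throw new IllegalArgumentException("Not a readable directory: " + dir);
        }
        return entries.length;
    }

    // Usage: java -Xmx41m ListingProbe <directory> list|listFiles
    public static void main(String[] args) throws InterruptedException {
        File dir = new File(args[0]);
        boolean asFiles = "listFiles".equals(args[1]);
        Thread worker = new Thread(
                () -> System.out.println(countEntries(dir, asFiles) + " entries"));
        worker.start();
        worker.join();
    }
}
```

Lowering {{-Xmx}} until the run fails with an {{OutOfMemoryError}} gives the minimum heap for each mode and each way of naming the directory.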

I'd suggest that when the file filter is {{null}}, {{list()}} is used, and when a filter is given, use {{list(FilenameFilter)}} where the filter:
# takes the string
# creates a file object
# delegates to the given {{FileFilter}}
# throws away the File and accepts or rejects the String based on the {{FileFilter}} result
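The four steps above can be sketched as a small adapter. {{NameOnlyFilter}} and {{listNames}} are hypothetical names chosen for illustration; only {{java.io.File}}, {{java.io.FileFilter}} and {{java.io.FilenameFilter}} are real API:

```java
import java.io.File;
import java.io.FileFilter;
import java.io.FilenameFilter;

/** Hypothetical adapter for illustration; not part of Commons IO. */
final class NameOnlyFilter implements FilenameFilter {
    private final FileFilter delegate;

    NameOnlyFilter(FileFilter delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean accept(File dir, String name) {
        // Build a temporary File, ask the real filter, then let the File go;
        // only the name survives in the array returned by list().
        return delegate.accept(new File(dir, name));
    }

    /** Returns matching entry names, or all names when filter is null. */
    static String[] listNames(File dir, FileFilter filter) {
        return filter == null ? dir.list() : dir.list(new NameOnlyFilter(filter));
    }
}
```

The cost is the one raised elsewhere in this thread: each accepted name has to be turned into a {{File}} a second time when it is copied.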

Extra for experts (that's you guys :)); switch the above FileFilter behaviour based on the amount of free memory in the system when processing the files by retaining the {{File}} array, starting memory stats and a count etc. That is, if memory's getting low, and the number of Files in the (Object) array high, run through and replace the Files with their name, and continue by name.

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Stephen Kestle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029690#comment-13029690 ] 

Stephen Kestle commented on IO-271:
-----------------------------------

Late mod. date updating would be needed in edge cases around merging directories and detecting if a file had successfully been copied. This is due to "holes" that could form between batches.

After talking with others today, we came up with the idea of using an incremental file filter that does the copy operation, and then returns false, so that the list of Files does not grow.
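A sketch of that idea, assuming a flat directory and using {{java.nio.file.Files.copy}} (Java 7+) as a stand-in for the real copy step. {{CopyingFileFilter}} is a hypothetical name:

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Hypothetical "incremental" filter for illustration; not part of Commons IO. */
final class CopyingFileFilter implements FileFilter {
    private final File destDir;
    private int copied;

    CopyingFileFilter(File destDir) {
        this.destDir = destDir;
    }

    int copiedCount() {
        return copied;
    }

    @Override
    public boolean accept(File source) {
        if (source.isFile()) {
            try {
                Files.copy(source.toPath(), new File(destDir, source.getName()).toPath(),
                        StandardCopyOption.REPLACE_EXISTING,
                        StandardCopyOption.COPY_ATTRIBUTES);
            } catch (IOException e) {
                // FileFilter.accept cannot throw checked exceptions.
                throw new UncheckedIOException(e);
            }
            copied++;
        }
        // Always reject, so listFiles(FileFilter) never accumulates File objects.
        return false;
    }
}
```

Calling {{srcDir.listFiles(new CopyingFileFilter(destDir))}} then returns an empty array, with the copying done as a side effect. The underlying {{list()}} call still allocates the full array of names.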

My estimation of memory usage is actually fully incorrect - listFiles() is far worse:

# It calls {{list()}} (everything does, it's a native method)
# It allocates a new array for the files
# It creates the files and (on Linux) resolves a new string for the full path of the file.  So the deeper this directory is that has many files, the longer the path will be (I was only doing one short directory name when I said double memory usage)
* If you're using the {{listFiles(FileFilter)}} method, an {{ArrayList}} is populated, and then copied to an array at the end, using more memory.

*Notes:*
* Trying to find out how much memory is used *while* {{File}} is performing its internal copies and resolves is not trivial
* My memory use calculations (107 bytes vs 60 bytes for 10 char files in a 4 char directory) were after I'd done {{System.gc()}}.
* If I skipped the {{gc}} the Files took 167 bytes at the point of measuring after a 5 second sleep
* Our Ant tests (where this all started) seem to indicate that (for 500,000 files, under the same conditions as my test above)
** {{File.list()}} (which Ant's copy initially uses) requires around 30MB
** {{File.listFiles()}} (which commons-io uses) requires around 150MB
** These requirements were found by limiting the JVM Xmx settings until the respective {{File.list*()}} passed without an OOME.

I will post more conclusive results soon once I've done some more tests using Xmx with only the directory listing methods.

[jira] [Updated] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Sebb (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb updated IO-271:
--------------------

    Priority: Minor  (was: Major)

[jira] [Commented] (IO-271) FileUtils.copyDirectory should be able to handle arbitrary number of files

Posted by "Stephen Kestle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/IO-271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13030697#comment-13030697 ] 

Stephen Kestle commented on IO-271:
-----------------------------------

Did I mention I'm trying to do an automatic upgrade on a legacy application, and this is the first backup of the message archive (initially thought to be logs)? (I was using Ant's copy except that it's horrifically slow and bloated - it can use 100% of a CPU copying no files and run out of memory in almost the same time that copyDirectory finishes!).

So yeah, you can close the ticket... however, on Windows and Linux, the only native operation is {{list()}}, so I see no performance loss iterating over that array at copy time instead of in the {{listFiles()}} method.