Posted to dev@ant.apache.org by Pa...@nokia.com on 2006/01/20 01:18:02 UTC

Performance of fileset related operations with a large number of files

Hello,

I have been trialling Ant as a driver for a large-scale build execution.
The preparation before the build involves copying and unzipping >100,000
files spread across >20,000 directories. When using Ant's built-in copy
task with filesets that select large parts of these files, a long time
is spent building the list of files to copy, and holding that list also
takes a lot of memory. This is my understanding of how Ant works with
filesets after browsing the source.
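
To illustrate what I mean, this is roughly what a fileset scan boils
down to internally, as far as I can tell from the source (a minimal
sketch using DirectoryScanner directly; the base directory and pattern
are made up):

    import org.apache.tools.ant.DirectoryScanner;

    public class ScanDemo {
        public static void main(String[] args) {
            DirectoryScanner ds = new DirectoryScanner();
            ds.setBasedir("build/input");              // hypothetical base dir
            ds.setIncludes(new String[] {"**/*.zip"}); // hypothetical pattern
            ds.scan();  // walks the whole tree before returning
            // The complete match list is materialised as one array; for
            // >100,000 files this array (and the collections built while
            // scanning) is what eats the memory.
            String[] files = ds.getIncludedFiles();
            System.out.println(files.length + " files selected");
        }
    }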

Is there any way to avoid this high memory usage and time spent building
a list?

Has there ever been any consideration of refactoring the way Ant
processes filesets and similar constructs such that each selected file
is processed as soon as it is found, in an iterative fashion, rather
than building a complete list first and then processing it?
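
To make the idea concrete, something like this lazy walk (a purely
hypothetical sketch, not Ant code):

    import java.io.File;
    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical streaming walk: each file is handed to the callback
    // the moment it is discovered, so no complete list is ever held.
    public class StreamingWalk {
        interface Visitor { void visit(File f); }

        static void walk(File dir, Visitor v) {
            Deque<File> stack = new ArrayDeque<File>();
            stack.push(dir);
            while (!stack.isEmpty()) {
                File[] children = stack.pop().listFiles();
                if (children == null) continue;  // unreadable directory
                for (File child : children) {
                    if (child.isDirectory()) stack.push(child);
                    else v.visit(child);         // process immediately
                }
            }
        }
    }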

thanks

paul



Re: Performance of fileset related operations with a large number of files

Posted by Steve Loughran <st...@apache.org>.
Stefan Bodewig wrote:
> On Thu, 19 Jan 2006, Paul Mackay <Pa...@nokia.com> wrote:
> 
>> Is there any way to avoid this high memory usage and time spent
>> building a list?
> 
> No.  And you'll see that Ant 1.7 is both a bit better and a bit worse
> than it used to be.  DirectoryScanner has probably become a bit faster,
> but at the same time we've broadened the concept of FileSets to
> ResourceCollections, which means the copy task now works on more
> complex structures and even non-Files, which probably leads to a
> further slowdown.
> 
>> Has there ever been any consideration of refactoring the way Ant
>> processes filesets and similar constructs such that each selected
>> file is processed as soon as it is found, in an iterative fashion,
>> rather than building a complete list first and then processing it?
> 
> Apart from cosmetics like printing the number of files to copy (before
> actually copying them) and the backwards compatibility that Jeffrey
> mentioned, this would also break optimizations in Move, which checks
> whether a fileset matches a whole directory tree and then simply moves
> the root of that tree instead of the individual files.  To do that
> Move has to complete the directory scans before it starts to move
> anything.
> 

It's interesting to note that even rsync, my favourite dependency-driven
copy command (I often use it locally for its logic), does build up a file
list, so it has startup costs too.

Maybe a custom <dumbcopy> task could do optimal copying, one with its
own file collection type and a runtime that would do bulk NIO copying
while still enumerating what else there is to copy. It'd be hard to
optimise, as you get such different behaviour from different filesystems,
HDDs, network cables, etc.
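
The bulk-copy core of such a task might look something like this (a
rough sketch with plain java.nio and no Ant integration; error handling
kept minimal):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.channels.FileChannel;

    // Hypothetical NIO copy: FileChannel.transferTo lets the OS move the
    // bytes, avoiding a userspace read/write loop per file.
    public class NioCopy {
        static void copyFile(File src, File dest) throws IOException {
            FileChannel in = new FileInputStream(src).getChannel();
            FileChannel out = new FileOutputStream(dest).getChannel();
            try {
                long pos = 0, size = in.size();
                while (pos < size) {
                    pos += in.transferTo(pos, size - pos, out);
                }
            } finally {
                in.close();
                out.close();
            }
        }
    }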



Re: Performance of fileset related operations with a large number of files

Posted by Stefan Bodewig <bo...@apache.org>.
On Thu, 19 Jan 2006, Paul Mackay <Pa...@nokia.com> wrote:

> Is there any way to avoid this high memory usage and time spent
> building a list?

No.  And you'll see that Ant 1.7 is both a bit better and a bit worse
than it used to be.  DirectoryScanner has probably become a bit faster,
but at the same time we've broadened the concept of FileSets to
ResourceCollections, which means the copy task now works on more
complex structures and even non-Files, which probably leads to a
further slowdown.
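
For the curious, the interface behind that is deliberately small; it
amounts to roughly this (approximate, check the 1.7 sources for the
exact shape):

    // Roughly org.apache.tools.ant.types.ResourceCollection in Ant 1.7
    // (pre-generics, so iterator() yields Objects that are Resources).
    public interface ResourceCollection {
        java.util.Iterator iterator();  // iterate over the Resources
        int size();                     // may force a complete scan
        boolean isFilesystemOnly();     // true if all Resources are files
    }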

> Has there ever been any consideration of refactoring the way Ant
> processes filesets and similar constructs such that each selected
> file is processed as soon as it is found, in an iterative fashion,
> rather than building a complete list first and then processing it?

Apart from cosmetics like printing the number of files to copy (before
actually copying them) and the backwards compatibility that Jeffrey
mentioned, this would also break optimizations in Move, which checks
whether a fileset matches a whole directory tree and then simply moves
the root of that tree instead of the individual files.  To do that
Move has to complete the directory scans before it starts to move
anything.
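
Schematically, the shortcut is something like this (a simplified
sketch, not Move's actual code):

    import java.io.File;

    // Simplified illustration of the Move shortcut: if the scan showed
    // that the fileset matched everything under the base directory, a
    // single rename of the root suffices -- but you can only know that
    // after the scan has completed, hence the full list up front.
    public class MoveShortcut {
        static boolean tryBulkMove(File fromDir, File toDir,
                                   int matched, int total) {
            if (matched == total) {
                return fromDir.renameTo(toDir);  // one cheap rename
            }
            return false;  // caller falls back to per-file moves
        }
    }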

Stefan



Re: Performance of fileset related operations with a large number of files

Posted by Jeffrey E Care <ca...@us.ibm.com>.
I don't profess to know the code as well as the committers, but I don't 
see how this would be feasible. 

First off, how would an iterative approach actually be faster? 
DirectoryScanner (or its iterative equivalent) would still have to do the 
same amount of work to select the files, so even though you're spreading 
the time out over the entire operation, it should still take approximately 
the same amount of time overall to perform the file selection. The only 
way I can think of that an iterative approach would be faster is if you
write a fileset iterator that had a background thread making the 
selections, leaving the "main" thread to process the files.
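
Something like this, say (a purely hypothetical sketch, not a proposal
for the real API):

    import java.io.File;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical overlap of scanning and copying: a background thread
    // feeds selected files into a bounded queue while the main thread
    // drains it and does the actual copying.
    public class PipelinedScan {
        private static final File DONE = new File("");  // end-of-scan marker

        public static void main(String[] args) throws InterruptedException {
            final BlockingQueue<File> queue =
                new LinkedBlockingQueue<File>(1024);

            Thread scanner = new Thread(new Runnable() {
                public void run() {
                    try {
                        // ... walk the tree, queue.put(f) for each match ...
                        queue.put(DONE);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            scanner.start();

            File f;
            while ((f = queue.take()) != DONE) {
                // copy f here while the scanner keeps selecting
            }
        }
    }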

Of course, BC is always a concern as well. We certainly could not get rid 
of DirectoryScanner, or any of its public methods. Another BC concern
is that, IIRC, <copy> tells you how many files it's going to copy: an 
iterative implementation could not do that; we'd have to add an extra 
attribute to the copy task to enable the iterative behavior, which would
then likely mean entirely different blocks of logic...

Maybe instead of such a massive refactoring (which, even if accepted,
would not be delivered in an official Ant release for a long time) it
would be better to <exec> some native processes?
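
For instance (a hypothetical sketch in plain Java shelling out to a
Unix cp; in a build file this would naturally be an <exec> task with
the same arguments):

    import java.io.IOException;

    // Hypothetical shell-out: let a native tool do the bulk copy.
    public class NativeCopy {
        static int copyTree(String from, String to)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("cp", "-R", from, to);
            pb.redirectErrorStream(true);  // merge stderr into stdout
            return pb.start().waitFor();   // 0 on success
        }
    }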

JEC
-- 
Jeffrey E. Care (carej@us.ibm.com)
WebSphere v7 Release Engineer
WebSphere Build Tooling Lead (Project Mantis)

