Posted to dev@tomcat.apache.org by Jeremy Boynes <jb...@apache.org> on 2015/03/04 06:20:36 UTC

WAR FileSystem for fast nested JAR access?

In https://bz.apache.org/bugzilla/show_bug.cgi?id=57251, Mark Thomas wrote:

> The fix for bug 57472 might shave a few seconds off the deployment time but
> it doesn't appear to make a significant difference.
>
> The fundamental problem when running from a packed WAR is that to access any
> resource in a JAR, Tomcat has to do the following:
> - open the WAR
> - get the entry for the JAR
> - get the InputStream for the JAR entry
> - Create a JarInputStream
> - Read the JarInputStream until it finds the entry it wants
>
> This is always going to be slow.
>
> The reason that it is fast in Tomcat 7 and earlier took some digging. If
> unpackWARs is false in Tomcat 7, it unpacks the JARs anyway into the work
> directory and uses them from there. Performance is therefore comparable with
> unpackWARs="true".

Has anyone looked into using a NIO2 FileSystem for this? It may offer a way to avoid having to stream the entry in order to be able to locate a resource. ZipFile is fast, I believe, because it has random access to the archive and can seek directly to an entry's location based on the zip index; the jar: FileSystem seems to be able to do the same.

However, neither can cope with nested entries: ZipFile because its constructor takes a File rather than a Path and uses native code, and ZipFS because it relies on URIs and can't cope with a jar: URI based on another jar: URI (ye olde problem with jar: URL syntax).

What a FileSystem can do differently is return a FileChannel which supports seek operations over the archive's content. IOW, if ZipFS can work given a random access channel to bytes on disk, the same approach could be adopted with a random access channel to bytes on a virtual FileSystem.
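A shallow version of this already exists: the JDK's zip FileSystem can mount an archive on the default filesystem and hand back a seekable channel to an entry rather than a forward-only stream. A minimal sketch of that (class and entry names are invented for the example; the nested case is what it can't do today):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipFsSketch {

    // Create a small zip on disk so the example is self-contained.
    static Path createZip() throws IOException {
        Path zip = Files.createTempFile("demo", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            out.putNextEntry(new ZipEntry("hello.txt"));
            out.write("hello".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return zip;
    }

    // Mount the archive with the zip FileSystem and read one entry through
    // a SeekableByteChannel instead of streaming the whole archive.
    static String readEntry(Path zip, String name) throws IOException {
        try (FileSystem fs = FileSystems.newFileSystem(zip, (ClassLoader) null)) {
            Path entry = fs.getPath(name);
            try (SeekableByteChannel ch = Files.newByteChannel(entry)) {
                ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
                while (ch.read(buf) > 0) {
                    // fill the buffer
                }
                return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readEntry(createZip(), "hello.txt")); // prints hello
    }
}
```

The proposal is essentially to make `Files.newByteChannel` work when `zip` is itself a Path inside another archive.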

I imagine that would get pretty hairy for write operations but fortunately we would not need to deal with that.

If no-one’s looked at it yet I'll take a shot.
Cheers
Jeremy

FWIW, this could also be exposed to web applications, e.g.
  FileSystem webappFS = servletContext.getFileSystem();
  Path resource = webappFS.getPath(request.getPathInfo());
  Files.copy(resource, response.getOutputStream());


Re: WAR FileSystem for fast nested JAR access?

Posted by Jeremy Boynes <jb...@apache.org>.
On Mar 8, 2015, at 5:28 AM, Christopher Schultz <ch...@christopherschultz.net> wrote:
> 
> Jeremy,
> 
> On 3/7/15 1:13 PM, Jeremy Boynes wrote:
>> On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
>>> Interesting. The deciding factor for me will be performance. Keep
>>> in mind that we might not need all the API. As long as there is
>>> enough to implement WebResourceSet and WebResource, we probably
>>> have all we need.
>> 
>> I ran a micro-benchmark using the greenhouse WAR associated with the
>> original bug. I instrumented JarWarResource to log all resources
>> opened during startup and record the time. On my system it took
>> ~21,000ms to start the application of which ~16,000ms was spent in
>> getJarInputStreamWrapper(). 2935 resources were opened, primarily
>> class files.
>> 
>> I then replayed the log against the sandbox FS. With the current
>> implementation it took ~300ms to open the war, ~350ms to open all the
>> jars, and ~5ms to open all the entries with newInputStream().
>> 
>> I interpret that to mean that there is pretty constant time taken to
>> inflate 15MB of data - the 300ms to scan the archive and the ~350ms
>> to scan each of the jars within (each one that was used at least).
>> The speed up here comes because we only scan each archive once, the
>> downside is the extra memory used to store the inflated data.
>> 
>> This is promising enough to me that I’m going to keep exploring.
>> 
>> Konstantin’s patch, AIUI, creates an index for each jar which
>> eliminates the need to scan jars on the classpath that don’t contain
>> the class being requested. However, once the classloader has
>> determined the jar file to use we still need to stream through that
>> jar until we reach the desired entry.
> 
> I have a dumb question about this: why does the JAR file have to be
> /searched/ for a particular entry? Opening the JAR file should seek to
> the end of the file to read the TOC, and then the file offsets should be
> immediately available. Need file #27? Look in entries[27] and the offset
> into the file should be available.
> 
> Is the problem that, because the JAR is inside a WAR file, the "offset"
> into the JAR file is meaningless because there isn't an easy way to
> determine the mapping from uncompressed-offset to compressed-offset?

It’s a limitation of the classical API. We can open the WAR as a JarFile, which allows this type of random access using just this mechanism. You can access any entry and retrieve an InputStream, which we wrap as a JarInputStream. That’s not seekable, so the only way to locate an entry in that JarInputStream (e.g. a resource or class) is to search through that stream.

JarFile, though, can only open a File (i.e. something on the default filesystem). This is why it is much faster when the JAR is extracted to the file system, where it can then be opened with JarFile to give random access to its contents.

There are many code paths in our code and in the JDK (e.g. URLClassLoader/URLClassPath) that detect whether something is on disk and can be optimized (i.e. is it a directory which allows path manipulation, or is it a file: URL that could be opened using JarFile).
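A contrived sketch of the difference (class and entry names are invented for the example): JarFile jumps straight to an entry via the central directory, while JarInputStream has to walk every preceding entry in file order:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.zip.ZipEntry;

public class JarAccessSketch {

    // Build a jar with several entries so both access styles can be compared.
    static Path createJar(int entries) throws IOException {
        Path jar = Files.createTempFile("demo", ".jar");
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
            for (int i = 0; i < entries; i++) {
                out.putNextEntry(new ZipEntry("entry" + i + ".txt"));
                out.write(("data" + i).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return jar;
    }

    // Random access: JarFile reads the central directory up front and can
    // seek directly to any named entry.
    static String viaJarFile(Path jar, String name) throws IOException {
        try (JarFile jf = new JarFile(jar.toFile())) {
            try (InputStream in = jf.getInputStream(jf.getEntry(name))) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // Streaming access: a JarInputStream must walk entries in file order
    // until it happens upon the one we want.
    static int entriesScannedUntil(Path jar, String name) throws IOException {
        try (JarInputStream in = new JarInputStream(Files.newInputStream(jar))) {
            int scanned = 0;
            for (JarEntry e; (e = in.getNextJarEntry()) != null; ) {
                scanned++;
                if (e.getName().equals(name)) {
                    return scanned;
                }
            }
            return -1; // not present
        }
    }

    public static void main(String[] args) throws IOException {
        Path jar = createJar(100);
        System.out.println(viaJarFile(jar, "entry99.txt"));          // prints data99
        System.out.println(entriesScannedUntil(jar, "entry99.txt")); // prints 100
    }
}
```

For a JAR nested in a WAR only the streaming path is available, which is the whole problem.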

> 
>> I think we can avoid that here by digging into the zip file’s
>> internal metadata. Where I am currently streaming the jar to build
>> the directory, with random access I can build an index just by
>> reading the central directory structure. An index entry would contain
>> the name, metadata, and the offset in the archive of the entry’s
>> data.
> 
> Which archive do you mean, here? The inner JAR or the outer WAR?

Either, basically whatever archive is being mounted. During the mount, it builds an index of the entries in the jar. As a quick hack, it currently does that by scanning all the content using a JarInputStream. What I’m planning on doing next is building that index by reading the archive’s structure directly. Doing that requires seeking backwards through the zip data structures which is what the current APIs don’t support.
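Reading the structure directly is feasible because the zip format puts the End of Central Directory record at the tail of the archive, marked by the signature 0x06054b50 and carrying the entry count and central directory offset. A rough sketch of locating it by seeking backwards (no zip64 handling, names invented for the example):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class EocdSketch {

    static final int EOCD_SIG = 0x06054b50; // End Of Central Directory signature

    // Build a small archive so the sketch is self-contained.
    static Path createZip(int entries) throws IOException {
        Path zip = Files.createTempFile("demo", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (int i = 0; i < entries; i++) {
                out.putNextEntry(new ZipEntry("entry" + i + ".txt"));
                out.write(("data" + i).getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return zip;
    }

    // Seek backwards from the end of the file for the EOCD signature and
    // return the total entry count stored there (little-endian u16 at
    // offset 10 of the record). No zip64 handling - this is a sketch only.
    static int entryCountFromEocd(Path zip) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(zip.toFile(), "r")) {
            long len = raf.length();
            // The EOCD record is 22 bytes plus an optional trailing comment.
            for (long pos = len - 22; pos >= 0; pos--) {
                raf.seek(pos);
                // RandomAccessFile reads big-endian; zip fields are little-endian.
                if (Integer.reverseBytes(raf.readInt()) == EOCD_SIG) {
                    raf.seek(pos + 10);
                    int lo = raf.readUnsignedByte();
                    int hi = raf.readUnsignedByte();
                    return (hi << 8) | lo;
                }
            }
            throw new IOException("EOCD record not found");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entryCountFromEocd(createZip(5))); // prints 5
    }
}
```

The central directory offset in the same record then gives the file positions of every entry's data, which is the index the FileSystem needs.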

> 
>> When an entry is opened we would inflate the data so that it
>> could be used to underpin the channel. When the channel is closed the
>> memory would be released.
>> 
>> In general, I don’t think there’s a need for the FileSystem to retain
>> inflated data after the channel is closed. This would be particularly
>> true for the leaf resources which are not likely to be reused; for
>> example, once a ClassLoader has used the .class file to define the
>> Class or once a framework has processed a .xml config file then
>> neither will need it again.
> 
> You could use a small LRU cache or something if you wanted to get fancy.
> Once the majority of class loading is done, it might help for other
> resources that are requested with some regularity.
> 
>> However, I think the WAR ClassLoader would benefit from keeping the
>> JAR files on the classpath open to avoid re-inflating them. The
>> pattern though would be bursty e.g. lots of class loads during
>> startup followed by quiescence. I can think of two ways to handle
>> that: 1) FileSystem maintains a cache of inflated entries much
>> like a disk filesystem has buffers. The FileSystem would be
>> responsible for evictions, perhaps on a LRU or timed basis. 2) Having
>> the classloader keep the JARs opened/mounted after loading a resource
>> until such time as it thinks quiescence is reached. It would then
>> unmount JARs to free the memory. We could do both as they don’t
>> conflict.
> 
> I like both LRU + timed expiration. If a 1GiB resource is requested a
> single time and then nothing else for a long time (days?), that file
> will sit there hogging-up heap space.
> 
>> Next step will be to look into building the index directly from the
>> archive’s central directory rather than by streaming it.
> 
> Is this possible using java.util.zip.ZipFile? Skimming the API, it
> doesn't seem so. This kind of thing really ought to exist already.
> Perhaps there is an ASL-compatible tool available we could use.

Right, the current API does not allow it. Even if we proposed a patch to OpenJDK and it was accepted it wouldn’t help in the short term.

There are a couple of ASL-compatible tools I’ve looked at for inspiration:
* Sun’s FileSystem demo, which became the jar: FileSystem bundled in the JDK (BSD license)
* Apache Commons VFS, an earlier VFS implementation with Zip support. It predates the NIO2 APIs.

There’s also JZlib, a BSD-licensed (de-)compressor, but I’ve not looked at that yet as I think j.u.z.Inflater may be sufficient.

Re: WAR FileSystem for fast nested JAR access?

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Jeremy,

On 3/7/15 1:13 PM, Jeremy Boynes wrote:
> On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
>> Interesting. The deciding factor for me will be performance. Keep
>> in mind that we might not need all the API. As long as there is
>> enough to implement WebResourceSet and WebResource, we probably
>> have all we need.
> 
> I ran a micro-benchmark using the greenhouse WAR associated with the
> original bug. I instrumented JarWarResource to log all resources
> opened during startup and record the time. On my system it took
> ~21,000ms to start the application of which ~16,000ms was spent in
> getJarInputStreamWrapper(). 2935 resources were opened, primarily
> class files.
> 
> I then replayed the log against the sandbox FS. With the current
> implementation it took ~300ms to open the war, ~350ms to open all the
> jars, and ~5ms to open all the entries with newInputStream().
> 
> I interpret that to mean that there is pretty constant time taken to
> inflate 15MB of data - the 300ms to scan the archive and the ~350ms
> to scan each of the jars within (each one that was used at least).
> The speed up here comes because we only scan each archive once, the
> downside is the extra memory used to store the inflated data.
> 
> This is promising enough to me that I’m going to keep exploring.
> 
> Konstantin’s patch, AIUI, creates an index for each jar which 
> eliminates the need to scan jars on the classpath that don’t contain 
> the class being requested. However, once the classloader has 
> determined the jar file to use we still need to stream through that 
> jar until we reach the desired entry.

I have a dumb question about this: why does the JAR file have to be
/searched/ for a particular entry? Opening the JAR file should seek to
the end of the file to read the TOC, and then the file offsets should be
immediately available. Need file #27? Look in entries[27] and the offset
into the file should be available.

Is the problem that, because the JAR is inside a WAR file, the "offset"
into the JAR file is meaningless because there isn't an easy way to
determine the mapping from uncompressed-offset to compressed-offset?

> I think we can avoid that here by digging into the zip file’s
> internal metadata. Where I am currently streaming the jar to build
> the directory, with random access I can build an index just by
> reading the central directory structure. An index entry would contain
> the name, metadata, and the offset in the archive of the entry’s
> data.

Which archive do you mean, here? The inner JAR or the outer WAR?

> When an entry is opened we would inflate the data so that it
> could be used to underpin the channel. When the channel is closed the
> memory would be released.
> 
> In general, I don’t think there’s a need for the FileSystem to retain
> inflated data after the channel is closed. This would be particularly
> true for the leaf resources which are not likely to be reused; for
> example, once a ClassLoader has used the .class file to define the
> Class or once a framework has processed a .xml config file then
> neither will need it again.

You could use a small LRU cache or something if you wanted to get fancy.
Once the majority of class loading is done, it might help for other
resources that are requested with some regularity.
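A minimal sketch of such a cache, using LinkedHashMap's access order and removeEldestEntry for the LRU policy (class name invented; a real one would also want the timed expiration mentioned below):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for inflated entries: a LinkedHashMap in access order
// evicts the least-recently-used entry once capacity is exceeded.
public class InflatedEntryCache extends LinkedHashMap<String, byte[]> {

    private final int maxEntries;

    public InflatedEntryCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, which makes it LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        InflatedEntryCache cache = new InflatedEntryCache(2);
        cache.put("a.class", new byte[1]);
        cache.put("b.class", new byte[2]);
        cache.get("a.class");              // touch a, making b the eldest
        cache.put("c.class", new byte[3]); // evicts b.class
        System.out.println(cache.keySet()); // prints [a.class, c.class]
    }
}
```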

> However, I think the WAR ClassLoader would benefit from keeping the
> JAR files on the classpath open to avoid re-inflating them. The
> pattern though would be bursty e.g. lots of class loads during
> startup followed by quiescence. I can think of two ways to handle
> that: 1) FileSystem maintains a cache of inflated entries much
> like a disk filesystem has buffers. The FileSystem would be
> responsible for evictions, perhaps on a LRU or timed basis. 2) Having
> the classloader keep the JARs opened/mounted after loading a resource
> until such time as it thinks quiescence is reached. It would then
> unmount JARs to free the memory. We could do both as they don’t
> conflict.

I like both LRU + timed expiration. If a 1GiB resource is requested a
single time and then nothing else for a long time (days?), that file
will sit there hogging-up heap space.

> Next step will be to look into building the index directly from the
> archive’s central directory rather than by streaming it.

Is this possible using java.util.zip.ZipFile? Skimming the API, it
doesn't seem so. This kind of thing really ought to exist already.
Perhaps there is an ASL-compatible tool available we could use.

-chris


Re: WAR FileSystem for fast nested JAR access?

Posted by Jeremy Boynes <jb...@apache.org>.
On Mar 17, 2015, at 9:01 AM, Christopher Schultz <ch...@christopherschultz.net> wrote:
> 
> Jeremy,
> 
> On 3/17/15 2:39 AM, Jeremy Boynes wrote:
>> On Mar 7, 2015, at 10:13 AM, Jeremy Boynes <jb...@apache.org> wrote:
>>> 
>>> On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
>>>> Interesting. The deciding factor for me will be performance. Keep in
>>>> mind that we might not need all the API. As long as there is enough to
>>>> implement WebResourceSet and WebResource, we probably have all we need.
>>> 
>>> I ran a micro-benchmark using the greenhouse WAR associated with the original bug. I instrumented JarWarResource to log all resources opened during startup and record the time. On my system it took ~21,000ms to start the application of which ~16,000ms was spent in getJarInputStreamWrapper(). 2935 resources were opened, primarily class files.
>>> 
>>> I then replayed the log against the sandbox FS. With the current implementation it took ~300ms to open the war, ~350ms to open all the jars, and ~5ms to open all the entries with newInputStream().
>>> 
>>> I interpret that to mean that there is pretty constant time taken to inflate 15MB of data - the 300ms to scan the archive and the ~350ms to scan each of the jars within (each one that was used at least). The speed up here comes because we only scan each archive once, the downside is the extra memory used to store the inflated data.
>>> 
>>> This is promising enough to me that I’m going to keep exploring.
>>> 
>>> Konstantin’s patch, AIUI, creates an index for each jar which eliminates the need to scan jars on the classpath that don’t contain the class being requested. However, once the classloader has determined the jar file to use we still need to stream through that jar until we reach the desired entry.
>>> 
>>> I think we can avoid that here by digging into the zip file’s internal metadata. Where I am currently streaming the jar to build the directory, with random access I can build an index just by reading the central directory structure. An index entry would contain the name, metadata, and the offset in the archive of the entry’s data. When an entry is opened we would inflate the data so that it could be used to underpin the channel. When the channel is closed the memory would be released.
>>> 
>>> In general, I don’t think there’s a need for the FileSystem to retain inflated data after the channel is closed. This would be particularly true for the leaf resources which are not likely to be reused; for example, once a ClassLoader has used the .class file to define the Class or once a framework has processed a .xml config file then neither will need it again.
>>> 
>>> However, I think the WAR ClassLoader would benefit from keeping the JAR files on the classpath open to avoid re-inflating them. The pattern though would be bursty e.g. lots of class loads during startup followed by quiescence. I can think of two ways to handle that:
>>> 1) FileSystem maintains a cache of inflated entries much like a disk filesystem has buffers
>>>  The FileSystem would be responsible for evictions, perhaps on a LRU or timed basis.
>>> 2) Having the classloader keep the JARs opened/mounted after loading a resource until such time as it thinks quiescence is reached. It would then unmount JARs to free the memory.
>>> We could do both as they don’t conflict.
>>> 
>>> Next step will be to look into building the index directly from the archive’s central directory rather than by streaming it.
>> 
>> Next step was actually just to verify that we could make a URLClassLoader work with this API. I got this to work by turning the path URIs into collection URLs (ending in ‘/’) which prevented the classloader from trying to open them as JarFiles.
>> 
>> The classloader works but the classpath search is pretty inefficient, relying on URLConnection#getInputStream throwing an Exception to detect if a resource exists. Using it to load the 2935 resources from before took ~1900ms even after the jars had been indexed. getInputStream() was called ~120,000 times as the classpath was scanned, i.e. ~15µs per check with an average of ~40 checks per resource, which seems about right for a classpath that contains 73 jars.
>> 
>> An obvious solution to avoid the repeated search would be to union the jars’ directories into a single index. I may try this with a PathClassLoader that operates using a list of Paths rather than URLs.
> 
> I just wanted to let you know that I'm reading these with interest. I'm
> anxious to find out if this is going to pan-out.

Thanks. Real life is a bit busy at the moment so progress will be sporadic. If you or anyone would like to jump in, there are a few areas which still have unknowns:
* a way to read the zip’s central directory
* a way to seek into a deflated zip entry without inflating the entire thing
* is a ClassLoader built from a list of Paths helpful?
* how to deal with the locking model on Windows platform
* how to work with Paths that are directories - do we get this for free?
* how to use the WatchService to detect changes e.g. web.xml or *.jsp touched?
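On the WatchService question, the API already works for plain directories; whether a WAR-backed FileSystem could surface equivalent events is exactly the unknown. A sketch against the default filesystem (names invented for the example):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

public class WatchSketch {

    // Watch a directory and return the name of the first created/modified
    // file, or null on timeout. Plain-directory sketch only; a virtual
    // FileSystem would have to synthesize these events itself.
    static String awaitChange(Path dir, long timeoutSeconds)
            throws IOException, InterruptedException {
        try (WatchService ws = dir.getFileSystem().newWatchService()) {
            dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE,
                             StandardWatchEventKinds.ENTRY_MODIFY);
            WatchKey key = ws.poll(timeoutSeconds, TimeUnit.SECONDS);
            if (key == null) {
                return null; // nothing changed before the timeout
            }
            for (WatchEvent<?> event : key.pollEvents()) {
                return event.context().toString();
            }
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("watch");
        // Touch a file from another thread so the watcher has something to see.
        new Thread(() -> {
            try {
                Thread.sleep(200);
                Files.write(dir.resolve("web.xml"), new byte[] { '<' });
            } catch (Exception ignored) {
            }
        }).start();
        System.out.println(awaitChange(dir, 10));
    }
}
```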

I think it’s time for a wiki page.
Cheers
Jeremy


Re: WAR FileSystem for fast nested JAR access?

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Jeremy,

On 3/17/15 2:39 AM, Jeremy Boynes wrote:
> On Mar 7, 2015, at 10:13 AM, Jeremy Boynes <jb...@apache.org> wrote:
>>
>> On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
>>> Interesting. The deciding factor for me will be performance. Keep in
>>> mind that we might not need all the API. As long as there is enough to
>>> implement WebResourceSet and WebResource, we probably have all we need.
>>
>> I ran a micro-benchmark using the greenhouse WAR associated with the original bug. I instrumented JarWarResource to log all resources opened during startup and record the time. On my system it took ~21,000ms to start the application of which ~16,000ms was spent in getJarInputStreamWrapper(). 2935 resources were opened, primarily class files.
>>
>> I then replayed the log against the sandbox FS. With the current implementation it took ~300ms to open the war, ~350ms to open all the jars, and ~5ms to open all the entries with newInputStream().
>>
>> I interpret that to mean that there is pretty constant time taken to inflate 15MB of data - the 300ms to scan the archive and the ~350ms to scan each of the jars within (each one that was used at least). The speed up here comes because we only scan each archive once, the downside is the extra memory used to store the inflated data.
>>
>> This is promising enough to me that I’m going to keep exploring.
>>
>> Konstantin’s patch, AIUI, creates an index for each jar which eliminates the need to scan jars on the classpath that don’t contain the class being requested. However, once the classloader has determined the jar file to use we still need to stream through that jar until we reach the desired entry.
>>
>> I think we can avoid that here by digging into the zip file’s internal metadata. Where I am currently streaming the jar to build the directory, with random access I can build an index just by reading the central directory structure. An index entry would contain the name, metadata, and the offset in the archive of the entry’s data. When an entry is opened we would inflate the data so that it could be used to underpin the channel. When the channel is closed the memory would be released.
>>
>> In general, I don’t think there’s a need for the FileSystem to retain inflated data after the channel is closed. This would be particularly true for the leaf resources which are not likely to be reused; for example, once a ClassLoader has used the .class file to define the Class or once a framework has processed a .xml config file then neither will need it again.
>>
>> However, I think the WAR ClassLoader would benefit from keeping the JAR files on the classpath open to avoid re-inflating them. The pattern though would be bursty e.g. lots of class loads during startup followed by quiescence. I can think of two ways to handle that:
>> 1) FileSystem maintains a cache of inflated entries much like a disk filesystem has buffers
>>   The FileSystem would be responsible for evictions, perhaps on a LRU or timed basis.
>> 2) Having the classloader keep the JARs opened/mounted after loading a resource until such time as it thinks quiescence is reached. It would then unmount JARs to free the memory.
>> We could do both as they don’t conflict.
>>
>> Next step will be to look into building the index directly from the archive’s central directory rather than by streaming it.
> 
> Next step was actually just to verify that we could make a URLClassLoader work with this API. I got this to work by turning the path URIs into collection URLs (ending in ‘/’) which prevented the classloader from trying to open them as JarFiles.
> 
> The classloader works but the classpath search is pretty inefficient, relying on URLConnection#getInputStream throwing an Exception to detect if a resource exists. Using it to load the 2935 resources from before took ~1900ms even after the jars had been indexed. getInputStream() was called ~120,000 times as the classpath was scanned, i.e. ~15µs per check with an average of ~40 checks per resource, which seems about right for a classpath that contains 73 jars.
> 
> An obvious solution to avoid the repeated search would be to union the jars’ directories into a single index. I may try this with a PathClassLoader that operates using a list of Paths rather than URLs.

I just wanted to let you know that I'm reading these with interest. I'm
anxious to find out if this is going to pan-out.

-chris


Re: WAR FileSystem for fast nested JAR access?

Posted by Jeremy Boynes <jb...@apache.org>.
On Mar 7, 2015, at 10:13 AM, Jeremy Boynes <jb...@apache.org> wrote:
> 
> On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
>> Interesting. The deciding factor for me will be performance. Keep in
>> mind that we might not need all the API. As long as there is enough to
>> implement WebResourceSet and WebResource, we probably have all we need.
> 
> I ran a micro-benchmark using the greenhouse WAR associated with the original bug. I instrumented JarWarResource to log all resources opened during startup and record the time. On my system it took ~21,000ms to start the application of which ~16,000ms was spent in getJarInputStreamWrapper(). 2935 resources were opened, primarily class files.
> 
> I then replayed the log against the sandbox FS. With the current implementation it took ~300ms to open the war, ~350ms to open all the jars, and ~5ms to open all the entries with newInputStream().
> 
> I interpret that to mean that there is pretty constant time taken to inflate 15MB of data - the 300ms to scan the archive and the ~350ms to scan each of the jars within (each one that was used at least). The speed up here comes because we only scan each archive once, the downside is the extra memory used to store the inflated data.
> 
> This is promising enough to me that I’m going to keep exploring.
> 
> Konstantin’s patch, AIUI, creates an index for each jar which eliminates the need to scan jars on the classpath that don’t contain the class being requested. However, once the classloader has determined the jar file to use we still need to stream through that jar until we reach the desired entry.
> 
> I think we can avoid that here by digging into the zip file’s internal metadata. Where I am currently streaming the jar to build the directory, with random access I can build an index just by reading the central directory structure. An index entry would contain the name, metadata, and the offset in the archive of the entry’s data. When an entry is opened we would inflate the data so that it could be used to underpin the channel. When the channel is closed the memory would be released.
> 
> In general, I don’t think there’s a need for the FileSystem to retain inflated data after the channel is closed. This would be particularly true for the leaf resources which are not likely to be reused; for example, once a ClassLoader has used the .class file to define the Class or once a framework has processed a .xml config file then neither will need it again.
> 
> However, I think the WAR ClassLoader would benefit from keeping the JAR files on the classpath open to avoid re-inflating them. The pattern though would be bursty e.g. lots of class loads during startup followed by quiescence. I can think of two ways to handle that:
> 1) FileSystem maintains a cache of inflated entries much like a disk filesystem has buffers
>   The FileSystem would be responsible for evictions, perhaps on a LRU or timed basis.
> 2) Having the classloader keep the JARs opened/mounted after loading a resource until such time as it thinks quiescence is reached. It would then unmount JARs to free the memory.
> We could do both as they don’t conflict.
> 
> Next step will be to look into building the index directly from the archive’s central directory rather than by streaming it.

Next step was actually just to verify that we could make a URLClassLoader work with this API. I got this to work by turning the path URIs into collection URLs (ending in ‘/’) which prevented the classloader from trying to open them as JarFiles.

The classloader works but the classpath search is pretty inefficient, relying on URLConnection#getInputStream throwing an Exception to detect if a resource exists. Using it to load the 2935 resources from before took ~1900ms even after the jars had been indexed. getInputStream() was called ~120,000 times as the classpath was scanned, i.e. ~15µs per check with an average of ~40 checks per resource, which seems about right for a classpath that contains 73 jars.

An obvious solution to avoid the repeated search would be to union the jars’ directories into a single index. I may try this with a PathClassLoader that operates using a list of Paths rather than URLs.
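A sketch of that union, first-jar-wins to match classloader search order, using the JDK's zip FileSystem over each jar (names invented; assumes the jars sit on the default filesystem for the demo):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class UnionIndexSketch {

    // Build a single first-jar-wins index mapping entry names to Paths,
    // so a lookup never has to probe each jar in classpath order.
    static Map<String, Path> buildIndex(List<Path> jars) throws IOException {
        Map<String, Path> index = new HashMap<>();
        for (Path jar : jars) {
            FileSystem fs = FileSystems.newFileSystem(jar, (ClassLoader) null);
            try (Stream<Path> entries = Files.walk(fs.getPath("/"))) {
                entries.filter(Files::isRegularFile)
                       .forEach(p -> index.putIfAbsent(p.toString(), p));
            }
            // Deliberately left open: the index holds live Paths into fs.
        }
        return index;
    }

    // Helper to create a jar with one entry for the demo below.
    static Path createJar(String entry, String data) throws IOException {
        Path jar = Files.createTempFile("demo", ".jar");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(jar))) {
            out.putNextEntry(new ZipEntry(entry));
            out.write(data.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        Path first = createJar("com/example/A.class", "first");
        Path second = createJar("com/example/A.class", "second");
        Map<String, Path> index = buildIndex(List.of(first, second));
        // The earlier jar on the "classpath" wins, as with classloader search.
        System.out.println(new String(
                Files.readAllBytes(index.get("/com/example/A.class")),
                StandardCharsets.UTF_8));
    }
}
```

A PathClassLoader would then resolve a class in one map lookup instead of ~40 probes.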

Cheers
Jeremy


Re: WAR FileSystem for fast nested JAR access?

Posted by Jeremy Boynes <je...@boynes.com>.
On Mar 8, 2015, at 9:53 AM, Mark Thomas <ma...@apache.org> wrote:
> 
> On 07/03/2015 18:13, Jeremy Boynes wrote:
>> I interpret that to mean that there is pretty constant time taken to
>> inflate 15MB of data - the 300ms to scan the archive and the ~350ms
>> to scan each of the jars within (each one that was used at least).
>> The speed up here comes because we only scan each archive once, the
>> downside is the extra memory used to store the inflated data.
> 
> Do you mean the entire deflated WAR is in memory? If we wanted to
> go that way it is fairly easy to do with the existing WebResource
> implementation.

It does in this quick and dirty version as I wanted to get a sense of what was possible i.e. does the API support the access patterns needed. The rest of the email is exploring ways to avoid that.

> 
>> In general, I don’t think there’s a need for the FileSystem to retain
>> inflated data after the channel is closed.
> 
> Agreed. When to cache or not should be left to WebResources.
> 
>> However, I think the WAR ClassLoader would benefit from keeping the
>> JAR files on the classpath open to avoid re-inflating them.
> 
> That isn't an option unless you can solve the locked file problem on
> Windows. It always has to be possible to delete the WAR on the file
> system to trigger an undeployment.

Remember, it would be opening files on the virtual filesystem, not the underlying OS, so we can control the locking behaviour. IIRC, part of the problem is that ZipFile uses mmap to access the file, which requires a lock on Windows; we may not want to do that (or it could be an OS-specific behavioural option).
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@tomcat.apache.org
For additional commands, e-mail: dev-help@tomcat.apache.org


Re: WAR FileSystem for fast nested JAR access?

Posted by Mark Thomas <ma...@apache.org>.
On 07/03/2015 18:13, Jeremy Boynes wrote:
> I interpret that to mean that there is pretty constant time taken to
> inflate 15MB of data - the 300ms to scan the archive and the ~350ms
> to scan each of the jars within (each one that was used at least).
> The speed up here comes because we only scan each archive once, the
> downside is the extra memory used to store the inflated data.

Do you mean the entire deflated WAR is in memory? If we wanted to
go that way it is fairly easy to do with the existing WebResource
implementation.

> In general, I don’t think there’s a need for the FileSystem to retain
> inflated data after the channel is closed.

Agreed. When to cache or not should be left to WebResources.

> However, I think the WAR ClassLoader would benefit from keeping the
> JAR files on the classpath open to avoid re-inflating them.

That isn't an option unless you can solve the locked file problem on
Windows. It always has to be possible to delete the WAR on the file
system to trigger an undeployment.

> The
> pattern though would be bursty e.g. lots of class loads during
> startup followed by quiescence. I can think of two ways to handle
> that: 1) FileSystem maintains a cache of inflated entries much
> like a disk filesystem has buffers. The FileSystem would be
> responsible for evictions, perhaps on a LRU or timed basis.

See comments above on caching.

> 2) Having
> the classloader keep the JARs opened/mounted after loading a resource
> until such time as it thinks quiescence is reached. It would then
> unmount JARs to free the memory.

See comments above on locked files.

There are good reasons why the Tomcat 7 and earlier solution was to
extract the JARs to the work directory. That might still be the neatest
solution with that dir then mounted to WEB-INF/lib as a pre-resource.

Mark



Re: WAR FileSystem for fast nested JAR access?

Posted by Jeremy Boynes <jb...@apache.org>.
On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
> Interesting. The deciding factor for me will be performance. Keep in
> mind that we might not need all the API. As long as there is enough to
> implement WebResourceSet and WebResource, we probably have all we need.

I ran a micro-benchmark using the greenhouse WAR associated with the original bug. I instrumented JarWarResource to log all resources opened during startup and record the time. On my system it took ~21,000ms to start the application of which ~16,000ms was spent in getJarInputStreamWrapper(). 2935 resources were opened, primarily class files.

I then replayed the log against the sandbox FS. With the current implementation it took ~300ms to open the war, ~350ms to open all the jars, and ~5ms to open all the entries with newInputStream().

I interpret that to mean that there is pretty constant time taken to inflate 15MB of data - the 300ms to scan the archive and the ~350ms to scan each of the jars within (each one that was used at least). The speed up here comes because we only scan each archive once, the downside is the extra memory used to store the inflated data.

This is promising enough to me that I’m going to keep exploring.

Konstantin’s patch, AIUI, creates an index for each jar which eliminates the need to scan jars on the classpath that don’t contain the class being requested. However, once the classloader has determined the jar file to use we still need to stream through that jar until we reach the desired entry.

I think we can avoid that here by digging into the zip file’s internal metadata. Where I am currently streaming the jar to build the directory, with random access I can build an index just by reading the central directory structure. An index entry would contain the name, metadata, and the offset in the archive of the entry’s data. When an entry is opened we would inflate the data so that it could be used to underpin the channel. When the channel is closed the memory would be released.
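As a sketch of that index-building step (class name, method names, and layout constants here are mine, not from the sandbox code): the End of Central Directory record at the tail of the archive gives the entry count and the central directory's offset, and each central directory record carries the entry name plus the offset of its local header, so the whole index can be built with two seeks and no inflation of entry data.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipOutputStream;

public class ZipIndexSketch {
    private static final int EOCD_SIG = 0x06054b50; // End Of Central Directory record
    private static final int CEN_SIG  = 0x02014b50; // central directory entry

    // Build a name -> local-header-offset index by reading only the
    // central directory, without streaming the archive's entry data.
    static Map<String, Long> buildIndex(SeekableByteChannel ch) throws IOException {
        long size = ch.size();
        // The EOCD record is 22 bytes plus an optional trailing comment of
        // up to 64KB, so scan backwards through the tail for its signature.
        int tail = (int) Math.min(size, 0xFFFF + 22);
        ByteBuffer buf = ByteBuffer.allocate(tail).order(ByteOrder.LITTLE_ENDIAN);
        ch.position(size - tail);
        while (buf.hasRemaining() && ch.read(buf) >= 0) { }
        int eocd = -1;
        for (int i = tail - 22; i >= 0; i--) {
            if (buf.getInt(i) == EOCD_SIG) { eocd = i; break; }
        }
        if (eocd < 0) throw new ZipException("EOCD record not found");
        int entries  = buf.getShort(eocd + 10) & 0xFFFF;
        long cenSize = buf.getInt(eocd + 12) & 0xFFFFFFFFL;
        long cenOff  = buf.getInt(eocd + 16) & 0xFFFFFFFFL;

        ByteBuffer cen = ByteBuffer.allocate((int) cenSize).order(ByteOrder.LITTLE_ENDIAN);
        ch.position(cenOff);
        while (cen.hasRemaining() && ch.read(cen) >= 0) { }

        Map<String, Long> index = new LinkedHashMap<>();
        int pos = 0;
        for (int i = 0; i < entries; i++) {
            if (cen.getInt(pos) != CEN_SIG) throw new ZipException("bad central directory entry");
            int nameLen    = cen.getShort(pos + 28) & 0xFFFF;
            int extraLen   = cen.getShort(pos + 30) & 0xFFFF;
            int commentLen = cen.getShort(pos + 32) & 0xFFFF;
            long lhdrOff   = cen.getInt(pos + 42) & 0xFFFFFFFFL; // offset of entry's local header
            byte[] name = new byte[nameLen];
            cen.position(pos + 46);
            cen.get(name);
            index.put(new String(name, "UTF-8"), lhdrOff);
            pos += 46 + nameLen + extraLen + commentLen;
        }
        return index;
    }

    public static void main(String[] args) throws Exception {
        // Demo: write a tiny archive, then index it via the central directory.
        Path tmp = Files.createTempFile("sketch", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(tmp))) {
            zos.putNextEntry(new ZipEntry("WEB-INF/lib/some.jar"));
            zos.write(new byte[] {1, 2, 3});
            zos.closeEntry();
        }
        try (SeekableByteChannel ch = Files.newByteChannel(tmp)) {
            System.out.println(buildIndex(ch).containsKey("WEB-INF/lib/some.jar"));
        }
        Files.delete(tmp);
    }
}
```

A real implementation would also need the Zip64 variants of these records for large archives; this sketch ignores them.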

In general, I don’t think there’s a need for the FileSystem to retain inflated data after the channel is closed. This would be particularly true for the leaf resources which are not likely to be reused; for example, once a ClassLoader has used the .class file to define the Class or once a framework has processed a .xml config file then neither will need it again.

However, I think the WAR ClassLoader would benefit from keeping the JAR files on the classpath open to avoid re-inflating them. The pattern though would be bursty e.g. lots of class loads during startup followed by quiescence. I can think of two ways to handle that:
1) FileSystem maintains a cache of inflated entries, much like a disk filesystem has buffers.
   The FileSystem would be responsible for evictions, perhaps on an LRU or timed basis.
2) Having the classloader keep the JARs opened/mounted after loading a resource until such time as it thinks quiescence is reached. It would then unmount JARs to free the memory.
We could do both as they don’t conflict.
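A minimal sketch of option 1 (all names hypothetical): a byte-capped, access-ordered map where a put evicts the least-recently-used inflated entries until the cache is back under budget.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of option 1: cache inflated entry data, capped by
// total bytes rather than entry count, evicting least-recently-used
// entries when over budget.
public class InflatedEntryCache {
    private final long maxBytes;
    private long currentBytes;
    // accessOrder=true makes iteration order least-recently-used first
    private final LinkedHashMap<String, byte[]> cache =
            new LinkedHashMap<>(16, 0.75f, true);

    InflatedEntryCache(long maxBytes) { this.maxBytes = maxBytes; }

    synchronized byte[] get(String name) { return cache.get(name); }

    synchronized void put(String name, byte[] data) {
        byte[] old = cache.put(name, data);
        if (old != null) currentBytes -= old.length;
        currentBytes += data.length;
        Iterator<Map.Entry<String, byte[]>> it = cache.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            if (eldest.getKey().equals(name)) continue; // never evict what we just added
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }

    public static void main(String[] args) {
        InflatedEntryCache c = new InflatedEntryCache(10);
        c.put("a.class", new byte[4]);
        c.put("b.class", new byte[4]);
        c.get("a.class");              // touch a, so b is now eldest
        c.put("c.class", new byte[4]); // 12 bytes > 10: evicts b
        System.out.println(c.get("b.class") == null); // true
        System.out.println(c.get("a.class") != null); // true
    }
}
```

The timed-eviction variant mentioned above could be layered on top, e.g. a background task dropping entries untouched for some interval; the two do not conflict.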

Next step will be to look into building the index directly from the archive’s central directory rather than by streaming it.

Cheers
Jeremy


Re: WAR FileSystem for fast nested JAR access?

Posted by Mark Thomas <ma...@apache.org>.
On 06/03/2015 15:30, Jeremy Boynes wrote:
> On Mar 4, 2015, at 9:09 AM, Jeremy Boynes <jb...@apache.org>
> wrote:
>> 
>> My suggestion for using an NIO2 FileSystem is because its API
>> provides for nesting and for random access to the entries in the
>> filesystem. Something like:
>> 
>> Path war = FileSystems.getDefault().getPath("real/path/of/application.war");
>> FileSystem warFS =
>>     FileSystems.newFileSystem(URI.create("war:" + war.toUri()), Collections.emptyMap());
>> Path nestedJar = warFS.getPath("/WEB-INF/lib/some.jar");
>> FileSystem jarFS =
>>     FileSystems.newFileSystem(URI.create("jar:" + nestedJar.toUri()), Collections.emptyMap());
>> Path resource = jarFS.getPath("some/resource.txt");
>> return Files.newInputStream(resource); // or newFileChannel(resource) etc.
>> 
>> There are two requirements on the archive FileSystem implementation
>> for this to work:
>> * Support for nesting in the URI
>> * Functioning implementation of newByteChannel or newFileChannel
>> 
>> Unfortunately the jar: provider that comes with the JRE won’t do
>> that. It has ye olde jar: URL nesting issues and requires the
>> archive Path be provided by the default FileSystem. Its
>> newByteChannel() returns a SeekableByteChannel that is not seekable
>> (doh!) and newFileChannel() works by extracting the entry to a temp
>> file.
>> 
>> The former problem seems easy to work around. To support a seekable
>> channel without extraction would be trickier as you would need to
>> convert channel positions to the actual position in the compressed
>> data which would mean digging into the compression block structure.
>> However, I think the naive approach of scanning the entry data and
>> then caching the block offsets would still be quicker than
>> inflating to a temp file.
> 
> I started exploring this in http://svn.apache.org/r1664650
> 
> This has a (very) crude implementation of a FileSystem that allows
> nesting on top of JAR-style archives using the newByteChannel() API.
>> It enables you to “mount” an archive and then access resources in it
> using the standard Files API.
> 
> Still to investigate is how Paths can be represented as URIs that can
> be converted to URLs to be returned from e.g. getResource(). I’m
> exploring URL-encoding the authority component to allow the base
> Path’s URI to be stored in a regular hierarchical URI with the
> nesting simply triggering multiple levels of encoding. This is
> different to the jar scheme’s approach of using non-hierarchical URIs
> with custom paths and the “!/“ separator.
> 
> Hopefully we would be able to create a URLClassLoader from those URLs
> and have it operate normally. The URLStreamHandler would handle
> archive: URLs by locating a mounted filesystem and opening a stream
> to the Path within. We would still need to install a custom
> URLStreamHandlerFactory.

Interesting. The deciding factor for me will be performance. Keep in
mind that we might not need all the API. As long as there is enough to
implement WebResourceSet and WebResource, we probably have all we need.

Mark



Re: WAR FileSystem for fast nested JAR access?

Posted by Jeremy Boynes <jb...@apache.org>.
On Mar 4, 2015, at 9:09 AM, Jeremy Boynes <jb...@apache.org> wrote:
> 
> My suggestion for using an NIO2 FileSystem is because its API provides for nesting and for random access to the entries in the filesystem. Something like:
> 
>   Path war = FileSystems.getDefault().getPath("real/path/of/application.war");
>   FileSystem warFS =
>       FileSystems.newFileSystem(URI.create("war:" + war.toUri()), Collections.emptyMap());
>   Path nestedJar = warFS.getPath("/WEB-INF/lib/some.jar");
>   FileSystem jarFS =
>       FileSystems.newFileSystem(URI.create("jar:" + nestedJar.toUri()), Collections.emptyMap());
>   Path resource = jarFS.getPath("some/resource.txt");
>   return Files.newInputStream(resource); // or newFileChannel(resource) etc.
> 
> There are two requirements on the archive FileSystem implementation for this to work:
> * Support for nesting in the URI
> * Functioning implementation of newByteChannel or newFileChannel
> 
> Unfortunately the jar: provider that comes with the JRE won’t do that. It has ye olde jar: URL nesting issues and requires the archive Path be provided by the default FileSystem. Its newByteChannel() returns a SeekableByteChannel that is not seekable (doh!) and newFileChannel() works by extracting the entry to a temp file.
> 
> The former problem seems easy to work around. To support a seekable channel without extraction would be trickier as you would need to convert channel positions to the actual position in the compressed data which would mean digging into the compression block structure. However, I think the naive approach of scanning the entry data and then caching the block offsets would still be quicker than inflating to a temp file.

I started exploring this in http://svn.apache.org/r1664650

This has a (very) crude implementation of a FileSystem that allows nesting on top of JAR-style archives using the newByteChannel() API. It enables you to “mount” an archive and then access resources in it using the standard Files API.

Still to investigate is how Paths can be represented as URIs that can be converted to URLs to be returned from e.g. getResource(). I’m exploring URL-encoding the authority component to allow the base Path’s URI to be stored in a regular hierarchical URI with the nesting simply triggering multiple levels of encoding. This is different to the jar scheme’s approach of using non-hierarchical URIs with custom paths and the “!/“ separator.

Hopefully we would be able to create a URLClassLoader from those URLs and have it operate normally. The URLStreamHandler would handle archive: URLs by locating a mounted filesystem and opening a stream to the Path within. We would still need to install a custom URLStreamHandlerFactory.
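To make the URLStreamHandlerFactory idea concrete, here is a toy sketch (all names are mine, and the "mounted filesystem" registry is just a Map keyed by path; the real thing would resolve the encoded authority to a mounted FileSystem and open a stream to the Path within):

```java
import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ArchiveHandlerSketch {
    // Stand-in for the registry of mounted filesystems.
    static final Map<String, byte[]> MOUNTS = new HashMap<>();

    public static void main(String[] args) throws Exception {
        // This is the part that needs a custom URLStreamHandlerFactory,
        // which can be installed only once per JVM.
        URL.setURLStreamHandlerFactory(protocol ->
                !"archive".equals(protocol) ? null : new URLStreamHandler() {
                    @Override
                    protected URLConnection openConnection(URL u) {
                        return new URLConnection(u) {
                            @Override public void connect() { }
                            @Override public InputStream getInputStream() throws IOException {
                                // Look up the path in the "mounted" registry.
                                byte[] data = MOUNTS.get(url.getPath());
                                if (data == null) throw new FileNotFoundException(url.toString());
                                return new ByteArrayInputStream(data);
                            }
                        };
                    }
                });

        MOUNTS.put("/WEB-INF/lib/some.jar!/some/resource.txt",
                "hello".getBytes(StandardCharsets.UTF_8));
        URL url = new URL("archive:/WEB-INF/lib/some.jar!/some/resource.txt");
        try (InputStream in = url.openStream()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Such URLs could then be fed to a URLClassLoader, which only needs openConnection() to work for the resources it loads.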

Cheers
Jeremy


Re: WAR FileSystem for fast nested JAR access?

Posted by Jeremy Boynes <jb...@apache.org>.
On Mar 4, 2015, at 3:49 AM, Konstantin Kolinko <kn...@gmail.com> wrote:
> 
> 2015-03-04 8:20 GMT+03:00 Jeremy Boynes <jb...@apache.org>:
>> In https://bz.apache.org/bugzilla/show_bug.cgi?id=57251, Mark Thomas wrote:
>> 
>>> The fix for bug 57472 might shave a few seconds of the deployment time but
>>> it doesn't appear to make a significant difference.
>>> 
>>> The fundamental problem when running from a packed WAR is that to access any
>>> resource in a JAR, Tomcat has to do the following:
>>> - open the WAR
>>> - get the entry for the JAR
>>> - get the InputStream for the JAR entry
>>> - Create a JarInputStream
>>> - Read the JarInputStream until it finds the entry it wants
>>> 
>>> This is always going to be slow.
>>> 
>>> The reason that it is fast in Tomcat 7 and earlier took some digging. If
>>> unpackWARs is false in Tomcat 7, it unpacks the JARs anyway into the work
>>> directory and uses them from there. Performance is therefore comparable with
>>> unpackWARs="true".
>> 
>> Has anyone looked into using a NIO2 FileSystem for this? It may offer a way to avoid having to stream the entry in order to be able to locate a resource. ZipFile is fast, I believe, because it has random access to the archive and can seek directly to an entry's location based on the zip index; the jar: FileSystem seems to be able to do the same.
>> 
>> However, neither can cope with nested entries: ZipFile because its constructor takes a File rather than a Path and uses native code, and ZipFS because it relies on URIs and can't cope with a jar: URI based on another jar: URI (ye olde problem with jar: URL syntax).
>> 
>> What a FileSystem can do differently is return a FileChannel which supports seek operations over the archive's content. IOW, if ZipFS can work given a random access channel to bytes on disk, the same approach could be adopted with a random access channel to bytes on a virtual FileSystem.
>> 
>> I imagine that would get pretty hairy for write operations but fortunately we would not need to deal with that.
>> 
>> If no-one’s looked at it yet I'll take a shot.
>> Cheers
>> Jeremy
>> 
>> FWIW, this could also be exposed to web applications e.g.
>>  FileSystem webappFS = servletContext.getFileSystem();
>>  Path resource = webappFS.getPath(request.getPathInfo());
>>  Files.copy(resource, response.getOutputStream());
>> 
> 
> The fundamental issue is how the data of JAR file (as a whole) is
> available via API.
> 
> To be able to use random access with the JAR you technically have to
> 
> 1) Jump to the end of the JAR file and read the ZIP index ("Central
> directory") that is located there. See the image at:
> http://en.wikipedia.org/wiki/Zip_%28file_format%29
> 
> 2) Jump to the specific file.
> 
> As the JAR itself is compressed, there is no real API to jump to a
> position in it, besides maybe InputStream.skip(). This skip() will
> involve the same overhead as the current implementation that scans the
> jar, unless the war has zero compression.
> 
> 
> Also
> 1. Reading the zip index takes time and would be better cached. That
> is the issue behind
> https://bz.apache.org/bugzilla/show_bug.cgi?id=52448
> 
> 2. It makes sense to cache the list of directories (packages) in the
> zip file. Scanning the whole jar for a class that is not present there
> is the worst case. A bonus is that it can improve handling of JARs
> that do not have explicit entries for directories.

I agree caching would help but I’m not convinced the lack thereof is the main cause of the speed issue here. From Mark’s description above, "Read the JarInputStream until it finds the entry it wants” sounds more problematic.

“Open the WAR” and “get the entry for the JAR” can use ZipFile, which uses random access to locate the bytes for the nested JAR. However, ZipFile only provides access to those bytes as an InputStream, so we need to stream to locate the resource entry.
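As a self-contained illustration of that constraint (paths and names are mine): ZipFile seeks straight to the nested JAR's bytes, but everything after zip.getInputStream() is a forward-only scan through the inflated stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class NestedEntrySketch {

    // The slow path: random access to the WAR, then stream-and-scan
    // through the nested JAR until the wanted entry turns up.
    static InputStream openNested(File war, String jarPath, String entryName)
            throws IOException {
        ZipFile zip = new ZipFile(war);            // random access to the WAR
        ZipEntry jarEntry = zip.getEntry(jarPath); // direct lookup via the zip index
        JarInputStream jin = new JarInputStream(zip.getInputStream(jarEntry));
        for (JarEntry e; (e = jin.getNextJarEntry()) != null; ) {
            if (e.getName().equals(entryName)) {
                // Positioned at the entry's inflated data. The caller must
                // close the stream; the ZipFile must stay open while reading.
                return jin;
            }
        }
        zip.close();
        throw new FileNotFoundException(entryName + " in " + jarPath);
    }

    public static void main(String[] args) throws Exception {
        // Build a WAR containing a JAR containing one resource.
        ByteArrayOutputStream jarBytes = new ByteArrayOutputStream();
        try (ZipOutputStream j = new ZipOutputStream(jarBytes)) {
            j.putNextEntry(new ZipEntry("some/resource.txt"));
            j.write("hi".getBytes(StandardCharsets.UTF_8));
            j.closeEntry();
        }
        File war = File.createTempFile("sketch", ".war");
        try (ZipOutputStream w = new ZipOutputStream(new FileOutputStream(war))) {
            w.putNextEntry(new ZipEntry("WEB-INF/lib/some.jar"));
            w.write(jarBytes.toByteArray());
            w.closeEntry();
        }
        try (InputStream in = openNested(war, "WEB-INF/lib/some.jar", "some/resource.txt")) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        war.delete();
    }
}
```

Every lookup pays for the scan again, which is exactly the cost the random-access FileSystem approach is trying to remove.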

As an aside, there’s also the issue that zip archives can have zombie entries left in the stream but removed from the central directory, so the only way to know if an entry should actually be returned is to read to the directory which happens to be at the end. AIUI, ZipInputStream will return those zombies as it proceeds. This is seldom an issue for JARs as they typically don’t have zombies.

My suggestion for using an NIO2 FileSystem is because its API provides for nesting and for random access to the entries in the filesystem. Something like:

   Path war = FileSystems.getDefault().getPath("real/path/of/application.war");
   FileSystem warFS =
       FileSystems.newFileSystem(URI.create("war:" + war.toUri()), Collections.emptyMap());
   Path nestedJar = warFS.getPath("/WEB-INF/lib/some.jar");
   FileSystem jarFS =
       FileSystems.newFileSystem(URI.create("jar:" + nestedJar.toUri()), Collections.emptyMap());
   Path resource = jarFS.getPath("some/resource.txt");
   return Files.newInputStream(resource); // or newFileChannel(resource) etc.

There are two requirements on the archive FileSystem implementation for this to work:
* Support for nesting in the URI
* Functioning implementation of newByteChannel or newFileChannel

Unfortunately the jar: provider that comes with the JRE won’t do that. It has ye olde jar: URL nesting issues and requires the archive Path be provided by the default FileSystem. Its newByteChannel() returns a SeekableByteChannel that is not seekable (doh!) and newFileChannel() works by extracting the entry to a temp file.

The former problem seems easy to work around. To support a seekable channel without extraction would be trickier as you would need to convert channel positions to the actual position in the compressed data which would mean digging into the compression block structure. However, I think the naive approach of scanning the entry data and then caching the block offsets would still be quicker than inflating to a temp file.

—
Jeremy


Re: WAR FileSystem for fast nested JAR access?

Posted by Konstantin Kolinko <kn...@gmail.com>.
2015-03-04 8:20 GMT+03:00 Jeremy Boynes <jb...@apache.org>:
> In https://bz.apache.org/bugzilla/show_bug.cgi?id=57251, Mark Thomas wrote:
>
>> The fix for bug 57472 might shave a few seconds of the deployment time but
>> it doesn't appear to make a significant difference.
>>
>> The fundamental problem when running from a packed WAR is that to access any
>> resource in a JAR, Tomcat has to do the following:
>> - open the WAR
>> - get the entry for the JAR
>> - get the InputStream for the JAR entry
>> - Create a JarInputStream
>> - Read the JarInputStream until it finds the entry it wants
>>
>> This is always going to be slow.
>>
>> The reason that it is fast in Tomcat 7 and earlier took some digging. In
>> unpackWARs is false in Tomcat 7, it unpacks the JARs anyway into the work
>> directory and uses them from there. Performance is therefore comparable with
>> unpackWARs="true".
>
> Has anyone looked into using a NIO2 FileSystem for this? It may offer a way to avoid having to stream the entry in order to be able to locate a resource. ZipFile is fast, I believe, because it has random access to the archive and can seek directly to an entry's location based on the zip index; the jar: FileSystem seems to be able to do the same.
>
> However, neither can cope with nested entries: ZipFile because its constructor takes a File rather than a Path and uses native code, and ZipFS because it relies on URIs and can't cope with a jar: URI based on another jar: URI (ye olde problem with jar: URL syntax).
>
> What a FileSystem can do differently is return a FileChannel which supports seek operations over the archive's content. IOW, if ZipFS can work given a random access channel to bytes on disk, the same approach could be adopted with a random access channel to bytes on a virtual FileSystem.
>
> I imagine that would get pretty hairy for write operations but fortunately we would not need to deal with that.
>
> If no-one’s looked at it yet I'll take a shot.
> Cheers
> Jeremy
>
> FWIW, this could also be exposed to web applications e.g.
>   FileSystem webappFS = servletContext.getFileSystem();
>   Path resource = webappFS.getPath(request.getPathInfo());
>   Files.copy(resource, response.getOutputStream());
>

The fundamental issue is how the data of JAR file (as a whole) is
available via API.

To be able to use random access with the JAR you technically have to

1) Jump to the end of the JAR file and read the ZIP index ("Central
directory") that is located there. See the image at:
http://en.wikipedia.org/wiki/Zip_%28file_format%29

2) Jump to the specific file.

As the JAR itself is compressed, there is no real API to jump to a
position in it, besides maybe InputStream.skip(). This skip() will
involve the same overhead as the current implementation that scans the
jar, unless the war has zero compression.


Also
1. Reading the zip index takes time and would be better cached. That
is the issue behind
https://bz.apache.org/bugzilla/show_bug.cgi?id=52448

2. It makes sense to cache the list of directories (packages) in the
zip file. Scanning the whole jar for a class that is not present there
is the worst case. A bonus is that it can improve handling of JARs
that do not have explicit entries for directories.

Best regards,
Konstantin Kolinko
