Posted to dev@commons.apache.org by kvr000 <gi...@git.apache.org> on 2017/04/23 06:48:57 UTC

[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

GitHub user kvr000 opened a pull request:

    https://github.com/apache/commons-compress/pull/21

    COMPRESS-388: Fix concurrent reads performance

    Ticket: https://issues.apache.org/jira/browse/COMPRESS-388
    
    Concurrent reads on a ZipFile archive are terribly slow on multiprocessor systems. On my 4 CPU laptop it shows 26 reads/s vs 2 reads/s on 100 MB samples, for example.

    The cause is the use of synchronized blocks around access to the underlying file channel. This may be required for a generic SeekableByteChannel, but in the most common case the channel is a FileChannel, which supports lock-free reading from any position (i.e. via the pread/pwrite system calls or their equivalents).

    With the fix, reads are about 10 times faster (on a 4 CPU system; with more processors the difference should grow significantly).
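
    To illustrate the difference (a minimal sketch, not the code of this pull request; the helper names are assumptions): a generic SeekableByteChannel has a single shared position, so every read has to serialise on the channel, whereas FileChannel offers a positional read that leaves the channel position untouched and can be called from several threads at once.

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.channels.SeekableByteChannel;

        class PositionalReadSketch {
            // Generic channel: the shared position forces all readers to serialise.
            static int readAt(SeekableByteChannel ch, ByteBuffer buf, long pos) throws IOException {
                synchronized (ch) {
                    ch.position(pos);
                    return ch.read(buf);
                }
            }

            // FileChannel: the positional read never touches the channel's position,
            // so no lock is needed and readers do not block each other.
            static int readAt(FileChannel ch, ByteBuffer buf, long pos) throws IOException {
                return ch.read(buf, pos);
            }
        }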

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kvr000/commons-compress feature/COMPRESS-388-concurrent-reads-performance-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/commons-compress/pull/21.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21
    
----
commit 3c8672db0a68dcd6b59c36abc3ec676b5af02f0a
Author: Zbynek Vyskovsky <kv...@gmail.com>
Date:   2017-04-23T06:45:46Z

    COMPRESS-388: Fix concurrent reads performance

----


[GitHub] commons-compress issue #21: COMPRESS-388: Fix concurrent reads performance

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/commons-compress/pull/21
  
    
    [![Coverage Status](https://coveralls.io/builds/11199449/badge)](https://coveralls.io/builds/11199449)
    
    Coverage decreased (-0.02%) to 84.214% when pulling **3c8672db0a68dcd6b59c36abc3ec676b5af02f0a on kvr000:feature/COMPRESS-388-concurrent-reads-performance-fix** into **13a039029ca7d7fca9862cfb792f7148c555f05f on apache:master**.



[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

Posted by bodewig <gi...@git.apache.org>.
Github user bodewig commented on a diff in the pull request:

    https://github.com/apache/commons-compress/pull/21#discussion_r112838584
  
    --- Diff: src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java ---
    @@ -1111,14 +1122,11 @@ public int read() throws IOException {
                     }
                     return -1;
                 }
    -            synchronized (archive) {
    -                archive.position(loc++);
    -                int read = read(1);
    -                if (read < 0) {
    -                    return read;
    -                }
    -                return buffer.get() & 0xff;
    +            int read = read(loc++, 1);
    +            if (read < 0) {
    --- End diff --
    
    I think it depends on what we think `synchronized` is supposed to protect against.
    
    As it stands, the `synchronized` block also protected concurrent reads from the same `BoundedInputStream` from overwriting each other's results (it protected `loc` and `buffer`, not just the file position).
    
    If we want to keep that level of protection, the changes are not going to work. If we say that each input stream returned by `getInputStream` may only be used by a single thread, then I think it is fine to keep the increment outside of the block and only synchronise the code that guards the interaction between different streams.
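
    To make the second contract concrete, a rough sketch (assumed names, not the actual ZipFile code): one stream per thread, with only the channel shared between streams guarded.

        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.ByteBuffer;
        import java.nio.channels.SeekableByteChannel;

        // Contract: each stream instance is confined to one thread, so loc and the
        // buffer need no lock; only the channel shared between streams is guarded.
        // (A fully thread-safe stream would instead synchronise the whole read()
        // method, covering loc and the buffer as well.)
        class SingleThreadedBoundedStream extends InputStream {
            private final SeekableByteChannel archive; // shared between streams
            private final ByteBuffer single = ByteBuffer.allocate(1);
            private long loc;                          // stream-local state

            SingleThreadedBoundedStream(SeekableByteChannel archive, long start) {
                this.archive = archive;
                this.loc = start;
            }

            @Override
            public int read() throws IOException {
                single.rewind();
                synchronized (archive) {       // guards only the shared file position
                    archive.position(loc);
                    if (archive.read(single) <= 0) {
                        return -1;
                    }
                }
                loc++;                         // unguarded: single-threaded by contract
                return single.get(0) & 0xff;
            }
        }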
    



[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

Posted by sebbASF <gi...@git.apache.org>.
Github user sebbASF commented on a diff in the pull request:

    https://github.com/apache/commons-compress/pull/21#discussion_r112838915
  
    --- Diff: src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java ---
    @@ -1111,14 +1122,11 @@ public int read() throws IOException {
                     }
                     return -1;
                 }
    -            synchronized (archive) {
    -                archive.position(loc++);
    -                int read = read(1);
    -                if (read < 0) {
    -                    return read;
    -                }
    -                return buffer.get() & 0xff;
    +            int read = read(loc++, 1);
    +            if (read < 0) {
    --- End diff --
    
    OK. This needs to be clarified in the Javadoc.


[GitHub] commons-compress issue #21: COMPRESS-388: Fix concurrent reads performance

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/commons-compress/pull/21
  
    
    [![Coverage Status](https://coveralls.io/builds/11201739/badge)](https://coveralls.io/builds/11201739)
    
    Coverage increased (+0.07%) to 84.303% when pulling **029a4974f81f423c1b8805f72cec9acad3069335 on kvr000:feature/COMPRESS-388-concurrent-reads-performance-fix** into **13a039029ca7d7fca9862cfb792f7148c555f05f on apache:master**.



[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

Posted by sebbASF <gi...@git.apache.org>.
Github user sebbASF commented on a diff in the pull request:

    https://github.com/apache/commons-compress/pull/21#discussion_r112838358
  
    --- Diff: src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java ---
    @@ -1081,16 +1082,26 @@ private boolean startsWithLocalFileHeader() throws IOException {
         }
     
         /**
    +     * Creates new BoundedInputStream, according to implementation of
    +     * underlying archive channel.
    +     */
    +    private BoundedInputStream createBoundedInputStream(long start, long remaining) {
    +        return archive instanceof FileChannel ?
    +            new BoundedFileChannelInputStream(start, remaining) :
    +            new BoundedInputStream(start, remaining);
    +    }
    +
    +    /**
          * InputStream that delegates requests to the underlying
          * SeekableByteChannel, making sure that only bytes from a certain
          * range can be read.
          */
         private class BoundedInputStream extends InputStream {
    -        private static final int MAX_BUF_LEN = 8192;
    -        private final ByteBuffer buffer;
    -        private long remaining;
    -        private long loc;
    -        private boolean addDummyByte = false;
    +        protected static final int MAX_BUF_LEN = 8192;
    +        protected final ByteBuffer buffer;
    +        protected long remaining;
    +        protected long loc;
    +        protected boolean addDummyByte = false;
     
    --- End diff --
    
    Do these really need to be protected rather than private or package-protected?


[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

Posted by kvr000 <gi...@git.apache.org>.
Github user kvr000 commented on a diff in the pull request:

    https://github.com/apache/commons-compress/pull/21#discussion_r112838864
  
    --- Diff: src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java ---
    @@ -1111,14 +1122,11 @@ public int read() throws IOException {
                     }
                     return -1;
                 }
    -            synchronized (archive) {
    -                archive.position(loc++);
    -                int read = read(1);
    -                if (read < 0) {
    -                    return read;
    -                }
    -                return buffer.get() & 0xff;
    +            int read = read(loc++, 1);
    +            if (read < 0) {
    --- End diff --
    
    Correct, bodewig. Additionally, note that in these terms the original implementation wasn't thread-safe either - in the byte[] read method the incrementing of loc happens outside the synchronized block too.

    Anyway, there are two options. One is to make the full methods synchronized (a single stream would hardly ever be accessed concurrently, so keeping a smaller scope won't make any difference).

    The better option, though, would be to keep this implementation unsynchronized, as it's wrapped into a decompressing stream anyway and that stream already has to be synchronized. The only exception is the STORED compression method, where we would have to wrap the stream into a synchronized proxy (a sketch of such a proxy follows below).
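
    For the STORED case, such a synchronized proxy could look roughly like this (a sketch only, not code from this pull request):

        import java.io.FilterInputStream;
        import java.io.IOException;
        import java.io.InputStream;

        // Wraps the raw (STORED) entry stream so that, unlike the deflated case,
        // access is still serialised even without a decompressing stream on top.
        class SynchronizedInputStream extends FilterInputStream {
            SynchronizedInputStream(InputStream in) {
                super(in);
            }

            @Override
            public synchronized int read() throws IOException {
                return in.read();
            }

            @Override
            public synchronized int read(byte[] b, int off, int len) throws IOException {
                return in.read(b, off, len);
            }

            @Override
            public synchronized long skip(long n) throws IOException {
                return in.skip(n);
            }
        }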



[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

Posted by sebbASF <gi...@git.apache.org>.
Github user sebbASF commented on a diff in the pull request:

    https://github.com/apache/commons-compress/pull/21#discussion_r112838372
  
    --- Diff: src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java ---
    @@ -1111,14 +1122,11 @@ public int read() throws IOException {
                     }
                     return -1;
                 }
    -            synchronized (archive) {
    -                archive.position(loc++);
    -                int read = read(1);
    -                if (read < 0) {
    -                    return read;
    -                }
    -                return buffer.get() & 0xff;
    +            int read = read(loc++, 1);
    +            if (read < 0) {
    --- End diff --
    
    Surely the increment of loc needs to be synchronised?


[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/commons-compress/pull/21


[GitHub] commons-compress issue #21: COMPRESS-388: Fix concurrent reads performance

Posted by kvr000 <gi...@git.apache.org>.
Github user kvr000 commented on the issue:

    https://github.com/apache/commons-compress/pull/21
  
    Additionally, I was thinking about exposing each entry's raw stream starting offset and length via the public API, so that when needed one can either map it into memory or access the raw data directly (especially useful when the zip is used as a kind of flat storage, which is quite popular in games, but not only there). For me it would help to implement an off-heap read-only storage using a standard file format that is widely supported by lots of tools.

    It's quite zip specific (although it could be applied to similar containers too), but the API already has a lot of zip-specific stuff anyway... That piece of information would only have to be moved from ZipFile.Entry to ZipFileEntry. What do you think about it?
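
    As a rough illustration of how a caller might use such an offset/length pair once exposed (the accessors themselves are only proposed above and do not exist yet; the mapping below is plain FileChannel.map):

        import java.io.IOException;
        import java.nio.MappedByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.file.Path;
        import java.nio.file.StandardOpenOption;

        class RawEntryMappingSketch {
            // dataOffset and size would come from the proposed raw-offset/length API.
            static MappedByteBuffer mapStoredEntry(Path zip, long dataOffset, long size)
                    throws IOException {
                try (FileChannel ch = FileChannel.open(zip, StandardOpenOption.READ)) {
                    // Map the raw entry bytes read-only; the mapping stays valid
                    // after the channel is closed.
                    return ch.map(FileChannel.MapMode.READ_ONLY, dataOffset, size);
                }
            }
        }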


[GitHub] commons-compress issue #21: COMPRESS-388: Fix concurrent reads performance

Posted by kvr000 <gi...@git.apache.org>.
Github user kvr000 commented on the issue:

    https://github.com/apache/commons-compress/pull/21
  
    Update, improving a few things:
    - made the fields private again
    - simplified to a single read(long pos, ByteBuffer buf) method
    - allocating the instance buffer only for the single-byte read, which is rather rare (so far it seems that only Bzip2 uses it, for reading the header); all other decompressors and even the standard Java core readers use an internal cache
    - wrapping the byte array passed in as a parameter instead of creating a temporary ByteBuffer and copying the bytes (a sketch of this follows below)

    Performance improved in comparison to the previous commit:
    - 15%-20% for stored streams
    - 1%-3% for deflated streams
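
    The wrapped-array positional read mentioned in the list above might look roughly like this (a sketch; the method name and signature are assumptions, not the code in the commit):

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;

        class WrapReadSketch {
            // Wrap the caller's byte[] so the channel fills it directly --
            // no temporary ByteBuffer and no extra copy afterwards.
            static int read(FileChannel archive, long pos, byte[] b, int off, int len)
                    throws IOException {
                ByteBuffer buf = ByteBuffer.wrap(b, off, len);
                return archive.read(buf, pos); // positional read, shared position untouched
            }
        }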



[GitHub] commons-compress issue #21: COMPRESS-388: Fix concurrent reads performance

Posted by coveralls <gi...@git.apache.org>.
Github user coveralls commented on the issue:

    https://github.com/apache/commons-compress/pull/21
  
    
    [![Coverage Status](https://coveralls.io/builds/11205671/badge)](https://coveralls.io/builds/11205671)
    
    Coverage increased (+0.04%) to 84.277% when pulling **283c5910712623351139d2bf3ea7a39065a00b13 on kvr000:feature/COMPRESS-388-concurrent-reads-performance-fix** into **13a039029ca7d7fca9862cfb792f7148c555f05f on apache:master**.



[GitHub] commons-compress pull request #21: COMPRESS-388: Fix concurrent reads perfor...

Posted by kvr000 <gi...@git.apache.org>.
Github user kvr000 commented on a diff in the pull request:

    https://github.com/apache/commons-compress/pull/21#discussion_r112838711
  
    --- Diff: src/main/java/org/apache/commons/compress/archivers/zip/ZipFile.java ---
    @@ -1081,16 +1082,26 @@ private boolean startsWithLocalFileHeader() throws IOException {
         }
     
         /**
    +     * Creates new BoundedInputStream, according to implementation of
    +     * underlying archive channel.
    +     */
    +    private BoundedInputStream createBoundedInputStream(long start, long remaining) {
    +        return archive instanceof FileChannel ?
    +            new BoundedFileChannelInputStream(start, remaining) :
    +            new BoundedInputStream(start, remaining);
    +    }
    +
    +    /**
          * InputStream that delegates requests to the underlying
          * SeekableByteChannel, making sure that only bytes from a certain
          * range can be read.
          */
         private class BoundedInputStream extends InputStream {
    -        private static final int MAX_BUF_LEN = 8192;
    -        private final ByteBuffer buffer;
    -        private long remaining;
    -        private long loc;
    -        private boolean addDummyByte = false;
    +        protected static final int MAX_BUF_LEN = 8192;
    +        protected final ByteBuffer buffer;
    +        protected long remaining;
    +        protected long loc;
    +        protected boolean addDummyByte = false;
     
    --- End diff --
    
    They can be package-protected too; it's in a private class anyway. The fields are accessed by the FileChannel specialization, so private won't suffice.


