You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by GitBox <gi...@apache.org> on 2020/01/11 02:36:31 UTC

[GitHub] [commons-compress] PeterAlfreadLee opened a new pull request #90: Compress-477 : add zip64 support for split zip

PeterAlfreadLee opened a new pull request #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90
 
 
   As it's said in [#86](https://github.com/apache/commons-compress/pull/86)
   
   > While rebasing I realized that master says the disk number start is a long while this branch uses an int. Actually both master and this branch neglect the possibility of the number of disks requiring Zip64 extra handling. TBH I find the idea of a split ZIP archive that spans more than 64k files a bit disturbing and would address it with a separate change independent of this PR.
   
   This PR is about the zip64 for split zip and some bug fixing in existing code.
   
   1. For Central File Header
   As it's written in Zip Specification :
   
   > 4.4.13 disk number start: (2 bytes)
          The number of the disk on which this file begins.  If an 
          archive is in ZIP64 format and the value in this field is 
          0xFFFF, the size will be in the corresponding 4 byte zip64 
          extended information extra field.
   
   Which means the `disk number start` may also exceed the maximum value `0xFFFF` and may exist in extra field. The existing code in `createCentralFileHeader` ignored this possibility and didn't detect it.
   That's why I added `|| ze.getDiskNumberStart() >= ZIP64_MAGIC_SHORT` for this.
   
   And some testcases in `Zip64SupportIT` are modified correspodingly :
   (1) `extra field length` : 28 -> 32
   (2) `size of extra` : 24 -> 28
   
   2. For End Of Central Directory :
   There are 6 variables that may exist in Zip64 End Of Central Directory and need to be checked :
   ```
   numberOfThisDisk
   cdDiskNumberStart
   numOfEntriesOnThisDisk
   numberOfEntries/entries.size()
   cdLength
   cdOffset
   ```
   The existing code only checked the `numberOfEntries/entries.size()` and `cdOffset`. So the check for the other 4 variables are added in this PR.
   
   3. Some testcases for zip64 split zip are added(which creates 70,000+ split segments).
   
   When I was writing the testcases, I found it's not easy to test the Zip64 exceptions in End Of Central Directory - the exceptions are always thrown when writing the Central Directory Header. So the newly added exceptions are not tested - I didn't find a proper way to test them.
   
   P.S :
   When I was trying to extract the 70,000+ split segments using `ZipSplitReadOnlySeekableByteChannel`, I got an error : `too many open files`. I realized that the `ZipSplitReadOnlySeekableByteChannel` and `MultiReadOnlySeekableByteChannel` may open all the split segments before reading them - and I think this is meaningless and could be improved. I'm trying to improve them using a 'open files when needed' way. Therefore we don't need to open all the files before reading, which may use a lot of file handles of OS, and we can extract split zips that contain a lot of split segments.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-576982892
 
 
   Thank you for your review.
   I'll push the "open when actually needed" ASAP.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] bodewig commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
bodewig commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-576782503
 
 
   > I'm not sure if this(open file when it's needed to be read) is needed.
   
   For `ZipArchiveInputStream` I wouldn't expect "open when actually needed" would slow down things, so it would really be good to have it. 
   
   In `ZipFile` where we are performing random access reads this may hurt us, but might be worked around with caching a limited number of open files, I guess. If this cache allows for five or ten segments to be open at a time, then this should be sufficiently fast for all real-world archives.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] coveralls commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
coveralls commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-573272184
 
 
   
   [![Coverage Status](https://coveralls.io/builds/28041870/badge)](https://coveralls.io/builds/28041870)
   
   Coverage decreased (-0.04%) to 87.032% when pulling **27fd2033d1218a652f09f59aa3d52db2b4d839ab on PeterAlfreadLee:COMPRESS-477-zip64** into **f16125435723d143183c9c45fd3a0f92bdf97ba3 on apache:master**.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] bodewig commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
bodewig commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-581135612
 
 
   This is not really a new issue introduced with the split zips feature, `MultiReadOnlySeekableByteChannel` has been keeping channels open before as well.
   
   Honestly I haven't got any idea of how much "jumping around" reading split archives (zip or 7z) actually involves. In both cases we read the channels containing file meta data once and will likely never go back to them. So a small number of open channels may be sufficient. OTOH I expect most split archives to only have a pretty small number of parts, an archive with more than 100 parts is something that I don't expect to be common. Therefore making sure we don't open too many streams is probably not that important at all.
   
   I realize I'm not answering your question :-)
   
   I'd make the number of simultaneously opened streams configurable. "infinite" might even be a good default so that people only need to deal with it explicitly if they run into trouble. But in real life situations I'd expect any number bigger than say 20 to have the same effect as infinity (i.e. all channels are open as there are no more than 20 channels anyway).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] coveralls edited a comment on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
coveralls edited a comment on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-573272184
 
 
   
   [![Coverage Status](https://coveralls.io/builds/28041870/badge)](https://coveralls.io/builds/28041870)
   
   Coverage decreased (-0.04%) to 87.029% when pulling **27fd2033d1218a652f09f59aa3d52db2b4d839ab on PeterAlfreadLee:COMPRESS-477-zip64** into **f16125435723d143183c9c45fd3a0f92bdf97ba3 on apache:master**.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] bodewig commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
bodewig commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-576798783
 
 
   Many thanks @PeterAlfreadLee I've just pushed a small change to the cleanup part of the additional test reusing logic that can be found in `AbstractTestCase` already.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] asfgit merged pull request #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
asfgit merged pull request #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-581281141
 
 
   > This is not really a new issue introduced with the split zips feature, MultiReadOnlySeekableByteChannel has been keeping channels open before as well.
   
   Yes, you're right. This is not a new issue with split zips. And as we all know that `MultiReadOnlySeekableByteChannel` is also used in 7zip split segments, I believe it's also affected by this.
   
   > OTOH I expect most split archives to only have a pretty small number of parts, an archive with more than 100 parts is something that I don't expect to be common. Therefore making sure we don't open too many streams is probably not that important at all.
   
   I agree. Actually I never meet any split zips with more than 20 segments. This is nothing common but only a special use case that's introduced in the zip specification.
   
   I agree that it's a extremely special use case and may not be used in most cases. Actually I have already finished this. So I will push the PR and it's up to you to decide whether to merge it or not. @bodewig 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-579099401
 
 
   > In ZipFile where we are performing random access reads this may hurt us, but might be worked around with caching a limited number of open files, I guess. If this cache allows for five or ten segments to be open at a time, then this should be sufficiently fast for all real-world archives.
   
   I'm thinking about setting a threshold for the number of split segments. Therefore we can open the file channels when actually needed.
   Do you think this is a good idea? If so, what the threshold value should be a proper value? Currently I'm setting the value as 100, as the default maximum file handles for most Linux-like OS is 1024. WDYT?
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [commons-compress] PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip

Posted by GitBox <gi...@apache.org>.
PeterAlfreadLee commented on issue #90: Compress-477 : add zip64 support for split zip
URL: https://github.com/apache/commons-compress/pull/90#issuecomment-573317729
 
 
   > When I was trying to extract the 70,000+ split segments using ZipSplitReadOnlySeekableByteChannel, I got an error : too many open files. I realized that the ZipSplitReadOnlySeekableByteChannel and MultiReadOnlySeekableByteChannel may open all the split segments before reading them - and I think this is meaningless and could be improved. I'm trying to improve them using a 'open files when needed' way. Therefore we don't need to open all the files before reading, which may use a lot of file handles of OS, and we can extract split zips that contain a lot of split segments.
   
   I have finished this and have tested it(even through it's pretty time consuming when there're a lot split segments).
   I'm not sure if this(open file when it's needed to be read) is needed. What do you think? @bodewig 
   If this is needed, I'm not sure if I should push another seprate PR when this PR is merged(as zip64 split is needed to test it), or I can just push the code in this PR?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services