Posted to issues@commons.apache.org by "Michael Osipov (Jira)" <ji...@apache.org> on 2022/05/13 13:02:00 UTC

[jira] [Comment Edited] (COMPRESS-620) ArchiveInputStream fails reading filenames with ANSI characters

    [ https://issues.apache.org/jira/browse/COMPRESS-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536631#comment-17536631 ] 

Michael Osipov edited comment on COMPRESS-620 at 5/13/22 1:01 PM:
------------------------------------------------------------------

Although I am not a Commons Compress developer, this looks like a bug in Commons Compress to me. Let's analyze:

The offending entry:
{noformat}
10E92 00004 50 4B 01 02 CENTRAL HEADER #10    02014B50
10E96 00001 0B          Created Zip Spec      0B '1.1'
10E97 00001 00          Created OS            00 'MS-DOS'
10E98 00001 0A          Extract Zip Spec      0A '1.0'
10E99 00001 00          Extract OS            00 'MS-DOS'
10E9A 00002 00 00       General Purpose Flag  0000
                        [Bit 1]               0 '4k Sliding Dictionary'
                        [Bit 2]               0 '2 Shannon-Fano         Trees'
10E9C 00002 06 00       Compression Method    0006 'Imploded'
10E9E 00004 EE 40 79 19 Last Mod Time         197940EE 'Wed Nov 25 08:07:28 1992'
10EA2 00004 47 B9 D7 53 CRC                   53D7B947
10EA6 00004 BE 08 00 00 Compressed Length     000008BE
10EAA 00004 4F 5E 00 00 Uncompressed Length   00005E4F
10EAE 00002 09 00       Filename Length       0009
10EB0 00002 00 00       Extra Length          0000
10EB2 00002 00 00       Comment Length        0000
10EB4 00002 00 00       Disk Start            0000
10EB6 00002 00 00       Int File Attributes   0000
                        [Bit 0]               0 'Binary Data'
10EB8 00004 20 00 00 00 Ext File Attributes   00000020
                        [Bit 5]               Archive
10EBC 00004 16 C6 00 00 Local Header Offset   0000C616
10EC0 00009 41 F9 43 F9 Filename              'A▒C▒E.ANS'
            45 2E 41 4E
            53
{noformat}

From the ZIP note:
{quote}
APPENDIX D - Language Encoding (EFS)
------------------------------------

D.1 The ZIP format has historically supported only the original IBM PC character 
encoding set, commonly referred to as IBM Code Page 437.  This limits storing 
file name characters to only those within the original MS-DOS range of values 
and does not properly support file names in other character encodings, or 
languages. To address this limitation, this specification will support the 
following change. 

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform 
to the original ZIP character encoding.  If general purpose bit 11 is set, the 
filename and comment MUST support The Unicode Standard, Version 4.1.0 or 
greater using the character encoding form defined by the UTF-8 storage 
specification.  The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org).  UTF-8 encoded data stored within ZIP files 
is expected to not include a byte order mark (BOM).
{quote}

Bit 11 is not set, so we should assume CP437 here. To me, the file is correct and not defective. BTW, there is no "ANSI" encoding; ANSI is an American standards institute. Please be precise.
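
To make the point concrete, here is a minimal sketch using only the JDK (not Commons Compress code). It takes the nine file name bytes and the general purpose flag from the dump above and decodes the name once as CP437 and once as UTF-8. The class and method names are made up for illustration, and it assumes the runtime provides the {{IBM437}} charset, which common JDK builds ship as part of the extended charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Commons Compress.
final class FilenameDecoding {
    // File name bytes copied from the central header dump (offset 10EC0).
    static final byte[] NAME_BYTES = {
        0x41, (byte) 0xF9, 0x43, (byte) 0xF9, 0x45, 0x2E, 0x41, 0x4E, 0x53
    };

    // General purpose flag copied from the dump (offset 10E9A).
    static final int GENERAL_PURPOSE_FLAG = 0x0000;

    // Bit 11 (mask 0x0800) is the language encoding flag (EFS) from Appendix D.
    static boolean usesUtf8(int generalPurposeFlag) {
        return (generalPurposeFlag & 0x0800) != 0;
    }

    // Decodes raw name bytes; malformed input is replaced, not rejected.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: the IBM437 charset is available in this runtime.
        Charset cp437 = Charset.forName("IBM437");
        System.out.println("bit 11 set: " + usesUtf8(GENERAL_PURPOSE_FLAG));
        System.out.println("as CP437:   " + decode(NAME_BYTES, cp437));
        System.out.println("as UTF-8:   " + decode(NAME_BYTES, StandardCharsets.UTF_8));
    }
}
```

CP437 maps byte 0xF9 to U+2219 (bullet operator), which gives the name the reporter expects, A∙C∙E.ANS. As UTF-8, 0xF9 is not a valid byte, so each occurrence decodes to the replacement character U+FFFD.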

Now the faulty code [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L306]:
{code:java}
306         final GeneralPurposeBit gpFlag = GeneralPurposeBit.parse(lfhBuf, off);
307         final boolean hasUTF8Flag = gpFlag.usesUTF8ForNames();
308         final ZipEncoding entryEncoding = hasUTF8Flag ? ZipEncodingHelper.UTF8_ZIP_ENCODING : zipEncoding;
309         current.hasDataDescriptor = gpFlag.usesDataDescriptor();
310         current.entry.setGeneralPurposeBit(gpFlag);
{code}

Unless you specify {{zipEncoding}}, it defaults to UTF-8 [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L187]:
{code:java}
187     public ZipArchiveInputStream(final InputStream inputStream) {
188         this(inputStream, ZipEncodingHelper.UTF8);
189     }
{code}

Although the note says SHOULD, I would still expect CP437 here; for UTF-8 there is bit 11. Anything else is nonsense.

This deviation is not documented, which is just bad.
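
The behavior I would expect can be sketched as follows. This is only an illustration of the selection rule from Appendix D, not the actual or proposed Commons Compress implementation: {{EntryCharsetSelection}} and {{select}} are hypothetical names, a {{null}} argument stands for "the caller did not specify an encoding", and the {{IBM437}} charset is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Commons Compress.
final class EntryCharsetSelection {
    // Mask for general purpose bit 11, the language encoding flag (EFS).
    static final int EFS_MASK = 1 << 11;

    // Picks the charset for one entry's file name and comment.
    // configured == null means the caller did not choose an encoding.
    static Charset select(int generalPurposeFlag, Charset configured) {
        if ((generalPurposeFlag & EFS_MASK) != 0) {
            // Bit 11 set: the name and comment are UTF-8.
            return StandardCharsets.UTF_8;
        }
        if (configured != null) {
            // An explicit choice by the caller wins for entries without bit 11.
            return configured;
        }
        // Bit 11 unset and nothing configured: original ZIP encoding.
        // Assumption: the IBM437 charset is available in this runtime.
        return Charset.forName("IBM437");
    }

    public static void main(String[] args) {
        System.out.println(select(0x0000, null));
        System.out.println(select(0x0800, null));
        System.out.println(select(0x0000, StandardCharsets.ISO_8859_1));
    }
}
```

Until the default changes, the one-argument constructor shown above delegates to a constructor that takes an encoding, so passing the encoding explicitly should sidestep the UTF-8 default for entries without bit 11.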


> ArchiveInputStream fails reading filenames with ANSI characters
> ---------------------------------------------------------------
>
>                 Key: COMPRESS-620
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-620
>             Project: Commons Compress
>          Issue Type: Bug
>          Components: Archivers
>    Affects Versions: 1.21
>            Reporter: Avi
>            Priority: Major
>
> I attempted to extract ANSI art packs from [SixteenColors ANSI archive|https://github.com/sixteencolors/sixteencolors-archive] but many of them fail.
>  
> After some debugging, it appears that many of the file names contain ANSI characters, which the ArchiveInputStream parses as question marks. Saving the file to disk then fails because a question mark is not a valid character in a file name.
> Specific code:
> ArchiveInputStream archiveInputStream = archiveStreamFactory.createArchiveInputStream(ArchiveStreamFactory.ZIP, inputStream);
> ArchiveEntry archiveEntry = null;
> while((archiveEntry = archiveInputStream.getNextEntry()) != null) {
> Path path = Paths.get(extractDirectory, archiveEntry.getName());
> Bad ZIP file example:
> https://github.com/sixteencolors/sixteencolors-archive/blob/master/1992/ace-r%232.zip
> Example of a non-parseable file name in that archive:
> A∙C∙E.ANS


