Posted to issues@commons.apache.org by "Michael Osipov (Jira)" <ji...@apache.org> on 2022/05/13 13:01:00 UTC
[jira] [Commented] (COMPRESS-620) ArchiveInputStream fails reading filenames with ANSI characters
[ https://issues.apache.org/jira/browse/COMPRESS-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536631#comment-17536631 ]
Michael Osipov commented on COMPRESS-620:
-----------------------------------------
Though I am not a Commons Compress developer, this looks like a bug in Commons Compress to me. Let's analyze:
The offending entry:
{noformat}
10E92 00004 50 4B 01 02 CENTRAL HEADER #10 02014B50
10E96 00001 0B Created Zip Spec 0B '1.1'
10E97 00001 00 Created OS 00 'MS-DOS'
10E98 00001 0A Extract Zip Spec 0A '1.0'
10E99 00001 00 Extract OS 00 'MS-DOS'
10E9A 00002 00 00 General Purpose Flag 0000
[Bit 1] 0 '4k Sliding Dictionary'
[Bit 2] 0 '2 Shannon-Fano Trees'
10E9C 00002 06 00 Compression Method 0006 'Imploded'
10E9E 00004 EE 40 79 19 Last Mod Time 197940EE 'Wed Nov 25 08:07:28 1992'
10EA2 00004 47 B9 D7 53 CRC 53D7B947
10EA6 00004 BE 08 00 00 Compressed Length 000008BE
10EAA 00004 4F 5E 00 00 Uncompressed Length 00005E4F
10EAE 00002 09 00 Filename Length 0009
10EB0 00002 00 00 Extra Length 0000
10EB2 00002 00 00 Comment Length 0000
10EB4 00002 00 00 Disk Start 0000
10EB6 00002 00 00 Int File Attributes 0000
[Bit 0] 0 'Binary Data'
10EB8 00004 20 00 00 00 Ext File Attributes 00000020
[Bit 5] Archive
10EBC 00004 16 C6 00 00 Local Header Offset 0000C616
10EC0 00009 41 F9 43 F9 Filename 'A▒C▒E.ANS'
45 2E 41 4E
53
{noformat}
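For readers who want to check the dump by hand, the following is a minimal Java sketch that reads a few fields back out of the 46 fixed bytes of this central header. The class and method names ({{CentralHeaderDump}}, {{wrap}}, and so on) are hypothetical and exist only for this illustration; the byte values and field offsets are taken from the dump above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of Commons Compress.
final class CentralHeaderDump {
    // The 46 fixed bytes of central header #10, copied from the dump above.
    static final int[] HEADER = {
        0x50, 0x4B, 0x01, 0x02, 0x0B, 0x00, 0x0A, 0x00, 0x00, 0x00, 0x06, 0x00,
        0xEE, 0x40, 0x79, 0x19, 0x47, 0xB9, 0xD7, 0x53, 0xBE, 0x08, 0x00, 0x00,
        0x4F, 0x5E, 0x00, 0x00, 0x09, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x20, 0x00, 0x00, 0x00, 0x16, 0xC6, 0x00, 0x00
    };

    /** Wraps the header bytes in a little-endian buffer (ZIP fields are little-endian). */
    static ByteBuffer wrap() {
        byte[] raw = new byte[HEADER.length];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) HEADER[i];
        }
        return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    }

    static int signature(ByteBuffer h)          { return h.getInt(0); }
    static int generalPurposeFlag(ByteBuffer h) { return h.getShort(8) & 0xFFFF; }
    static int compressionMethod(ByteBuffer h)  { return h.getShort(10) & 0xFFFF; }
    static int fileNameLength(ByteBuffer h)     { return h.getShort(28) & 0xFFFF; }
    static long localHeaderOffset(ByteBuffer h) { return h.getInt(42) & 0xFFFFFFFFL; }

    /** Bit 11 (mask 0x0800) of the general purpose flag marks UTF-8 names. */
    static boolean usesUtf8Names(ByteBuffer h) {
        return (generalPurposeFlag(h) & 0x0800) != 0;
    }

    public static void main(String[] args) {
        ByteBuffer h = wrap();
        System.out.printf("signature=%08X flag=%04X method=%d nameLen=%d utf8=%b%n",
                signature(h), generalPurposeFlag(h), compressionMethod(h),
                fileNameLength(h), usesUtf8Names(h));
    }
}
```

The point of interest is the general purpose flag at offset 8: it is 0000 for this entry, so bit 11 is not set.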
From the ZIP APPNOTE:
{quote}
APPENDIX D - Language Encoding (EFS)
------------------------------------
D.1 The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437. This limits storing
file name characters to only those within the original MS-DOS range of values
and does not properly support file names in other character encodings, or
languages. To address this limitation, this specification will support the
following change.
D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform
to the original ZIP character encoding. If general purpose bit 11 is set, the
filename and comment MUST support The Unicode Standard, Version 4.1.0 or
greater using the character encoding form defined by the UTF-8 storage
specification. The Unicode Standard is published by the The Unicode
Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
is expected to not include a byte order mark (BOM).
{quote}
Bit 11 is not set, so we must assume CP437 here. As far as I can tell, the file is correct and not defective. BTW, there is NO ANSI encoding; ANSI is an American standards institute. Please be precise.
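To illustrate the effect of the charset choice, here is a small Java sketch that decodes the nine filename bytes from the dump once as CP437 and once as UTF-8. The class and method names are hypothetical. It also assumes the runtime ships the {{IBM437}} charset, which is the case for a full JDK but may not hold for a stripped-down runtime image.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Commons Compress.
final class ZipNameDecoding {
    // Bit 11 of the general purpose flag: names and comments are UTF-8.
    static final int UTF8_NAMES_FLAG = 1 << 11;

    // Filename bytes of the offending entry, copied from the dump.
    static final byte[] NAME = {
        0x41, (byte) 0xF9, 0x43, (byte) 0xF9, 0x45, 0x2E, 0x41, 0x4E, 0x53
    };

    /** Picks UTF-8 if bit 11 is set, otherwise the given legacy charset. */
    static Charset nameCharset(int generalPurposeFlag, Charset legacy) {
        return (generalPurposeFlag & UTF8_NAMES_FLAG) != 0
                ? StandardCharsets.UTF_8
                : legacy;
    }

    /** Decodes a raw name, falling back to CP437 when bit 11 is unset. */
    static String decodeName(byte[] raw, int generalPurposeFlag) {
        // Assumption: IBM437 is available; Charset.forName throws otherwise.
        Charset cp437 = Charset.forName("IBM437");
        return new String(raw, nameCharset(generalPurposeFlag, cp437));
    }

    public static void main(String[] args) {
        // Flag 0x0000 as in the dump: CP437 maps 0xF9 to U+2219.
        System.out.println(decodeName(NAME, 0x0000));
        // Forcing UTF-8 instead: 0xF9 is not valid UTF-8 and gets replaced.
        System.out.println(new String(NAME, StandardCharsets.UTF_8));
    }
}
```

With CP437 the byte 0xF9 maps to U+2219, which matches the name A∙C∙E.ANS given in the issue description. Decoded as UTF-8, the same byte is malformed input and is substituted, so the original name is lost.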
Now the faulty code [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L306]:
{code:java}
306 final GeneralPurposeBit gpFlag = GeneralPurposeBit.parse(lfhBuf, off);
307 final boolean hasUTF8Flag = gpFlag.usesUTF8ForNames();
308 final ZipEncoding entryEncoding = hasUTF8Flag ? ZipEncodingHelper.UTF8_ZIP_ENCODING : zipEncoding;
309 current.hasDataDescriptor = gpFlag.usesDataDescriptor();
310 current.entry.setGeneralPurposeBit(gpFlag);
{code}
Unless you specify {{zipEncoding}}, it defaults to UTF-8 [here|https://commons.apache.org/proper/commons-compress/xref/org/apache/commons/compress/archivers/zip/ZipArchiveInputStream.html#L187]:
{code:java}
187 public ZipArchiveInputStream(final InputStream inputStream) {
188 this(inputStream, ZipEncodingHelper.UTF8);
189 }
{code}
Although the note says SHOULD, I would still expect CP437 here; for UTF-8 there is bit 11. Anything else is nonsense.
This deviation is not documented, which is just bad.
> ArchiveInputStream fails reading filenames with ANSI characters
> ---------------------------------------------------------------
>
> Key: COMPRESS-620
> URL: https://issues.apache.org/jira/browse/COMPRESS-620
> Project: Commons Compress
> Issue Type: Bug
> Components: Archivers
> Affects Versions: 1.21
> Reporter: Avi
> Priority: Major
>
> I attempted to extract ANSI art packs from [SixteenColors ANSI archive|https://github.com/sixteencolors/sixteencolors-archive] but many of them fail.
>
> Upon some debugging it appears that as many of the file names contain ANSI characters which are parsed by the ArchiveInputStream as question marks, the file fails to be saved to disk as question mark is a bad character to be had in a filename.
> Specific code:
> ArchiveInputStream archiveInputStream = archiveStreamFactory.createArchiveInputStream(ArchiveStreamFactory.ZIP, inputStream);
> ArchiveEntry archiveEntry = null;
> while((archiveEntry = archiveInputStream.getNextEntry()) != null) {
> Path path = Paths.get(extractDirectory, archiveEntry.getName());
> example of a non parseable filename in an archive:
> https://github.com/sixteencolors/sixteencolors-archive/blob/master/1992/ace-r%232.zip
> A∙C∙E.ANS
> Bad ZIP file example:
--
This message was sent by Atlassian Jira
(v8.20.7#820007)