Posted to dev@tika.apache.org by "Ryan Liu (Jira)" <ji...@apache.org> on 2021/04/28 02:23:00 UTC

[jira] [Created] (TIKA-3374) Non-Unicode archive entry name is garbled

Ryan Liu created TIKA-3374:
------------------------------

             Summary: Non-Unicode archive entry name is garbled
                 Key: TIKA-3374
                 URL: https://issues.apache.org/jira/browse/TIKA-3374
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.26
         Environment: The attachment is an example of a zip file that uses a non-Unicode archive entry name.

The filename in the zip file should be 集团邮件审计系统2021年自动巡检需求文档_V4.0.doc

but it is garbled in Tika 1.26 because the PackageParser treats it as Unicode.
            Reporter: Ryan Liu
         Attachments: gbk.zip

PackageParser retrieves archive entry names through the commons-compress ArchiveEntry#getName method and does not perform automatic charset detection on them.
 Although one could set the encoding by passing an ArchiveStreamFactory(charset) through the parser context,
 this is not practical, since any charset could be used in an archive file.
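
To illustrate the symptom, here is a minimal sketch using only the JDK (no Tika or commons-compress classes). It assumes the entry name bytes are GBK, as the attachment name gbk.zip suggests, and that they end up decoded as UTF-8. The class name GarbledNameDemo and the shortened sample name are hypothetical, chosen only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GarbledNameDemo {

    // Decodes raw entry-name bytes with one fixed charset. Malformed byte
    // sequences are silently replaced with U+FFFD, which is what makes the
    // resulting name look garbled.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // A shortened stand-in for the name in the report, written with
        // Unicode escapes so the source compiles under any file encoding.
        String original = "\u9700\u6c42\u6587\u6863.doc";
        byte[] raw = original.getBytes(gbk);

        // Wrong charset: the GBK bytes are not valid UTF-8.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals(original)); // false
        // Right charset: the name round-trips.
        System.out.println(decode(raw, gbk).equals(original)); // true
    }
}
```

A single fixed charset only helps when every archive that is parsed happens to use it, which is the limitation described above.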

Instead of directly calling entry.getName() in PackageParser#parseEntry(), it is recommended to use entry.getRawName() and apply charset detection to the raw bytes, which reduces the chance of producing a garbled string.
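
A rough sketch of the idea, again JDK-only: given the raw name bytes (such as those getRawName() would return), try a list of candidate charsets with strict decoding and take the first that decodes cleanly. This is a simple stand-in for real charset detection, not the proposed Tika change; the class EntryNameDecoder, its methods, and the candidate list are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class EntryNameDecoder {

    // Decodes strictly: throws instead of substituting U+FFFD when the
    // bytes are not valid in the given charset.
    static String decodeStrict(byte[] raw, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    // Returns the name decoded with the first candidate that accepts the
    // bytes; falls back to lossy UTF-8 decoding if none does.
    static String decodeName(byte[] raw, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                return decodeStrict(raw, candidate);
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Assumed candidate order: UTF-8 first because it is strict enough
        // to reject most non-UTF-8 input, then GBK.
        List<Charset> candidates = Arrays.asList(StandardCharsets.UTF_8, gbk);
        byte[] raw = "\u9700\u6c42\u6587\u6863.doc".getBytes(gbk);
        System.out.println(decodeName(raw, candidates));
    }
}
```

Trying charsets in order can still pick the wrong one when the bytes are valid in several legacy encodings, so a statistical detector would be the more robust choice for the actual fix.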



--
This message was sent by Atlassian Jira
(v8.3.4#803005)