Posted to dev@tika.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/04/28 03:08:00 UTC

[jira] [Commented] (TIKA-3374) Non-Unicode archive entry name is garbled

    [ https://issues.apache.org/jira/browse/TIKA-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334428#comment-17334428 ] 

ASF GitHub Bot commented on TIKA-3374:
--------------------------------------

Ryan421 opened a new pull request #433:
URL: https://github.com/apache/tika/pull/433


   Fixes #TIKA-3374



> Non-Unicode archive entry name is garbled
> -----------------------------------------
>
>                 Key: TIKA-3374
>                 URL: https://issues.apache.org/jira/browse/TIKA-3374
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.26
>            Reporter: Ryan Liu
>            Priority: Major
>         Attachments: gbk.zip
>
>
> PackageParser retrieves each archive entry name through the commons-compress ArchiveEntry#getName method and has no automatic charset detection for entry names.
> Although one could set the encoding by passing an ArchiveStreamFactory(charset) into the parser context,
> this is not practical, since any charset could be used in an archive file.
> Instead of calling entry.getName() directly in PackageParser#parseEntry(),
> it is recommended to use entry.getRawName() and apply charset detection, which reduces the chance of getting a garbled string.
>  
> The attachment is an example of a non-Unicode archive entry name being used in a zip file.
> The filename in the zip file should be *集团邮件审计系统2021年自动巡检需求文档_V4.0.doc*
> but is garbled in Tika 1.26 because PackageParser treats it as Unicode.
>  
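
The approach described above (decode the raw entry-name bytes instead of trusting getName()) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the class EntryNameDecoder and its decode method are hypothetical names, the raw bytes are assumed to come from something like commons-compress's ZipArchiveEntry#getRawName(), and a fixed candidate list with strict decoding stands in for real statistical charset detection. It also assumes the JVM ships the GBK charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Tika or commons-compress.
final class EntryNameDecoder {

    // Tries each candidate charset in order and returns the first strict decode
    // that succeeds. A strict decoder reports malformed input instead of
    // silently substituting replacement characters.
    static String decode(byte[] rawName, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                return cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(rawName))
                        .toString();
            } catch (CharacterCodingException e) {
                // Bytes are not valid in this charset; try the next candidate.
            }
        }
        // Last resort: ISO-8859-1 maps every byte to a char, so no data is lost.
        return new String(rawName, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: GBK is available in this JVM (it is in standard JDK builds).
        Charset gbk = Charset.forName("GBK");
        // Simulate a raw entry name as it would be stored by a GBK-based zip tool.
        byte[] raw = "\u96C6\u56E2\u90AE\u4EF6.doc".getBytes(gbk);
        System.out.println(decode(raw, List.of(StandardCharsets.UTF_8, gbk)));
    }
}
```

Trying UTF-8 first is a deliberate choice in this sketch: GBK-encoded Chinese text is rarely valid UTF-8, so a strict UTF-8 decode usually fails and the fallback is reached. The ordering is a heuristic only, and short or ambiguous names can still be misidentified.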


