Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2014/04/18 16:22:15 UTC

[jira] [Updated] (TIKA-936) encoding of ZipArchiveInputStream

     [ https://issues.apache.org/jira/browse/TIKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-936:
-------------------------------

    Description: 
When Tika extracts a zip file that was created on Windows (Japanese), the file names extracted from the zip are garbled.

ZipArchiveInputStream has three constructors. With the modification below, which specifies SJIS as the encoding, the file names were no longer garbled.

{code:title=PackageExtractor|borderStyle=solid}
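public void parse(InputStream stream)
 :
 // Original call: no encoding is given for the entry names.
 //unpack(new ZipArchiveInputStream(stream), xhtml);
 // Modified call: decode entry names as SJIS and honor Unicode extra fields.
 unpack(new ZipArchiveInputStream(stream, "SJIS", true), xhtml);
 :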
{code}
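
The effect of naming the encoding explicitly can be tried without Tika or Commons Compress. The following is only a sketch that uses the JDK's own {{java.util.zip}} classes as a stand-in; the class name {{ZipNameEncodingSketch}} and its helper methods are made up for illustration, and it assumes the JDK in use ships the Shift_JIS charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameEncodingSketch {

    // Builds an in-memory zip whose entry names are stored in the given charset.
    static byte[] zipWithNames(Charset nameCharset, String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, nameCharset)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(1); // one byte of dummy content
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Lists the entry names, decoding them with the charset the caller supplies.
    static List<String> entryNames(byte[] zipBytes, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        Charset sjis = Charset.forName("Shift_JIS");
        // Unicode escapes for a Japanese file name, so the source stays ASCII.
        String name = "\u65E5\u672C\u8A9E\u30E1\u30E2.txt";
        byte[] zip = zipWithNames(sjis, name);
        // Reading with the same charset that was used for writing recovers the name.
        System.out.println(entryNames(zip, sjis).contains(name));
    }
}
```

Reading the archive with the charset it was written with returns the original name; the caller has to know that charset, which is why a per-file parameter is being requested here.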

The first constructor uses UTF-8. In my case the encoding of my computer is UTF-8 and the encoding of the zip file is SJIS, so the file names were garbled. File names will be garbled whenever the encoding used by this constructor differs from the encoding of the zip file.
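
The mismatch can be reproduced with plain strings, independent of any zip library. This is a minimal sketch; the class name {{GarbledNameSketch}} is hypothetical, the file name is just an example, and it assumes the JDK provides the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbledNameSketch {

    // Decodes the raw name bytes the way a zip reader would for a given charset.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes for a Japanese file name, so the source stays ASCII.
        String original = "\u65E5\u672C\u8A9E\u30E1\u30E2.txt";
        Charset sjis = Charset.forName("Shift_JIS");

        // Bytes as a tool using SJIS would store them in the archive.
        byte[] stored = original.getBytes(sjis);

        // Same charset on both sides: the name survives.
        System.out.println(decode(stored, sjis).equals(original));
        // SJIS bytes read as UTF-8: the name does not survive.
        System.out.println(decode(stored, StandardCharsets.UTF_8).equals(original));
    }
}
```

The SJIS bytes are not valid UTF-8, so the second decode replaces them with substitute characters, which is the garbling described above.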

I would like Tika to accept an encoding parameter per file when parsing zip archives. Where should the encoding be given: somewhere in Metadata or in ParseContext? Please support this. I am using Tika via Solr (SolrCell), so when posting a zip file to Solr I want to add an encoding parameter to the request.

       Assignee: Jukka Zitting
     Issue Type: Improvement  (was: Wish)

In revision 1588474 I made it possible to pass a customized {{ArchiveStreamFactory}} instance through the parse context. The required client code looks like this:

{code}
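// Decode archive entry names as SJIS unless the archive itself says otherwise.
ArchiveStreamFactory factory = new ArchiveStreamFactory();
factory.setEntryEncoding("SJIS");
// Pass the customized factory to the parser through the parse context.
context.set(ArchiveStreamFactory.class, factory);
parser.parse(..., context);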
{code}

See also the test case I added in {{ZipParserTest}}.

Note that this feature also applies to the other archive types supported by Commons Compress. In addition, if the UTF-8 flag of a particular zip file is set, the given encoding is ignored and UTF-8 is used to decode the entry names within that zip file.
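
The precedence rule for the UTF-8 flag can be sketched as follows. This is not the Commons Compress code, only an illustration of the rule; the class and method names are hypothetical. It relies on the zip format's convention that bit 11 of an entry's general purpose bit flag marks the entry name as UTF-8, and it assumes the JDK provides the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EntryNameCharsetSketch {

    // Bit 11 of the general purpose bit flag: entry name and comment are UTF-8.
    static final int UTF8_NAMES_FLAG = 1 << 11;

    // Picks the charset for one entry: the flag wins over the configured encoding.
    static Charset entryNameCharset(int generalPurposeFlags, Charset configured) {
        if ((generalPurposeFlags & UTF8_NAMES_FLAG) != 0) {
            return StandardCharsets.UTF_8;
        }
        return configured;
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // Flag set: the configured SJIS encoding is ignored.
        System.out.println(entryNameCharset(UTF8_NAMES_FLAG, sjis));
        // Flag clear: the configured encoding is used.
        System.out.println(entryNameCharset(0, sjis));
    }
}
```

Because the flag takes precedence, passing an encoding such as SJIS should not break archives that already declare their names as UTF-8.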

> encoding of ZipArchiveInputStream
> ---------------------------------
>
>                 Key: TIKA-936
>                 URL: https://issues.apache.org/jira/browse/TIKA-936
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.1
>            Reporter: Shinichiro Abe
>            Assignee: Jukka Zitting
>         Attachments: x-日本語メモ.zip
>
>


