Posted to dev@tika.apache.org by "Shinichiro Abe (JIRA)" <ji...@apache.org> on 2012/05/29 10:37:23 UTC
[jira] [Created] (TIKA-936) encoding of ZipArchiveInputStream
Shinichiro Abe created TIKA-936:
-----------------------------------
Summary: encoding of ZipArchiveInputStream
Key: TIKA-936
URL: https://issues.apache.org/jira/browse/TIKA-936
Project: Tika
Issue Type: Wish
Components: parser
Affects Versions: 1.1
Reporter: Shinichiro Abe
When extracting zip files that were created on Windows (Japanese locale),
the extracted file names are garbled.
ZipArchiveInputStream has three constructors.
With the modification below, which specifies the encoding explicitly as SJIS,
the file names were no longer garbled.
{code:title=PackageExtractor|borderStyle=solid}
public void parse(InputStream stream)
:
//unpack(new ZipArchiveInputStream(stream), xhtml);
unpack(new ZipArchiveInputStream(stream,"SJIS",true), xhtml);
:
{code}
In the first constructor, the platform's default encoding is used.
In my case the encoding of my computer is UTF-8 and the encoding of the zip file is SJIS,
so the file names were garbled.
File names will be garbled whenever the platform encoding
differs from the encoding of the zip file.
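The mismatch described above can be reproduced with the JDK alone. The following is a minimal sketch, not Tika or Commons Compress code: the class name GarbledNameDemo and the sample file name (日本語メモ.txt, written with Unicode escapes) are made up for illustration, and it assumes the running JDK provides the Shift_JIS charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbledNameDemo {
    // Hypothetical sample entry name, written with Unicode escapes
    // so the source compiles regardless of the compiler's encoding.
    static final String NAME = "\u65e5\u672c\u8a9e\u30e1\u30e2.txt";

    // Decodes the raw bytes of an entry name with the given charset.
    // Malformed input is replaced rather than rejected.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");

        // The bytes as a Japanese Windows zip tool would store them (assumption).
        byte[] raw = NAME.getBytes(sjis);

        // Reading those bytes as UTF-8 does not give back the original name.
        String wrong = decode(raw, StandardCharsets.UTF_8);
        // Reading them as Shift_JIS does.
        String right = decode(raw, sjis);

        System.out.println("decoded as UTF-8 matches original: " + wrong.equals(NAME));
        System.out.println("decoded as SJIS matches original:  " + right.equals(NAME));
    }
}
```

The point of the sketch is only that the bytes of the name are unchanged; which string comes out depends entirely on the charset the reader assumes.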
I would like Tika to accept an encoding parameter per file when parsing zip files.
Where should the encoding be given: somewhere in Metadata
or in ParseContext? Please support this.
I am using Tika via Solr (SolrCell), so when posting a zip file to Solr
I want to add an encoding parameter to the request.
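As a rough illustration of what a per-file encoding parameter could look like, here is a sketch that uses only the JDK's java.util.zip.ZipInputStream as a stand-in for ZipArchiveInputStream. The class and method names (ZipNameEncodingSketch, entryNames, zipWithEntry) are hypothetical, and how Tika would actually receive the charset (Metadata or ParseContext) is left open. It assumes the JDK provides the Shift_JIS charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNameEncodingSketch {

    // Lists entry names, decoding them with the charset supplied by the caller
    // instead of relying on a default.
    static List<String> entryNames(InputStream stream, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(stream, nameCharset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Test helper: builds an in-memory zip with one entry whose name
    // is stored using the given charset.
    static byte[] zipWithEntry(String name, Charset nameCharset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, nameCharset)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write("memo".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Charset sjis = Charset.forName("Shift_JIS");
        // Hypothetical sample name, written with Unicode escapes.
        String name = "\u65e5\u672c\u8a9e\u30e1\u30e2.txt";

        byte[] zipBytes = zipWithEntry(name, sjis);
        List<String> names = entryNames(new ByteArrayInputStream(zipBytes), sjis);

        System.out.println("name survived round trip: " + names.contains(name));
    }
}
```

In this sketch the caller, who knows where the archive came from, chooses the charset per file, which is the same idea as passing "SJIS" to the ZipArchiveInputStream constructor in the patch above.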
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-936) encoding of ZipArchiveInputStream
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285537#comment-13285537 ]
Nick Burch commented on TIKA-936:
---------------------------------
I've tried with Unzip, and I get garbage too, so it looks like Windows isn't storing the encoding used anywhere useful in the file.
I think the ParseContext would be the right place for it. We probably want to check whether any other archive formats suffer from this issue too, and fix those at the same time.
[jira] [Updated] (TIKA-936) encoding of ZipArchiveInputStream
Posted by "Shinichiro Abe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shinichiro Abe updated TIKA-936:
--------------------------------
Attachment: x-日本語メモ.zip
Here is a sample zip file. The file names extracted from it are garbled.
[jira] [Updated] (TIKA-936) encoding of ZipArchiveInputStream
Posted by "Shinichiro Abe (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shinichiro Abe updated TIKA-936:
--------------------------------
Description:
When extracting zip files that were created on Windows (Japanese locale),
the extracted file names are garbled.
ZipArchiveInputStream has three constructors.
With the modification below, which specifies the encoding explicitly as SJIS,
the file names were no longer garbled.
{code:title=PackageExtractor|borderStyle=solid}
public void parse(InputStream stream)
:
//unpack(new ZipArchiveInputStream(stream), xhtml);
unpack(new ZipArchiveInputStream(stream,"SJIS",true), xhtml);
:
{code}
In the first constructor, -the platform's default encoding- UTF-8 is used.
In my case the encoding of my computer is UTF-8 and the encoding of the zip file is SJIS,
so the file names were garbled.
File names will be garbled whenever the encoding used by
-the platform- this constructor differs from the encoding of the zip file.
I would like Tika to accept an encoding parameter per file when parsing zip files.
Where should the encoding be given: somewhere in Metadata
or in ParseContext? Please support this.
I am using Tika via Solr (SolrCell), so when posting a zip file to Solr
I want to add an encoding parameter to the request.