Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/11/02 14:12:25 UTC

[jira] Closed: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

     [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler closed TIKA-517.
----------------------------

    Resolution: Cannot Reproduce

> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
>                 Key: TIKA-517
>                 URL: https://issues.apache.org/jira/browse/TIKA-517
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Mac OS X, Java 6, Eclipse
>            Reporter: Dominique Béjean
>            Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese, Korean, Serbian, ..., I get an error about an unsupported encoding.
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException: 
> 	at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
> 	...
> It works fine with English or other iso-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the libraries used to extract text from the various formats, but in a later step.
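
The issue was closed as Cannot Reproduce, so the actual cause was never established. As a rough diagnostic sketch only, the plain-JDK snippet below shows how this exception type arises when a writer is requested for a charset name the JVM does not know, and that non-Latin text passes through an explicitly UTF-8 writer without error. The class and method names (EncodingCheck, canEncode, writeUtf8) and the charset name "x-no-such-charset" are hypothetical. Nothing here calls Tika or the Xerces serializer, and it is only an assumption that the serializer in the trace above was handed an encoding name of this kind.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

public class EncodingCheck {

    // Returns true if this JVM can create a writer for the given charset name.
    // A null name is not handled here and would raise NullPointerException.
    static boolean canEncode(String charsetName) {
        try {
            new OutputStreamWriter(new ByteArrayOutputStream(), charsetName).close();
            return true;
        } catch (UnsupportedEncodingException e) {
            // Same exception type as the one wrapped in the SAXException above.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes the text through an explicitly UTF-8 writer and decodes it again.
    static String writeUtf8(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Writer writer = new OutputStreamWriter(out, "UTF-8");
            writer.write(text);
            writer.close();
            return new String(out.toByteArray(), "UTF-8");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "x-no-such-charset" is a made-up name used to trigger the failure.
        System.out.println("UTF-8 usable: " + canEncode("UTF-8"));
        System.out.println("x-no-such-charset usable: " + canEncode("x-no-such-charset"));

        // Russian and Chinese sample text, written with Unicode escapes so the
        // source file compiles regardless of its own encoding.
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442, \u4e16\u754c";
        System.out.println("Round trip intact: " + sample.equals(writeUtf8(sample)));
    }
}
```

If a check like this passes for the encoding name in use, the JVM's charset support is probably not at fault, and attention would shift to whatever encoding name reaches the serializer.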

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.