Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/09/20 19:54:33 UTC

[jira] Commented: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

    [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912617#action_12912617 ] 

Ken Krugler commented on TIKA-517:
----------------------------------

Hi Dominique,

I'm not sure there's anything Tika can do here. The issue is in the Xerces BaseMarkupSerializer.startDocument() method, which appears to call Java's Charset class (directly or indirectly) with a charset name that isn't supported.

This can happen when the platform doesn't support the charset, or when you've got an invalid charset name from somewhere.
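
To tell those two cases apart, something along these lines might help. It's only a sketch: CharsetCheck and classify() are hypothetical names, not part of Tika or Xerces, and it only uses java.nio.charset.Charset.isSupported(), which returns false for a legal but unavailable name and throws IllegalCharsetNameException for a syntactically illegal one.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration; not part of Tika or Xerces.
final class CharsetCheck {

    enum Result { SUPPORTED, UNSUPPORTED, INVALID_NAME }

    // Classifies a charset name as usable, unknown to this JVM, or malformed.
    static Result classify(String name) {
        // Treat null and empty names as invalid up front, so the result
        // does not depend on how a given JDK handles them.
        if (name == null || name.isEmpty()) {
            return Result.INVALID_NAME;
        }
        try {
            return Charset.isSupported(name) ? Result.SUPPORTED : Result.UNSUPPORTED;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return Result.INVALID_NAME;
        }
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "x-no-such-charset", "bad name", ""};
        for (String name : names) {
            // Brackets make an empty name visible in the output.
            System.out.println("[" + name + "] -> " + classify(name));
        }
    }
}
```

Running it on the affected machine with the charset name in question would show whether the platform lacks the charset or the name itself is bad.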

We actually coded up our own "safeCharset" method in Tika, which is used when processing HTML documents.

Is there any way you can extract the actual charset name that's triggering this exception?
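
One possible way to get at it, as a rough sketch: catch the SAXException around the parse call and walk down to the wrapped UnsupportedEncodingException. EncodingCause, find() and describe() are hypothetical names for illustration. The sketch assumes the exception message carries the charset name, which is the usual convention but not guaranteed. It checks SAXException.getException() as well as getCause(), since older JDKs may only expose the wrapped exception through the former.

```java
import java.io.UnsupportedEncodingException;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of Tika or Xerces.
final class EncodingCause {

    // Returns the first UnsupportedEncodingException in the chain, or null.
    static UnsupportedEncodingException find(Throwable t) {
        int depth = 0;
        // The depth limit guards against self-referencing cause chains.
        while (t != null && depth++ < 32) {
            if (t instanceof UnsupportedEncodingException) {
                return (UnsupportedEncodingException) t;
            }
            Throwable next = null;
            if (t instanceof SAXException) {
                // SAXException keeps its wrapped exception here.
                next = ((SAXException) t).getException();
            }
            if (next == null) {
                next = t.getCause();
            }
            t = next;
        }
        return null;
    }

    // Formats the reported name in brackets so an empty or null one is visible.
    static String describe(Throwable t) {
        UnsupportedEncodingException e = find(t);
        if (e == null) {
            return "no UnsupportedEncodingException in chain";
        }
        return "charset name reported: [" + e.getMessage() + "]";
    }

    public static void main(String[] args) {
        Throwable sample = new SAXException(new UnsupportedEncodingException("some-charset"));
        System.out.println(describe(sample));
    }
}
```

Calling describe() on the exception caught from the failing extraction and pasting the output here would be enough.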

Thanks,

-- Ken

> java.io.UnsupportedEncodingException with Russian, Chinese, ... document
> ------------------------------------------------------------------------
>
>                 Key: TIKA-517
>                 URL: https://issues.apache.org/jira/browse/TIKA-517
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: Mac OS X, Java 6, Eclipse
>            Reporter: Dominique Béjean
>            Assignee: Ken Krugler
>
> When I try to extract text from a PDF or DOC document in Russian, Chinese, Korean, Serbian, ..., I get an error concerning an unsupported encoding.
> org.xml.sax.SAXException: java.io.UnsupportedEncodingException: 
> 	at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
> 	at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
> 	...
> It works fine with English or other ISO-8859-1 languages.
> PDFBox extracts the text correctly, so I assume the problem is not in the libraries used for text extraction from the various formats, but somewhere after them.
