Posted to dev@tika.apache.org by "Dominique Béjean (JIRA)" <ji...@apache.org> on 2010/09/18 17:39:33 UTC

[jira] Created: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

java.io.UnsupportedEncodingException with Russian, Chinese, ... document
------------------------------------------------------------------------

                 Key: TIKA-517
                 URL: https://issues.apache.org/jira/browse/TIKA-517
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.7
         Environment: Mac OS X, Java 6, Eclipse
            Reporter: Dominique Béjean


When I try to extract text from PDF or DOC documents in Russian, Chinese, Korean, Serbian, ..., I get an error about an unsupported encoding.

org.xml.sax.SAXException: java.io.UnsupportedEncodingException: 
	at org.apache.xml.serialize.BaseMarkupSerializer.startDocument(Unknown Source)
	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
	at org.apache.tika.sax.ContentHandlerDecorator.startDocument(ContentHandlerDecorator.java:84)
	at org.apache.tika.sax.XHTMLContentHandler.startDocument(XHTMLContentHandler.java:93)
	...

It works fine with English and other ISO-8859-1 languages.

PDFBox extracts the text correctly, so I assume the problem is not in the libraries used to extract text from the various formats, but somewhere after that step.
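
The trace above shows a java.io.UnsupportedEncodingException wrapped in an org.xml.sax.SAXException, with an empty message after the colon. A minimal sketch of how the wrapped exception and its message (normally the rejected charset name) could be pulled out when the error occurs; the class and method names are hypothetical, and the made-up charset name in main() is only there to exercise the helper:

```java
import java.io.UnsupportedEncodingException;
import org.xml.sax.SAXException;

public class EncodingCauseDemo {

    /**
     * Walks the chain of wrapped exceptions and returns the first
     * UnsupportedEncodingException found, or null if there is none.
     */
    public static UnsupportedEncodingException findEncodingFailure(Throwable t) {
        int depth = 0; // guard against pathological cause chains
        while (t != null && depth++ < 32) {
            if (t instanceof UnsupportedEncodingException) {
                return (UnsupportedEncodingException) t;
            }
            Throwable next = null;
            if (t instanceof SAXException) {
                // SAXException exposes its embedded exception via getException()
                next = ((SAXException) t).getException();
            }
            if (next == null) {
                next = t.getCause();
            }
            t = next;
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulate the shape of the failure in the stack trace (hypothetical name).
        SAXException wrapped =
                new SAXException(new UnsupportedEncodingException("x-made-up-charset"));
        UnsupportedEncodingException cause = findEncodingFailure(wrapped);
        // Brackets make an empty or blank name visible in the log.
        System.out.println("Rejected charset name: [" + cause.getMessage() + "]");
    }
}
```

In a real run, the helper would be called in the catch block around parser.parse(...); an empty pair of brackets would suggest that the encoding name reaching the serializer was itself empty.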



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

Posted by "Dominique Béjean (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927367#action_12927367 ] 

Dominique Béjean edited comment on TIKA-517 at 11/2/10 8:01 AM:
----------------------------------------------------------------

Hi,

Thank you for these replies.

While preparing a sample of my code, I ran some tests and I can't reproduce the issue anymore.

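My code looks like this:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.xml.serialize.BaseMarkupSerializer;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.TextSerializer;

InputStream input = new FileInputStream("russian.pdf");
String contentType = "application/pdf";
String outputEncoding = "UTF-8";

// Register the parser in the context so it is also used for embedded documents.
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
context.set(Parser.class, parser);

Metadata metadata = new Metadata();
metadata.add("stream_content_type", contentType);

// Serialize the SAX events produced by Tika as plain text into memory.
StringWriter writer = new StringWriter();
BaseMarkupSerializer serializer = new TextSerializer();
serializer.setOutputCharStream(writer);
serializer.setOutputFormat(new OutputFormat("text", outputEncoding, true));
parser.parse(input, serializer, metadata, context);
writer.close();

String content = writer.toString();
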


If I reproduce the problem later, I will provide details.

Dominique

[jira] Assigned: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-517:
--------------------------------

    Assignee: Ken Krugler

[jira] Closed: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler closed TIKA-517.
----------------------------

    Resolution: Cannot Reproduce

[jira] Commented: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912617#action_12912617 ] 

Ken Krugler commented on TIKA-517:
----------------------------------

Hi Dominique,

I'm not sure there's anything Tika can do here. The exception is thrown in the Xerces BaseMarkupSerializer.startDocument() method, which appears to call Java's Charset class (directly or indirectly) with a charset name that isn't supported.

This can happen when the platform doesn't support the charset, or when you've got an invalid charset name from somewhere.

We actually coded up our own "safeCharset" method in Tika, which is used when processing HTML documents.

Is there any way you can extract the actual charset name that's triggering this exception?

Thanks,

-- Ken
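
The two failure cases mentioned above (a charset the platform lacks, or a malformed name) can be told apart from a usable name with the JDK's java.nio.charset.Charset API. A minimal sketch, assuming a caller that wants to validate a name before handing it to a serializer; the class, the method names and the fallback policy are hypothetical, and this is not Tika's "safeCharset" implementation. It also only reflects the JVM's view, since a serializer may apply its own name mapping:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetCheck {

    /** Returns true only if the name is well-formed and supported by this JVM. */
    public static boolean isUsable(String name) {
        if (name == null || name.trim().isEmpty()) {
            return false; // null or blank names are never usable
        }
        try {
            return Charset.isSupported(name.trim());
        } catch (IllegalCharsetNameException e) {
            return false; // syntactically invalid name, e.g. one containing spaces
        }
    }

    /** Hypothetical fallback policy: use the given default when the name is unusable. */
    public static String orDefault(String name, String fallback) {
        return isUsable(name) ? name.trim() : fallback;
    }

    public static void main(String[] args) {
        String[] candidates = {"UTF-8", "ISO-8859-1", "", "not a charset"};
        for (String name : candidates) {
            System.out.println("[" + name + "] usable=" + isUsable(name)
                    + " -> " + orDefault(name, "UTF-8"));
        }
    }
}
```

Logging the bracketed name next to the result is one way to capture the exact string that triggers the exception.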

[jira] Commented: (TIKA-517) java.io.UnsupportedEncodingException with Russian, Chinese, ... document

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926793#action_12926793 ] 

Jukka Zitting commented on TIKA-517:
------------------------------------

The stack trace suggests that this exception is thrown while you're serializing the output from Tika, so, as Ken said, this doesn't seem to be a Tika issue. How do you specify the output encoding?
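
For comparison, here is a sketch of one place where an output encoding can be specified when serializing SAX events as plain text. It deliberately swaps the Xerces org.apache.xml.serialize classes for the JDK's javax.xml.transform API, so it is an alternative under that assumption rather than the reporter's setup; the class and method names are hypothetical, and the hand-made SAX events stand in for what a parser would emit:

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class TextHandlerDemo {

    /** Builds a SAX ContentHandler that writes text content to the given writer. */
    public static TransformerHandler newTextHandler(StringWriter out, String encoding)
            throws TransformerConfigurationException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // The output method and encoding are set explicitly, in one place.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "text");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
        handler.setResult(new StreamResult(out));
        return handler;
    }

    /** Feeds a tiny hand-made document through the handler and returns the output. */
    public static String serialize(String text) {
        try {
            StringWriter out = new StringWriter();
            TransformerHandler handler = newTextHandler(out, "UTF-8");
            char[] chars = text.toCharArray();
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes spell a short Russian word ("hello").
        System.out.println(serialize("\u041f\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

Because the target is a character stream, the encoding here mostly matters to the serializer's own bookkeeping; it would matter more if the result were a byte stream.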
