Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/05 16:06:09 UTC

[jira] [Commented] (TIKA-698) "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003

    [ https://issues.apache.org/jira/browse/TIKA-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097150#comment-13097150 ] 

Michael McCandless commented on TIKA-698:
-----------------------------------------

Hmm, I think we should replace invalid chars with the standard
Unicode replacement character (U+FFFD), not a space.  That way an app
can tell that something went wrong.  Replacing with a space is trappy
because at first glance things seem OK; it masks that there was a
problem.
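
A minimal sketch of that replacement, assuming the text is sanitized as
a Java String before it reaches the XML serializer. The class and method
names are hypothetical and are not part of Tika; where such a filter
would actually live in the handler chain is left open. Valid surrogate
pairs pass through unchanged, and any unpaired surrogate becomes U+FFFD.

```java
// Hypothetical helper, not an existing Tika class.
final class SurrogateSanitizer {

    private static final char REPLACEMENT = '\uFFFD';

    private SurrogateSanitizer() {
    }

    /**
     * Returns a copy of the input in which every unpaired UTF-16
     * surrogate is replaced by U+FFFD. Well-formed high/low pairs
     * and all other chars are copied as-is.
     */
    public static String replaceUnpairedSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // Well-formed pair: keep both code units.
                out.append(c).append(input.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // Lone high or lone low surrogate: mark it visibly.
                out.append(REPLACEMENT);
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

For the sequence in the report below (db00 bfff), db00 is a high
surrogate but bfff is not a low surrogate, so only db00 would be
replaced and bfff would be kept as an ordinary char.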


> "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003
> ---------------------------------------------------------------
>
>                 Key: TIKA-698
>                 URL: https://issues.apache.org/jira/browse/TIKA-698
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.1-incubating, 0.9
>            Reporter: Pablo Queixalos
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>         Attachments: MS8.ppt
>
>
> Exception when parsing this MS PowerPoint file: http://jeanferrette.free.fr/MS8.ppt
> java.io.IOException: Invalid UTF-16 surrogate detected: db00 bfff ?
>                 at com.sun.org.apache.xml.internal.serializer.ToStream.endElement(ToStream.java:2060)
>                 at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
>                 at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:215)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>                 at org.apache.tika.sax.XHTMLContentHandler.lazyEndHead(XHTMLContentHandler.java:169)
>                 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:234)
>                 at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:271)
>                 at org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:308)
>                 at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:41)
>                 at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
>                 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>                 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>                 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>                 [...]
> Parsing this file works fine with Tika 0.4.
