Posted to user@tika.apache.org by "P. Hill" <pa...@gmail.com> on 2011/12/07 02:42:34 UTC
Tika 1.0 Exception
Folks,
I was trying to upgrade to Tika 1.0 and found I could break tika-app
with some MSG files :-(
I have a Windows (Outlook) .msg file with an attached PDF which parses
in Tika-app 0.7, 0.9, 0.10
but in Tika-app 1.0 I get a stack trace.
<error>
Apache Tika was unable to parse the document
at \\....XYZ.msg
The full exception stack trace is included below:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@57284c88
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
[rest of stack trace removed]
</error>
I note that it says ...microsoft.OfficeParser, so I'm guessing it is
falling over in the message itself rather than in the attached PDF.
Is there anything I could do to configure the app?
Every version of the tika-app is started with the trivial command
similar to C:\dev\tools\Tika\1.0\tika-app-1.0.jar -g
and I drag and drop onto it.
Interestingly enough, running it from the command line results in what
looks like good output for all of the switches -m, -t, -x, -h
-Paul
Re: Tika 1.0 Exception
Posted by "P. Hill" <pa...@gmail.com>.
Mike,
See the issue [TIKA-801] which you referenced below. It was easy to
reproduce. I have attached a MSG file to the issue which blows chunks
when you drop it onto Tika-app 1.0. The example is a one-line e-mail
forwarded to myself, then saved as an MSG file outside of Outlook. I
suspect that a simple e-mail with an attachment is complex (compound)
enough to cause the same problem and it is not related particularly to
compound Outlook e-mails inside e-mails, because I believe I saw it on a
flat e-mail with an attachment.
Let me know if I can be of further assistance.
-Paul
On 12/7/2011 11:15 AM, Michael McCandless wrote:
> This looks just like:
>
> https://issues.apache.org/jira/browse/TIKA-801
>
> Likely Tika's parser is (incorrectly) producing invalid XHTML tags for
> your document... when you open the Jira issue can you attach the
> problematic document? Thanks.
>
> Mike McCandless
>
Re: Tika 1.0 Exception
Posted by "P. Hill" <pa...@gmail.com>.
On 12/7/2011 11:15 AM, Michael McCandless wrote:
> This looks just like:
>
> https://issues.apache.org/jira/browse/TIKA-801
>
> Likely Tika's parser is (incorrectly) producing invalid XHTML tags for
> your document... when you open the Jira issue can you attach the
> problematic document? Thanks.
>
Maybe, because ironically enough the document is an actual e-mail
exchange between the CTO of our company and an alpha test customer about
a non-disclosure agreement. :-(
If I could binary edit it to drop the names referenced and drop the
actual NDA attached document I might be able to generate an example that
fails. I will experiment with things like forwarding it without the
attachment and then hacking some bytes. If it still fails, I'll send it
your way.
-Paul
Re: Tika 1.0 Exception
Posted by Michael McCandless <lu...@mikemccandless.com>.
This looks just like:
https://issues.apache.org/jira/browse/TIKA-801
Likely Tika's parser is (incorrectly) producing invalid XHTML tags for
your document... when you open the Jira issue can you attach the
problematic document? Thanks.
Mike McCandless
http://blog.mikemccandless.com
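
One way to test the "invalid XHTML tags" theory without the Swing GUI or the JDK's HTML serializer in the loop is to watch the SAX events directly. What follows is only a minimal sketch using JDK classes; BalanceCheckingHandler is a hypothetical name and is not part of Tika. It assumes the suspect parser's events can be routed to a plain org.xml.sax ContentHandler, and it raises a SAXException as soon as an endElement arrives with no matching startElement, or when the document ends with elements still open.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: tracks open elements on a stack
// and rejects unbalanced SAX events.
class BalanceCheckingHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("endElement(" + qName + ") with no open element");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new SAXException("endElement(" + qName + ") but <" + expected + "> is open");
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("document ended with unclosed elements: " + open);
        }
    }

    // Number of elements currently open.
    public int depth() {
        return open.size();
    }

    public static void main(String[] args) {
        BalanceCheckingHandler handler = new BalanceCheckingHandler();
        try {
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.endElement("", "p", "p");
            // One endElement too many: the kind of event stream suspected here.
            handler.endElement("", "p", "p");
            handler.endDocument();
        } catch (SAXException e) {
            System.out.println("Unbalanced SAX events: " + e.getMessage());
        }
    }
}
```

If a handler like this complains on the problem file, that would point at the parser's event stream rather than at the serializer that threw; if it stays quiet, the theory needs another look.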
On Wed, Dec 7, 2011 at 1:31 PM, P. Hill <pa...@gmail.com> wrote:
> On 12/6/2011 6:50 PM, Nick Burch wrote:
>>
>> On Tue, 6 Dec 2011, P. Hill wrote:
>>>
>>> at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>>> at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>>> [rest of stack trace removed]
>>
>>
>> You've alas snipped the interesting bit, which is what the parser broke on
>
>
> Further note on the type of message: it was a many-level nested reply chain
> generated, I believe, by Outlook for all correspondents. The attached PDF
> itself parses in all versions of tika-app.
>
> Wow, really? You wanted to see the AWT call? Probably not, but here is the
> trace to swing followed by the cause.
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6337bb9c
>
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
> at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
> at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
> at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
> at javax.swing.TransferHandler.importData(Unknown Source)
>
> OOPS Sorry I didn't see the cause way down there: :-)
>
> Caused by: java.lang.NullPointerException
> at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(Unknown Source)
> at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(Unknown Source)
> at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
> at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
> at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
> at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
> at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
> at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:178)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>
>
>> Try with a recent svn nightly build, and see if that fixes it. If not,
>> please post a problem file and the full stacktrace to a new issue in JIRA
>
>
> I will try to find time to check into that.
> -Paul
>
Re: Tika 1.0 Exception
Posted by "P. Hill" <pa...@gmail.com>.
On 12/6/2011 6:50 PM, Nick Burch wrote:
> On Tue, 6 Dec 2011, P. Hill wrote:
>> at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>> at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>> [rest of stack trace removed]
>
> You've alas snipped the interesting bit, which is what the parser
> broke on
Further note on the type of message: it was a many-level nested reply
chain generated, I believe, by Outlook for all correspondents. The
attached PDF itself parses in all versions of tika-app.
Wow, really? You wanted to see the AWT call? Probably not, but here is
the trace to swing followed by the cause.
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6337bb9c
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
at javax.swing.TransferHandler.importData(Unknown Source)
OOPS Sorry I didn't see the cause way down there: :-)
Caused by: java.lang.NullPointerException
at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(Unknown Source)
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:178)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> Try with a recent svn nightly build, and see if that fixes it. If not,
> please post a problem file and the full stacktrace to a new issue in JIRA
I will try to find time to check into that.
-Paul
Re: Tika 1.0 Exception
Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 6 Dec 2011, P. Hill wrote:
> at org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
> at org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
> [rest of stack trace removed]
You've alas snipped the interesting bit, which is what the parser broke on
> Is there anything I could do to configure the app?
Try with a recent svn nightly build, and see if that fixes it. If not,
please post a problem file and the full stacktrace to a new issue in JIRA
Nick