Posted to user@tika.apache.org by "P. Hill" <pa...@gmail.com> on 2011/12/07 02:42:34 UTC

Tika 1.0 Exception

Folks,

I was trying to upgrade to Tika 1.0 and found I could break tika-app 
with some MSG files :-(
I have a Windows (Outlook) .msg file with an attached PDF which parses 
in Tika-app 0.7, 0.9, 0.10
but in Tika-app 1.0 I get a stack trace.

<error>
Apache Tika was unable to parse the document
at \\....XYZ.msg
The full exception stack trace is included below:

org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.microsoft.OfficeParser@57284c88
     at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
     at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
     at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
     at 
org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
     at 
org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
[rest of stack trace removed]
</error>
I note that it says ...microsoft.OfficeParser, so I'm guessing it is in 
the message where it is falling over.
Is there anything I could do to configure the app?
Every version of the tika-app is started with the trivial command 
similar to C:\dev\tools\Tika\1.0\tika-app-1.0.jar -g
and I drag and drop onto it.
Interestingly enough, running it from the command line results in what 
looks like good output for all of the switches I tried: -m, -t, -x, -h

-Paul

Re: Tika 1.0 Exception

Posted by "P. Hill" <pa...@gmail.com>.
Mike,

See the issue [TIKA-801] which you referenced below.  It was easy to 
reproduce.  I have attached a MSG file to the issue which blows chunks 
when you drop it onto Tika-app 1.0.  The example is a one-line e-mail 
forwarded to myself then saved as an MSG file outside of Outlook.   I 
suspect that a simple e-mail with an attachment is complex (compound) 
enough to cause the same problem and it is not related particularly to 
compound Outlook e-mails inside e-mails, because I believe I saw it on a 
flat e-mail with an attachment.

Let me know if I can be of further assistance.

-Paul

On 12/7/2011 11:15 AM, Michael McCandless wrote:
> This looks just like:
>
>      https://issues.apache.org/jira/browse/TIKA-801
>
> Likely Tika's parser is (incorrectly) producing invalid XHTML tags for
> your document... when you open the Jira issue can you attach the
> problematic document?  Thanks.
>
> Mike McCandless
>


Re: Tika 1.0 Exception

Posted by "P. Hill" <pa...@gmail.com>.
On 12/7/2011 11:15 AM, Michael McCandless wrote:
> This looks just like:
>
>      https://issues.apache.org/jira/browse/TIKA-801
>
> Likely Tika's parser is (incorrectly) producing invalid XHTML tags for
> your document... when you open the Jira issue can you attach the
> problematic document?  Thanks.
>

Maybe, because ironically enough the document is an actual e-mail 
exchange between the CTO of our company and an alpha test customer about 
a non-disclosure agreement. :-(

If I could binary edit it to drop the names referenced and drop the 
actual NDA attached document I might be able to generate an example that 
fails.   I will experiment with things like forwarding it without the 
attachment and then hacking some bytes.  If it still fails, I'll send it 
your way.

-Paul

Re: Tika 1.0 Exception

Posted by Michael McCandless <lu...@mikemccandless.com>.
This looks just like:

    https://issues.apache.org/jira/browse/TIKA-801

Likely Tika's parser is (incorrectly) producing invalid XHTML tags for
your document... when you open the Jira issue can you attach the
problematic document?  Thanks.

Mike McCandless

http://blog.mikemccandless.com
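
One way to test the "invalid XHTML tags" theory is to watch the SAX events for start/end calls that do not pair up. The following is only a minimal sketch using the JDK's SAX classes; the class name BalanceCheckingHandler and its demo() method are hypothetical and not part of Tika, and whether TIKA-801 is really an unbalanced-tag problem is an assumption here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): records start/end element
// calls that do not pair up, instead of serializing anything.
class BalanceCheckingHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    private final List<String> problems = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            problems.add("endElement(" + qName + ") with no open element");
            return;
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            problems.add("endElement(" + qName + ") but <" + expected + "> was open");
        }
    }

    @Override
    public void endDocument() {
        // Anything still on the stack was opened but never closed.
        for (String name : open) {
            problems.add("<" + name + "> never closed");
        }
        open.clear();
    }

    List<String> problems() {
        return problems;
    }

    // Feeds a deliberately broken event sequence: one end tag too many.
    static List<String> demo() {
        BalanceCheckingHandler handler = new BalanceCheckingHandler();
        handler.startElement("", "html", "html", null);
        handler.endElement("", "html", "html");
        handler.endElement("", "body", "body"); // no matching start
        handler.endDocument();
        return handler.problems();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Since the class extends DefaultHandler, it is an ordinary org.xml.sax.ContentHandler and could be handed to any parser that emits SAX events. It only reports mismatches; it does not try to repair them.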

On Wed, Dec 7, 2011 at 1:31 PM, P. Hill <pa...@gmail.com> wrote:
> On 12/6/2011 6:50 PM, Nick Burch wrote:
>>
>> On Tue, 6 Dec 2011, P. Hill wrote:
>>>
>>>   at
>>> org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>>>   at
>>> org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>>> [rest of stack trace removed]
>>
>>
>> You've alas snipped the interesting bit, which is what the parser broke on
>
>
> Further note on the type of message, it was a many-level nested reply chain
> generated by I believe Outlook for all correspondents.  The attached PDF
> itself parses in all versions of tika-app.
>
> Wow, really?  You wanted to see the AWT call? Probably not, but here is the
> trace to swing followed by the cause.
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser@6337bb9c
>
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>    at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>    at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>    at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>    at
> org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>    at
> org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>    at javax.swing.TransferHandler.importData(Unknown Source)
>
> OOPS Sorry I didn't see the cause way down there: :-)
>
> Caused by: java.lang.NullPointerException
>    at
> com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(Unknown
> Source)
>    at
> com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(Unknown
> Source)
>    at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
>    at
> org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
>    at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at
> org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
>    at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>    at
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
>    at
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
>    at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:178)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>
>
>> Try with a recent svn nightly build, and see if that fixes it. If not,
>> please post a problem file and the full stacktrace to a new issue in JIRA
>
>
> I will try to find time to check into that.
> -Paul
>

Re: Tika 1.0 Exception

Posted by "P. Hill" <pa...@gmail.com>.
On 12/6/2011 6:50 PM, Nick Burch wrote:
> On Tue, 6 Dec 2011, P. Hill wrote:
>>    at 
>> org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>>    at 
>> org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
>> [rest of stack trace removed]
>
> You've alas snipped the interesting bit, which is what the parser 
> broke on

Further note on the type of message, it was a many-level nested reply 
chain generated by I believe Outlook for all correspondents.  The 
attached PDF itself parses in all versions of tika-app.

Wow, really?  You wanted to see the AWT call? Probably not, but here is 
the trace to swing followed by the cause.
org.apache.tika.exception.TikaException: Unexpected RuntimeException 
from org.apache.tika.parser.microsoft.OfficeParser@6337bb9c
     at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
     at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
     at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
     at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
     at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
     at 
org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
     at 
org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
     at javax.swing.TransferHandler.importData(Unknown Source)

OOPS Sorry I didn't see the cause way down there: :-)

Caused by: java.lang.NullPointerException
     at 
com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(Unknown 
Source)
     at 
com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(Unknown 
Source)
     at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
     at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
     at 
org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
     at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
     at 
org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
     at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
     at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
     at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
     at 
org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
     at 
org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:213)
     at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:178)
     at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)

> Try with a recent svn nightly build, and see if that fixes it. If not, 
> please post a problem file and the full stacktrace to a new issue in JIRA

I will try to find time to check into that.
-Paul
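
The cause above starts in the JDK's own HTML serializer (ToHTMLStream, reached through TransformerHandlerImpl), so the GUI is evidently pushing SAX events into a JAXP TransformerHandler. A minimal sketch of that kind of pipeline, using only JDK classes, might look like the following. The class name HtmlSerializerSketch, the element names and the "html" output method are assumptions for illustration; this is not the TikaGUI code.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for a SAX-to-HTML output pipeline (not TikaGUI itself).
class HtmlSerializerSketch {

    // Sends a small, correctly balanced event stream to the JDK serializer.
    static String serialize(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            // Set output properties before attaching the result.
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            AttributesImpl noAttributes = new AttributesImpl();
            handler.startDocument();
            handler.startElement("", "html", "html", noAttributes);
            handler.startElement("", "body", "body", noAttributes);
            handler.startElement("", "p", "p", noAttributes);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "p", "p");
            handler.endElement("", "body", "body");
            handler.endElement("", "html", "html");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("hello from a balanced event stream"));
    }
}
```

With balanced events the serializer is happy. What it does when it receives an endElement with no matching startElement depends on the JDK version, so the sketch does not try to reproduce the NullPointerException.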


Re: Tika 1.0 Exception

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 6 Dec 2011, P. Hill wrote:
>    at 
> org.apache.tika.gui.ParsingTransferHandler.importFiles(ParsingTransferHandler.java:94)
>    at 
> org.apache.tika.gui.ParsingTransferHandler.importData(ParsingTransferHandler.java:77)
> [rest of stack trace removed]

You've alas snipped the interesting bit, which is what the parser broke on

> Is there anything I could do to configure the app?

Try with a recent svn nightly build, and see if that fixes it. If not, 
please post a problem file and the full stacktrace to a new issue in JIRA

Nick
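
Since the interesting part of a trace like this is usually the last "Caused by", it can help to walk the cause chain explicitly when calling a parser from your own code. A minimal sketch with plain JDK exceptions follows; the class name CauseChain is hypothetical, and the RuntimeException in main merely stands in for the TikaException seen in this thread.

```java
// Hypothetical helper: finds and prints the innermost cause of an exception.
class CauseChain {

    // Follows getCause() until there is no further cause.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // One line per exception in the chain, outermost first.
    static String describe(Throwable t) {
        StringBuilder sb = new StringBuilder();
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (sb.length() > 0) {
                sb.append("\nCaused by: ");
            }
            sb.append(c.getClass().getName());
            if (c.getMessage() != null) {
                sb.append(": ").append(c.getMessage());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the wrapped failure; not a real Tika call.
        Throwable wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser", new NullPointerException());
        System.out.println(describe(wrapped));
        // The stack trace of the root cause is the part worth posting in full.
        rootCause(wrapped).printStackTrace();
    }
}
```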