Posted to dev@tika.apache.org by Jerome Lacoste <je...@gmail.com> on 2011/12/22 17:18:30 UTC

Parser stability and ForkParser

Hei,

I opened a couple of issues to note some parser instability:

https://issues.apache.org/jira/browse/TIKA-815
https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
https://issues.apache.org/bugzilla/show_bug.cgi?id=52373
https://issues.apache.org/jira/browse/COMPRESS-169

TIKA-815 is the overall one. It points out that Tika could use a few
more tests to ensure that the underlying parsers are more robust.
Because Tika has a general parser interface, this kind of stress
testing can be applied to all parsers, which may be a good idea.
The code is simple and available on github. Feedback appreciated.
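
For readers who want a feel for this kind of stress test, here is a minimal sketch of single-bit-flip fuzzing against a generic parser callback. It is not the code from the github project mentioned above. The class name, the `Consumer<byte[]>` stand-in for a parser and the toy parser are all hypothetical and only illustrate the idea.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BitFlipStressDemo {

    // Returns a copy of data with exactly one bit inverted; the input is not modified.
    static byte[] flipBit(byte[] data, int bitIndex) {
        byte[] copy = data.clone();
        copy[bitIndex / 8] ^= (byte) (1 << (bitIndex % 8));
        return copy;
    }

    // Feeds every single-bit mutation of the input to the parser and
    // collects the bit positions for which the parser threw.
    static List<Integer> findCrashingBits(byte[] original, Consumer<byte[]> parser) {
        List<Integer> failures = new ArrayList<>();
        for (int bit = 0; bit < original.length * 8; bit++) {
            try {
                parser.accept(flipBit(original, bit));
            } catch (RuntimeException | Error e) {
                failures.add(bit);
            }
        }
        return failures;
    }

    // Hypothetical toy parser: rejects any input that does not start with 'D'.
    static void toyParser(byte[] data) {
        if (data.length == 0 || data[0] != 'D') {
            throw new IllegalStateException("bad header");
        }
    }

    public static void main(String[] args) {
        byte[] original = "DOC".getBytes(StandardCharsets.US_ASCII);
        List<Integer> failures = findCrashingBits(original, BitFlipStressDemo::toyParser);
        System.out.println("Crashing bit positions: " + failures);
    }
}
```

With a real parser behind the callback, the same loop works for any file format, which is what a common parser interface makes possible.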




Now a question that pertains more to the user list. In TIKA-815, Nick
pointed out that one could use ForkParser to improve stability. I didn't
manage to get it to work.

When I use the command line tika app, e.g. with

java -jar /tmp/tika-app-1.0.jar -v -t -f  brokenFile.doc

then tika reports nothing.

But if I try to reproduce something similar programmatically, I run into
strange errors.

The first occurs because my current class loader isn't serializable and the
client tries to serialize it:

org.apache.tika.exception.TikaException: Failed to communicate with a
forked parser process. The process has most likely crashed due to some
error like running out of memory. A new process will be started for
the next parsing request.
	at org.apache.tika.fork.ForkParser.parse(ForkParser.java:123)
	at org.apache.tika.Tika.parseToString(Tika.java:380)
	at org.apache.tika.Tika.parseToString(Tika.java:414)
	at no.finntech.tika.harderner.TikaIndexerHardenerTest.parseContent(TikaIndexerHardenerTest.java:142)
	at no.finntech.tika.harderner.TikaIndexerHardenerTest.flipBitAndIndexContent(TikaIndexerHardenerTest.java:125)
	at no.finntech.tika.harderner.TikaIndexerHardenerTest.originalFileIndexesProperly4(TikaIndexerHardenerTest.java:69)
	at no.finntech.tika.harderner.TikaIndexerHardenerTest.main(TikaIndexerHardenerTest.java:170)
Caused by: java.io.NotSerializableException: sun.misc.Launcher$AppClassLoader
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1164)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
	at java.util.HashMap.writeObject(HashMap.java:1001)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1469)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
	at org.apache.tika.fork.ForkObjectInputStream.sendObject(ForkObjectInputStream.java:84)
	at org.apache.tika.fork.ForkClient.sendObject(ForkClient.java:135)
	at org.apache.tika.fork.ForkClient.call(ForkClient.java:108)
	at org.apache.tika.fork.ForkParser.parse(ForkParser.java:120)
	... 6 more
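
The "Caused by" part of this trace can be reproduced without Tika. The sketch below is only an illustration under assumptions: `LoaderHolder` is a hypothetical stand-in for any serializable object that keeps a reference to a class loader, and the map stands in for whatever context object carries it. Standard Java serialization fails as soon as it reaches the class loader field.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class ClassLoaderSerializationDemo {

    // Hypothetical object that is Serializable itself but holds a ClassLoader,
    // which is not Serializable.
    static class LoaderHolder implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ClassLoader loader;

        LoaderHolder(ClassLoader loader) {
            this.loader = loader;
        }
    }

    // Returns the name of the class that could not be serialized,
    // or null if serialization succeeded.
    static String trySerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A map holding the object, similar in shape to a context that stores a parser.
        Map<String, Object> context = new HashMap<>();
        context.put("holder", new LoaderHolder(ClassLoader.getSystemClassLoader()));
        System.out.println("Not serializable: " + trySerialize(context));
    }
}
```

The reported class name depends on the JVM; on the Java 6 runtime in the trace above it is sun.misc.Launcher$AppClassLoader.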

This is because Tika tries to serialize the ForkParser held in the
ParseContext. I solved this by introducing:

    private void setContextParser(ParseContext context) {
        Parser p = parser;
        if (parser instanceof ForkParser) {
            // requires exposing the parser in ForkParser
            p = ((ForkParser) parser).getParser();
        }
        context.set(Parser.class, p);
    }

and modifying parseToString(...) with:

            ParseContext context = new ParseContext();
            setContextParser(context);

So there may be a bug here.

This solves the exception but causes tika to not report any error when
parsing. It just doesn't parse anything and returns gracefully.

Any ideas?

Jerome

Re: Parser stability and ForkParser

Posted by Nick Burch <ni...@alfresco.com>.
On 22/12/11 16:18, Jerome Lacoste wrote:
> Now a question that pertains more to the user list. In TIKA-815, Nick
> pointed that one could use ForkedParser to improve stability. I didn't
> manage to get it to work.
>
> When I use the command line tika app, e.g. with
>
> java -jar /tmp/tika-app-1.0.jar -v -t -f  brokenFile.doc
>
> then tika reports nothing.

You may have hit TIKA-808 with this. It only seems to occur on some files.

> But if I try to reproduce something similar programatically I run into
> strange errors:
>
> first because my current classLoader isn't serializable and the client
> tries to serialize it.

I got the same problem when I tried to write a unit test for TIKA-808

Hopefully Jukka will have some time over Christmas to work on this :)

Nick

Re: Parser stability and ForkParser

Posted by Jerome Lacoste <je...@gmail.com>.
On Thu, Dec 22, 2011 at 5:18 PM, Jerome Lacoste
<je...@gmail.com> wrote:
> Hei,
>
> I opened a couple of issues to note some parser instability:
>
> https://issues.apache.org/jira/browse/TIKA-815
> https://issues.apache.org/bugzilla/show_bug.cgi?id=52372
> https://issues.apache.org/bugzilla/show_bug.cgi?id=52373
> https://issues.apache.org/jira/browse/COMPRESS-169
>
> TIKA-815 is the overall one that points to the fact that tika could
> have a few more tests to ensure that the underlying parsers are more
> robusts. The fact that Tika has a general interface allows those
> stress testing to be applied on all parsers, which may be a good idea.
> The code is simple and available on github. Feedback appreciated.
>
>
>
>
> Now a question that pertains more to the user list. In TIKA-815, Nick
> pointed that one could use ForkedParser to improve stability. I didn't
> manage to get it to work.
>
> When I use the command line tika app, e.g. with
>
> java -jar /tmp/tika-app-1.0.jar -v -t -f  brokenFile.doc
>
> then tika reports nothing.
>
> But if I try to reproduce something similar programatically I run into
> strange errors:

[...]

> This solves the exception but causes tika to not report any error when
> parsing. It just doesn't parse anything and returns gracefully.

A bit more information (and another bug):

If I change my javaCommand to allow debugging:

        parser.setJavaCommand("java -Xmx32m -Xdebug "
                + "-Xrunjdwp:transport=dt_socket,address=54321,server=y,suspend=y");

Then the forked process will write

    Listening for transport dt_socket at address: 54321

to its output. This confuses the client when it tries to ping the
forked process:

    public synchronized boolean ping() {
        try {
            output.writeByte(ForkServer.PING);
            output.flush();
            while (true) {
                consumeErrorStream();
                int type = input.read();
                if (type == ForkServer.PING) {
                    consumeErrorStream();
                    return true;
                } else {
                    return false;
                }
            }
        } catch (IOException e) {
            return false;
        }
    }

The input contains the "Listening for..." message, so ping returns false
and the client closes the communication.

The ping method assumes that nothing else is waiting to be read, which is
wrong. (This is reproducible given the aforementioned context/parser fix.)
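
The effect can be shown in isolation with a small sketch. Everything here is an assumption made for illustration: `PING` is a hypothetical marker value standing in for ForkServer.PING, and the byte arrays simulate what the forked process writes to its standard output. The check mirrors the quoted ping() in that only the first byte read is compared against the marker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PingNoiseDemo {

    // Hypothetical marker value; stands in for ForkServer.PING.
    static final byte PING = 2;

    // Same decision as the quoted ping(): the first byte read must be the marker.
    static boolean firstByteIsPing(byte[] serverOutput) {
        return new ByteArrayInputStream(serverOutput).read() == PING;
    }

    // Simulated server output: an optional text banner followed by the ping reply.
    static byte[] serverOutput(String banner) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] text = banner.getBytes(StandardCharsets.US_ASCII);
        out.write(text, 0, text.length);
        out.write(PING);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String banner = "Listening for transport dt_socket at address: 54321\n";
        System.out.println("clean stream:   " + firstByteIsPing(serverOutput("")));
        System.out.println("with banner:    " + firstByteIsPing(serverOutput(banner)));
    }
}
```

In the second case the first byte is the 'L' of the banner, so the check fails even though the ping reply is present later in the stream.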


Now to get back to my problem: the reported OOM never gets read by the client.
This is caused by ForkClient#waitForResponse. Its if/else chain tests the
same value twice, because type == -1 and type == ForkServer.ERROR are
identical comparisons, so the ERROR case is never handled.

So replace
            } else if (type == ForkServer.ERROR) {
with
            } else if ((byte) type == ForkServer.ERROR) {
and the error is properly reported.
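
Why the cast matters can be checked with plain java.io, independent of Tika. In this sketch the constant `ERROR` is an assumption: a byte with value -1, chosen to match the statement above that comparing against -1 and against ForkServer.ERROR is the same thing. InputStream.read() returns the byte as an int in the range 0..255, so a byte written as -1 comes back as 255.

```java
import java.io.ByteArrayInputStream;

class SignedByteCompareDemo {

    // Assumption for illustration: a byte marker with value -1.
    static final byte ERROR = -1;

    // Reads one byte the way InputStream.read() reports it:
    // an int in 0..255, or -1 only at end of stream.
    static int readType(byte marker) {
        return new ByteArrayInputStream(new byte[] { marker }).read();
    }

    // Original comparison: the byte constant is widened to the int -1.
    static boolean matchesWithoutCast(int type) {
        return type == ERROR;
    }

    // Patched comparison: the int is narrowed back to a byte first.
    static boolean matchesWithCast(int type) {
        return (byte) type == ERROR;
    }

    public static void main(String[] args) {
        int type = readType(ERROR);
        System.out.println("value read:   " + type);
        System.out.println("without cast: " + matchesWithoutCast(type));
        System.out.println("with cast:    " + matchesWithCast(type));
    }
}
```

Note that (byte) -1 also equals the marker, so with the cast in place the end-of-stream check has to stay ahead of the ERROR check.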

To summarize, I think Tika has 3 issues:
* Tika#parseToString() sends the wrong parser in the context when we fork
* waitForResponse doesn't properly handle ERROR responses because of the broken comparison
* forking doesn't work. This one I have no fix for right now.

If needed, I will update my tika-hardener test to provide a full
working test and proper patches to the program.

Jerome