Posted to dev@tika.apache.org by Nicholas DiPiazza <ni...@gmail.com> on 2020/06/25 18:10:59 UTC

Need some help understanding why this code gets stuck in timeout exceptions

I need some help with a project I'm trying to port over to be a part of Tika.

I am trying to extend the existing Fork Parser to add a "Fork Parser 2.0"
which supports connection pools using commons-pool, and supports an
improved ability to "stop parsing after N characters".

Here is the latest code: https://github.com/nddipiazza/tika-fork/tree/2.3.1

When I use this project, it works great on my local environment. When I
throw it out in the world, I get intermittent errors related to timeouts:

Parse error for input: c:\test\docs\c6f13fe7-40bb-4c64-8cfe-5d748b5c8567.xlsm
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
  at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_181]
  at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_181]
  at org.apache.tika.client.TikaRunner.parseImpl(TikaRunner.java:124) ~[tika-fork-client-2.3.1.jar:?]
  at org.apache.tika.client.TikaRunner.parse(TikaRunner.java:58) ~[tika-fork-client-2.3.1.jar:?]
  at org.apache.tika.client.TikaProcess.parse(TikaProcess.java:185) ~[tika-fork-client-2.3.1.jar:?]
  at org.apache.tika.client.TikaProcessPool.parse(TikaProcessPool.java:145) ~[tika-fork-client-2.3.1.jar:?]
  at com.lucidworks.apollo.pipeline.parse.impl.tika.TikaForkParser.parse(TikaForkParser.java:236) ~[lucid-parsing-4.2.2.jar:?]
  ... 12 more
Caused by: java.lang.RuntimeException: Failed to read content from forked Tika parser JVM
  at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:118) ~[tika-fork-client-2.3.1.jar:?]
  ... 4 more
Caused by: java.util.concurrent.TimeoutException: Timed out waiting 120000 ms for metadata after content was fully parsed.
  at org.apache.tika.client.TikaRunner.lambda$parseImpl$2(TikaRunner.java:112) ~[tika-fork-client-2.3.1.jar:?]
  ... 4 more

And because each timeout requires the forked Tika JVM to be killed and
respawned, this can cause churn that leads to more timeouts, because
starting up a Tika JVM takes a significant amount of time.

Does anyone have any experience with the existing Tika Fork Parser? I would
imagine this file contains my main issue:
https://github.com/nddipiazza/tika-fork/blob/2.3.1/tika-fork-main/src/main/java/org/apache/tika/fork/main/TikaForkMain.java

I'm attempting to use three executors independently, and I suspect I
should be doing this in a different way that isn't so fragile with respect
to timeouts.
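
One common way to make several executor-backed stages less fragile is to
compute a single deadline for the whole parse and give each Future.get()
only the time that remains, instead of a fresh fixed timeout per stage. The
following is a minimal sketch of that idea, not the tika-fork code: the
class DeadlineAwait and the method awaitAll are hypothetical names, and it
assumes each stage is exposed as a java.util.concurrent.Future.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of tika-fork or Tika.
final class DeadlineAwait {
    private DeadlineAwait() {}

    /**
     * Waits for every future under one shared deadline and returns the
     * results in order. On any failure, all futures are cancelled so no
     * stage is left running after the caller has given up.
     */
    static <T> List<T> awaitAll(List<Future<T>> futures, long totalTimeoutMs)
            throws InterruptedException, ExecutionException, TimeoutException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(totalTimeoutMs);
        List<T> results = new ArrayList<>();
        try {
            for (Future<T> future : futures) {
                // Each stage only gets what is left of the overall budget.
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    throw new TimeoutException(
                            "Overall deadline of " + totalTimeoutMs + " ms exceeded");
                }
                results.add(future.get(remaining, TimeUnit.NANOSECONDS));
            }
            return results;
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            for (Future<T> future : futures) {
                future.cancel(true);
            }
            throw e;
        }
    }
}
```

With this shape, a slow content stage eats into the same budget as the
metadata stage, so the caller sees one timeout for the document rather than
a second full wait after content has already been parsed.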

Does anyone have some time to code review this and tell me what they might
think is wrong?

-Nicholas DiPiazza

Re: Need some help understanding why this code gets stuck in timeout exceptions

Posted by Tim Allison <ta...@apache.org>.
>I had to put a retry around requests to the tika api calls because
sometimes they flake

Yes.  This is an important point.  Note that it is not flaking; it is an
intended restart after catastrophic failure.  But, yes, absolutely, clients
should have retry logic.

I just updated the wiki to make this point.
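
As an illustration of client-side retry logic, here is a minimal sketch of
a retry wrapper with exponential backoff. The class Retry and the method
withBackoff are hypothetical names, not part of Tika or tika-server, and
the attempt count and delays are placeholders to tune for your own restart
times.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Tika or tika-server.
final class Retry {
    private Retry() {}

    /**
     * Runs the call up to maxAttempts times, sleeping between failed
     * attempts and doubling the delay each time. Rethrows the last failure
     * once the attempts are used up.
     */
    static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (InterruptedException e) {
                // Do not retry when the calling thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs); // give the server time to restart
                    delayMs *= 2;
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the HTTP request in the Callable. It is probably worth
retrying only connection-level failures, since a document that crashes the
parser is likely to do so again on every attempt.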

Please let us know what else you find.

Cheers,

         Tim


Re: Need some help understanding why this code gets stuck in timeout exceptions

Posted by Nicholas DiPiazza <ni...@gmail.com>.
Following up on this thread in case anyone stumbles upon it in searches: I
am abandoning this tika-fork code and replacing it with a TikaServerPool
that pools tika-server JVM instances with --spawnChild enabled, and a
client that fires off /rmeta/text requests to round-robin-selected members
of this pool. This has its own set of quirks, but all in all the results
are much more robust for multi-million-document crawls. I am seeing far
fewer timeout exceptions.

I had to put a retry around requests to the Tika API calls because
sometimes they flake out for a period of time and then come back, but that
seems to be the end of it.
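
For readers who land here from a search, the round-robin selection can be
sketched roughly as follows. This is not the actual TikaServerPool code,
which is not shown in this thread: the class RoundRobinEndpoints, its
method nextEndpoint, and the example base URLs are assumptions made for
illustration. Only the /rmeta/text path comes from the description above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical class for illustration; the real TikaServerPool is not shown here.
final class RoundRobinEndpoints {
    private final List<URI> endpoints;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinEndpoints(List<String> baseUrls) {
        if (baseUrls.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        List<URI> uris = new ArrayList<>();
        for (String base : baseUrls) {
            // Strip a trailing slash so the path is appended consistently.
            String trimmed = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
            uris.add(URI.create(trimmed + "/rmeta/text"));
        }
        this.endpoints = uris;
    }

    /** Returns the next endpoint in rotation; safe to call from many threads. */
    URI nextEndpoint() {
        // floorMod keeps the index non-negative after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }
}
```

For example, a pool built from "http://localhost:9998" and
"http://localhost:9999" (assumed addresses) hands out the two /rmeta/text
URIs alternately, so a server that is busy restarting only affects every
other request until it comes back.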
