Posted to dev@tika.apache.org by Nicholas DiPiazza <ni...@gmail.com> on 2020/06/25 13:40:57 UTC

Is there a way to use Tika Fork parser along with the Tika Server?

I have a use case for Tika Server that I'm sure is pretty common.

I have parse nodes that download files from data sources and need to
parse out the content and metadata from these files. The parsing needs to
be resilient to OOMs and needs to time out gracefully.

Up until now, I've been using this project to parse files:
https://github.com/nddipiazza/tika-fork. It manages a pool of JVMs and
pushes the requests through them, so that if a file is a bomb and blows up
a JVM, it does not affect my program.

However, when I use this out in the wild, I get a lot of strange timeouts
that I can't reproduce locally. I guess they are related to system
resources on those systems, but I can't really figure out what the problem
is.

So I'm thinking instead I will try out a different approach.

I would like to have each parser node have its own Tika Server running,
and I'll just use the endpoint

http://localhost:9998/unpack/all

But I'm worried this will be plagued by the same problems that prompted me
to go to the tika-fork parser: the server will continually go down due to
OOMs, because random files in the wild turn out to be Tika bombs, or it
will hit CPU spikes due to infinite loops, etc.

How is everyone else managing to do this in the field? Is there a way to
configure a Tika Fork parser on the Tika Server so that it does not crash
on zip bombs, Excel bombs, etc.?

-Nicholas DiPiazza

Re: Is there a way to use Tika Fork parser along with the Tika Server?

Posted by Nicholas DiPiazza <ni...@gmail.com>.
Cool. I'll give this a try. Thanks!


Re: Is there a way to use Tika Fork parser along with the Tika Server?

Posted by Tim Allison <ta...@apache.org>.
Gracefully... well... give --spawnChild a try.

That forks a child process that is the server. Now, unless you put a bunch
of these behind a load balancer, your client will have to be resilient
while the server is restarting. The other problem with this in a
multithreaded environment is that you can't necessarily tell which file
killed the server: threadA sends fileA, which takes a while to process;
threadB sends fileB, which causes an OOM; the server dies before completing
fileA; and your clients can't tell which file caused the problem.
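
As a rough illustration of the client-side resilience described above, here is a minimal Python sketch that retries with exponential backoff while the server is restarting. It is not taken from Tika or from tika-fork. The names put_file and put_with_retry, the timeout, the attempt count, and the delays are all hypothetical, and the use of HTTP PUT against the endpoint from the original question is an assumption.

```python
import time
import urllib.error
import urllib.request

TIKA_URL = "http://localhost:9998/unpack/all"  # endpoint named in the thread


def put_file(data, url=TIKA_URL, timeout=60.0):
    """Send one file to the server (PUT and the 60 s timeout are assumptions)."""
    req = urllib.request.Request(url, data=data, method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def put_with_retry(data, send=put_file, attempts=4, base_delay=1.0,
                   sleep=time.sleep):
    """Retry connection-level failures, which is how a restart looks to a client."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send(data)
        except urllib.error.HTTPError:
            # The server answered with an HTTP status, so it is up; do not retry.
            raise
        except OSError as exc:
            # Covers refused/reset connections, socket timeouts, and URLError.
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Note that this also re-sends a file after a timeout. If that file is the one that hangs or kills the server, every retry repeats the damage, so the attempt count should probably stay small.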

That said, it's what we have for robustness in tika-server.
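
One possible way for a client to work around the "which file killed the server" problem is sketched below. It is an assumption about client design, not a tika-server feature: record which files were in flight when a request fails, treat them all as suspects, and later re-send the suspects one at a time so that a failure can be attributed. InFlightTracker, isolate, and the parse callable are hypothetical names.

```python
import threading


class InFlightTracker:
    """Remembers which files were being parsed when a request failed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()
        self.suspects = set()

    def run(self, name, parse):
        with self._lock:
            self._in_flight.add(name)
        try:
            return parse(name)
        except OSError:
            # Connection-level failure: any file in flight could be the cause.
            with self._lock:
                self.suspects |= self._in_flight
            raise
        finally:
            with self._lock:
                self._in_flight.discard(name)


def isolate(suspects, parse):
    """Re-send suspects serially; with one request in flight, a crash is attributable."""
    bad = []
    for name in sorted(suspects):
        try:
            parse(name)
        except OSError:
            bad.append(name)
    return bad
```

In the fileA/fileB scenario, both files end up in suspects when the server dies, and the serial pass afterwards should flag only fileB. The cost is that the server is deliberately crashed a second time, so the serial pass needs to wait for each restart.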
