Posted to dev@tika.apache.org by Nicholas DiPiazza <ni...@gmail.com> on 2020/10/31 18:43:24 UTC

When calling /rmeta/text, is there a way to time box the request to a certain amount of time?

I have a massive number of documents that I need to parse through Apache
Tika server.

Prior to making the switch to Tika server, I used a project I created
myself that forked Tika VMs and sent work to those VMs directly through
sockets.

This was OK but super complicated, so I chose to switch to the Tika Jetty
server for simplicity's sake.

Works great for the most part. But one feature I had before was that I
could say "If I don't get a result within MAX_PARSE_TIMEOUT_MS, then stop
parsing at that moment and return the bytes we managed to get up to that
point."
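
As a rough illustration of the time-box behavior described above (not the
forked-VM project itself, and not something Tika server provides), here is
a minimal Python sketch. The names parse_with_time_box and slow_parser are
hypothetical, and the "parser" is just a stand-in generator that yields
chunks of extracted bytes. Note the sketch only stops collecting output at
the deadline; it does not kill the worker, which a real forked process
could do.

```python
import threading
import time
from typing import Callable, Iterable, List


def parse_with_time_box(
    parse: Callable[[], Iterable[bytes]], max_parse_timeout_ms: int
) -> bytes:
    """Return whatever the parser produced before the time box expired."""
    collected: List[bytes] = []
    lock = threading.Lock()
    stopped = threading.Event()

    def worker() -> None:
        for chunk in parse():
            with lock:
                # Once the caller has given up, discard any late output.
                if stopped.is_set():
                    return
                collected.append(chunk)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    # Wait for the parse to finish, but no longer than the time box.
    thread.join(max_parse_timeout_ms / 1000.0)

    with lock:
        stopped.set()
        return b"".join(collected)


def slow_parser() -> Iterable[bytes]:
    """Hypothetical stand-in for a parse that hangs partway through."""
    yield b"page 1 "
    time.sleep(0.05)
    yield b"page 2 "
    time.sleep(2.0)  # simulated hang
    yield b"page 3"


if __name__ == "__main__":
    # Prints the bytes collected before the 300 ms deadline.
    print(parse_with_time_box(slow_parser, 300))
```

The point of the design is that a timeout yields a partial result rather
than no result.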

This is because with the massive number of documents I need to parse, I
cannot afford to have any parse hang longer than a certain amount of time.

With the /rmeta/text endpoint, we recently added the ability to send a
writeLimit, where we will stop parsing after we reach that number of bytes.

Can we similarly add something that can "stop parsing after X ms have
elapsed"?

Currently, I'm having to do this through HTTP socket timeouts, but the
problem is that it is all or nothing: a request that times out returns
nothing at all. This leads to huge gaps in my results, because when I'm
pounding the living crap out of Tika, these timeouts become more and more
likely and many of the docs hit them.
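
For reference, the socket-timeout workaround looks roughly like the Python
sketch below. Several things here are assumptions rather than facts from
this thread: the base URL (localhost on port 9998), sending the limit as an
HTTP header literally named "writeLimit", and the helper names
build_rmeta_request and fetch_rmeta_text, which are hypothetical. The
sketch shows the all-or-nothing problem: on a timeout the caller gets None,
even if Tika had already extracted most of the text.

```python
import json
import socket
import urllib.error
import urllib.request
from typing import Any, Optional


def build_rmeta_request(
    data: bytes,
    base_url: str = "http://localhost:9998",  # assumed server address
    write_limit: Optional[int] = None,
) -> urllib.request.Request:
    """Build a PUT request for /rmeta/text carrying the document bytes."""
    headers = {"Accept": "application/json"}
    if write_limit is not None:
        # Assumption: the write limit is passed as a "writeLimit" header.
        headers["writeLimit"] = str(write_limit)
    return urllib.request.Request(
        base_url + "/rmeta/text", data=data, headers=headers, method="PUT"
    )


def fetch_rmeta_text(
    data: bytes,
    base_url: str = "http://localhost:9998",
    timeout_s: float = 30.0,
    write_limit: Optional[int] = None,
) -> Optional[Any]:
    """Return the parsed JSON response, or None on any timeout or error."""
    request = build_rmeta_request(data, base_url, write_limit)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return json.loads(response.read().decode("utf-8"))
    except (socket.timeout, urllib.error.URLError):
        # All or nothing: a socket timeout (or any transport/HTTP error)
        # discards whatever the server had managed to parse so far.
        return None
```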

-Nicholas