Posted to user@tika.apache.org by Nate Findley <na...@zenlok.com> on 2013/12/21 19:02:10 UTC

Tika Server (JAXRS)

I am running Tika Server for processing files via curl requests.  The 
servers start running at 100% CPU after a day or so.  I am wondering if 
there is any information about how to debug this situation.  The wiki is 
pretty thin on information.

Regards,
Nate

Re: Tika Server (JAXRS)

Posted by Rian J Stockbower <rs...@gmail.com>.
I've done some testing of Tika to determine how performant the JAXRS server
is under heavy loads by making 4-8 simultaneous requests as fast as the
webservice would respond, using a variety of test documents. (Some of these
document types were supported by Tika, some weren't.) I have a large text
extraction job coming up--millions of docs--and I needed to determine what
kind of resources I would need. During this testing, I found that CPU usage
was highest when Tika was unwinding exceptions. This CPU usage would
persist long after my ~10GB of documents had been completed.

These stack traces appeared to pile up such that documents would continue
to be processed as requests were made, and Tika would opportunistically
print a stack trace when it wasn't busy responding to other work. These
stack traces would scroll by--often for several minutes--after I had
finished making requests. I didn't dig into the cause because when I began
filtering the document types I was sending it, performance got better and
the number of exceptions thrown dropped dramatically. As you might expect,
this also brought CPU (and memory!) usage down.
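
For anyone who wants to reproduce that kind of load, here is a minimal
Python sketch of the approach described above: a small, fixed number of
workers sending documents as fast as the service answers, with failures
recorded rather than stopping the run. The names run_load and send are
hypothetical; send is whatever function you supply to submit one file to
your server. This is not the script used in the testing above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(paths, send, workers=4):
    """Call send(path) for every path using `workers` concurrent threads.

    Returns (results, elapsed_seconds). Each result is a tuple
    (path, ok, detail), in the same order as `paths`.
    """
    def one(path):
        try:
            return (path, True, send(path))
        except Exception as exc:
            # Record the failure so one bad document does not end the run.
            return (path, False, repr(exc))

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one, paths))
    return results, time.monotonic() - start
```

Counting the entries where ok is False gives a rough idea of how many
documents are failing, which is worth comparing against CPU usage over the
same period.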

With that in mind:

   - Have you captured any console output?
   - How busy is your web service?
   - Are you filtering the document types before they're processed?
   - Can you reproduce the problem in a test environment?
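
On the filtering question, a rough Python sketch of what filtering before
sending could look like is below. The extension allow-list and the names
partition and extract_text are hypothetical, and the URL assumes a Tika
server listening on localhost:9998 that accepts a PUT to /tika, so adjust
all of these to your own setup. Filtering on file extension is only an
approximation of the real document type.

```python
import urllib.request
from pathlib import Path

# Hypothetical allow-list; use the formats your own testing shows are safe.
ALLOWED_EXTENSIONS = {".pdf", ".doc", ".docx", ".txt", ".html"}


def partition(paths, allowed=ALLOWED_EXTENSIONS):
    """Split paths into (to_send, skipped) by lower-cased file extension."""
    to_send, skipped = [], []
    for p in paths:
        if Path(p).suffix.lower() in allowed:
            to_send.append(p)
        else:
            skipped.append(p)
    return to_send, skipped


def extract_text(path, url="http://localhost:9998/tika"):
    """PUT one file to the server and return the response body as text.

    The URL is an assumption about the deployment, not a fixed value.
    """
    data = Path(path).read_bytes()
    req = urllib.request.Request(url, data=data, method="PUT")
    req.add_header("Accept", "text/plain")
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Logging the skipped list is useful too, since it shows which document types
never reach the server at all.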

 -Rian

