Posted to user@tika.apache.org by Egbert van der Wal <ew...@pointpro.nl> on 2016/11/04 08:18:09 UTC

Tika-server: shutdown on exceptions (esp. OOME)?

Hi,

In a web crawling application, we're using Tika to parse binary files 
such as PDFs that the crawler encounters and to extract text from them.

However, due to the wide variety of garbage encountered on the internet, 
this isn't always successful, and sometimes Tika throws exceptions as a 
result. One example is the OutOfMemoryError I reported, which should be 
fixed in the upcoming release: 
https://issues.apache.org/jira/browse/TIKA-2045

This used to crash the entire application. I've recently isolated the 
parsing by running Tika-server and sending the documents to it over 
HTTP. However, when I send such broken documents, the OutOfMemoryError 
is still thrown in the Tika server, and the server does not terminate. 
It keeps running, but it either runs *very* slowly, doesn't accept new 
connections, or doesn't respond to them. The usual 'undetermined state' 
after an OOME, I suppose.

Anyway, I'd like to fix this by checking regularly whether the server 
is still running and restarting it if necessary. But for that to work, 
I need the server to shut down when an OOME occurs.
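
To make the restart half of this concrete, here is a minimal sketch of 
the kind of watchdog I have in mind. It is only an illustration: the 
function name supervise, its parameters, and the server command in the 
usage comment are hypothetical, and it only helps once the server 
process actually exits after an OOME, which is the open question.

```python
import subprocess
import time


def supervise(cmd, max_restarts=None, delay=1.0):
    """Run cmd and start it again each time it exits.

    Returns the list of exit codes seen once max_restarts is exceeded;
    with max_restarts=None it keeps restarting forever.
    """
    exit_codes = []
    while True:
        # subprocess.call blocks until the child process has exited.
        exit_codes.append(subprocess.call(cmd))
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        # Short pause so a server that dies at startup is not respawned
        # in a tight loop.
        time.sleep(delay)


# Hypothetical usage (the jar name is an assumption, not a real path):
# supervise(["java", "-jar", "tika-server.jar"])
```

A watchdog like this only notices a process that has exited, so a 
server that hangs after an OOME would still need a separate liveness 
check over HTTP.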

Is there anything I can use to make this happen? Do I need to change the 
code, or can this be configured through a config file of some sort?

Thanks!

Egbert van der Wal