Posted to dev@tika.apache.org by "Jana, Kumar Raja" <kj...@ptc.com> on 2010/06/28 15:49:58 UTC

Limiting the extracted content

Hi,

We use Apache Tika in our application to extract content before sending
it to Solr for indexing. Some of our documents are quite large (over
150 MB in size, with more than 30 MB of text-only content). Processing
such documents often results in out-of-memory errors at runtime. Of
course, increasing the max heap does resolve this issue, and another
option we use is to index in chunks of 5 MB.

On careful analysis, we realized that most of our keywords lie in the
first 1-2 MB of such documents, and indexing that chunk is sufficient
for our needs. Is there any provision in the Tika APIs to extract only
the first 1 or 2 MB (customizable) of the content instead of parsing the
entire document? If not, can someone point me to the part of the code I
could modify to implement this?
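
To make the request concrete, here is a rough sketch of the behaviour we
are after, written against a plain java.io.Reader rather than any Tika
API. The names FirstChunk and readPrefix are made up for illustration,
and the limit is counted in characters, not bytes. Ideally the parser
itself would stop early instead of us truncating the text afterwards.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class FirstChunk {
    // Hypothetical helper, not part of Tika: returns at most maxChars
    // characters from the start of the reader and leaves the rest unread.
    static String readPrefix(Reader reader, int maxChars) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[8192];
        while (out.length() < maxChars) {
            // Never ask for more than the remaining budget.
            int want = Math.min(buf.length, maxChars - out.length());
            int n = reader.read(buf, 0, want);
            if (n < 0) {
                break; // end of input reached before the limit
            }
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "keyword-rich introduction followed by a very long tail";
        // Keep only the first 12 characters of the extracted text.
        System.out.println(readPrefix(new StringReader(text), 12));
    }
}
```

In our case the limit would be a configurable value in the 1-2 MB range,
and the reader would be whatever stream of extracted text Tika hands
back.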

Thanks,

Kumar