Posted to user@tika.apache.org by Gaurav Sehgal <gs...@gmail.com> on 2018/05/16 12:59:37 UTC

Tika Performance in 1.9

Hello,

I am using Tika 1.9 and want to improve its performance for the
following document types:

1. PDF
2. Microsoft Word / Excel
3. ZIP

For PDF, I tried tuning the PDFParserConfig by setting
setUseNonSequentialParser to true, which according to the documentation
should improve performance, but unfortunately I did not see any improvement.


Are there any other tunables I can use to improve performance for the
above document types?

*Any guidance will be greatly appreciated.*

Regards,
Gaurav

Re: Tika Performance in 1.9

Posted by Gaurav Sehgal <gs...@gmail.com>.
1. I am using Java 8.
2. The following are the sizes of the documents our system processed over
a period of 2 hours:

file sizes greater than 5 MB = 26
file sizes from 1 MB to 5 MB = 138
file sizes from 500 KB to 1 MB = 134
file sizes from 100 KB to 500 KB = 598
file sizes less than 100 KB = 5000

3. Unfortunately, we cannot share production data, as this is part of our
agreement with the customer.

4. The product is an email archival system, which archives user data in
near real time. While archiving, it also extracts the text and stores it
in Solr/Elasticsearch so that users can search it. We therefore run this
extraction throughout the day. We process around 7 to 8 million emails a
day.
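
As a rough back-of-the-envelope reading of the figures above, the sketch below totals the reported size buckets and converts the daily email volume into an average per-second rate. The class and method names are hypothetical and not part of Tika. It assumes the load is spread evenly over 24 hours, which a real archive is unlikely to match exactly, and note that the bucket counts are files seen in a 2-hour window while the daily figure counts emails, so the two are not directly comparable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WorkloadSummary {

    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** File counts per size bucket, as reported for a 2-hour window. */
    static Map<String, Integer> reportedBuckets() {
        Map<String, Integer> buckets = new LinkedHashMap<>();
        buckets.put("> 5 MB", 26);
        buckets.put("1 MB to 5 MB", 138);
        buckets.put("500 KB to 1 MB", 134);
        buckets.put("100 KB to 500 KB", 598);
        buckets.put("< 100 KB", 5000);
        return buckets;
    }

    /** Sums the counts of all buckets. */
    static long totalFiles(Map<String, Integer> buckets) {
        long total = 0;
        for (int count : buckets.values()) {
            total += count;
        }
        return total;
    }

    /** Share of one bucket as a percentage of the total. */
    static double sharePercent(long count, long total) {
        return 100.0 * count / total;
    }

    /** Average rate per second, assuming an even spread over 24 hours. */
    static double averagePerSecond(long perDay) {
        return (double) perDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        Map<String, Integer> buckets = reportedBuckets();
        long total = totalFiles(buckets);
        System.out.println("Files in the 2-hour sample: " + total);
        for (Map.Entry<String, Integer> e : buckets.entrySet()) {
            System.out.printf("%-18s %5d  %5.1f%%%n",
                    e.getKey(), e.getValue(), sharePercent(e.getValue(), total));
        }
        System.out.printf("Average at 7M emails/day: %.1f per second%n",
                averagePerSecond(7_000_000L));
        System.out.printf("Average at 8M emails/day: %.1f per second%n",
                averagePerSecond(8_000_000L));
    }
}
```

On these figures, files under 100 KB make up about 85% of the sample, and 7 to 8 million emails a day averages out to roughly 81 to 93 emails per second, so per-document overhead on small files probably matters more than worst-case time on the few large ones.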

Regards,
Gaurav



On Wed, May 16, 2018 at 8:09 PM, John Patrick <nh...@gmail.com>
wrote:

> What Java version are you using?
> What size documents are you using?
> Do you have sample files?
> How frequently are you doing the conversion? Performance sometimes
> improves after the first document but is always slow for the first
> document.
>
> I had issues myself previously; upgrading either Java or Tika to the
> latest version sometimes improved performance.
>
> Compare the same version with and without the setting. If you compare
> one version with it and another version without it, you are not
> comparing like for like, so other factors might come into play.

Re: Tika Performance in 1.9

Posted by John Patrick <nh...@gmail.com>.
What Java version are you using?
What size documents are you using?
Do you have sample files?
How frequently are you doing the conversion? Performance sometimes
improves after the first document but is always slow for the first
document.

I had issues myself previously; upgrading either Java or Tika to the
latest version sometimes improved performance.

Compare the same version with and without the setting. If you compare one
version with it and another version without it, you are not comparing like
for like, so other factors might come into play.
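
The two points above (a slow first document, and comparing like for like) can be sketched as a small timing harness. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the two `Function<byte[], String>` extractors are trivial stand-ins for "the same Tika version with the setting off" and "with the setting on", not real Tika calls. It runs some unmeasured warm-up passes first, then reports the median of several measured passes over the same documents.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class LikeForLikeTiming {

    // Written on every call so the JIT cannot discard the extraction work.
    private static volatile long sink;

    /** Runs the extractor over all documents once; returns elapsed nanoseconds. */
    static long timeOnce(Function<byte[], String> extractor, List<byte[]> docs) {
        long start = System.nanoTime();
        long chars = 0;
        for (byte[] doc : docs) {
            chars += extractor.apply(doc).length();
        }
        long elapsed = System.nanoTime() - start;
        sink = chars;
        return elapsed;
    }

    /** Discards warm-up passes, then returns the median of the measured passes. */
    static long medianAfterWarmup(Function<byte[], String> extractor, List<byte[]> docs,
                                  int warmupRuns, int measuredRuns) {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be at least 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            timeOnce(extractor, docs); // the first passes are typically the slowest
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            samples[i] = timeOnce(extractor, docs);
        }
        Arrays.sort(samples);
        return samples[measuredRuns / 2];
    }

    public static void main(String[] args) {
        // Placeholder documents; real tests would load representative files.
        List<byte[]> docs = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            docs.add(("sample document " + i).getBytes(StandardCharsets.UTF_8));
        }

        // Hypothetical stand-ins for one library version with a setting off and on.
        Function<byte[], String> settingOff = b -> new String(b, StandardCharsets.UTF_8);
        Function<byte[], String> settingOn = b -> new String(b, StandardCharsets.UTF_8).trim();

        long off = medianAfterWarmup(settingOff, docs, 5, 11);
        long on = medianAfterWarmup(settingOn, docs, 5, 11);
        System.out.printf("setting off: %d ns, setting on: %d ns%n", off, on);
    }
}
```

Changing only one factor per run (the setting, the Tika version, or the Java version) and keeping the document set fixed is what makes the two medians comparable.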


