Posted to dev@tika.apache.org by Nicholas DiPiazza <ni...@gmail.com> on 2020/11/23 15:05:08 UTC

How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

I am attempting to parse tens of millions of office documents with Tika: PDFs,
DOCs, Excel files, XML, etc. A wide assortment of types.

Throughput is very important. I need to be able to parse these files in a
reasonable amount of time, but at the same time accuracy is also pretty
important. I hope to have less than 10% of documents fail to parse. (By fail
I mean fail due to Tika stability, such as a timeout while parsing, not fail
due to the document itself.)

My question: how do I configure Tika Server in a containerized environment
to maximize throughput?

My environment:

   - I am using OpenShift.
   - Each Tika parsing pod has *CPU: 2 cores (request and limit)*, and Memory:
   *8 GiB request to 10 GiB limit*.
   - I have 10 Tika parsing pod replicas.

On each pod, I run a Java program with 8 parse threads.

Each thread:

   - Starts a single tika-server process in spawn-child mode (a rough launch
   sketch follows this list).
      - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000
      -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500
      -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures -enableFileUrl
   - Continuously grabs a file from the files-to-fetch queue and sends it to
   the Tika server, stopping when there are no more files to parse.
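Roughly, the launch looks like the sketch below. The jar path, port argument,
and log location are placeholders for illustration, not my real values:

```java
import java.io.File;
import java.io.IOException;

public class TikaServerLauncher {

    /**
     * Spawns one tika-server process in spawn-child mode with the arguments
     * listed above. Jar path, port, and log path are hypothetical.
     */
    public static Process launch(int port) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-jar", "/opt/tika/tika-server.jar",
                "-s", "-spawnChild",
                "-p", String.valueOf(port),
                "-maxChildStartupMillis", "120000",
                "-pingPulseMillis", "500",
                "-pingTimeoutMillis", "30000",
                "-taskPulseMillis", "500",
                "-taskTimeoutMillis", "120000",
                "-JXmx512m",
                "-enableUnsecureFeatures", "-enableFileUrl");
        pb.redirectErrorStream(true);                               // merge stderr into stdout
        pb.redirectOutput(new File("/tmp/tika-" + port + ".log"));  // capture server logs
        return pb.start();
    }
}
```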

Each of these files is stored locally on the pod in a buffer, so the local
file optimization (the fileUrl header) is used.

The Tika web service endpoint being used is:

Endpoint: `/rmeta/text`
Method: `PUT`
Headers:
   - writeLimit = 32000000
   - maxEmbeddedResources = 0
   - fileUrl = file:///path/to/file

Files are no larger than 100 MB, and the maximum amount of text Tika will
extract (writeLimit) is 32 MB.
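In code, each request looks roughly like this with Apache HttpClient 4.x. The
host/port and response handling are simplified, and the client parameter is
meant to be the shared pooled client I mention further down:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

public class RmetaClient {

    public static String parseLocalFile(CloseableHttpClient client, String localPath)
            throws Exception {
        // PUT to the tika-server recursive metadata endpoint, text handler.
        HttpPut put = new HttpPut("http://localhost:9998/rmeta/text");

        // No request body: because -enableFileUrl is set and the fileUrl
        // header is provided, the server reads the file directly from disk.
        put.setHeader("fileUrl", "file://" + localPath);
        put.setHeader("writeLimit", "32000000");     // cap extracted text at ~32 MB
        put.setHeader("maxEmbeddedResources", "0");  // skip embedded documents

        try (CloseableHttpResponse response = client.execute(put)) {
            return EntityUtils.toString(response.getEntity()); // JSON array of metadata + text
        }
    }
}
```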

Each pod parses about 370,000 documents per day. I've been experimenting with
a ton of different combinations of settings.
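For anyone sanity checking along, those numbers work out to roughly:

    370,000 docs/day ÷ 86,400 s/day ≈ 4.3 docs/s per pod
    4.3 docs/s ÷ 8 threads ≈ 0.54 docs/s per thread, i.e. ~1.9 s per document
    10 pods × 370,000 docs/day ≈ 3.7 million docs/day across the cluster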

I previously tried the actual Tika "ForkParser", but the performance was far
worse than spawning Tika servers. That is why I am using Tika Server.
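For context, by ForkParser I mean usage along these lines. This is a minimal
single-file sketch, not my actual code; the pool size and file path are
illustrative:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserExample {

    public static void main(String[] args) throws Exception {
        // ForkParser runs the actual parsing in child JVMs, isolating the
        // caller from parser crashes, hangs, and memory leaks.
        ForkParser parser = new ForkParser(
                ForkParserExample.class.getClassLoader(), new AutoDetectParser());
        parser.setPoolSize(8); // up to 8 child JVMs, matching my 8 parse threads

        try {
            Path file = Paths.get("/path/to/some-file.docx");               // placeholder path
            Metadata metadata = new Metadata();
            BodyContentHandler handler = new BodyContentHandler(32_000_000); // ~32 MB text cap

            try (InputStream stream = Files.newInputStream(file)) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }
            System.out.println("Extracted " + handler.toString().length() + " characters");
        } finally {
            parser.close(); // shuts down the child JVM pool
        }
    }
}
```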

I don't hate these performance results, but I feel like I'd better reach out
and make sure there isn't someone out there who would sanity check my numbers
and say, "Whoa, that's awful performance, you should be getting xyz like me!"

Is anyone doing something similar? If so, what settings did you end up
settling on?

Also, I'm wondering whether Apache HttpClient could be adding any overhead
here when I call my Tika Server /rmeta/text endpoint. I am using a shared
connection pool. Would there be any benefit in, say, using a unique
HttpClients.createDefault() for each thread instead of sharing a connection
pool between the threads?
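For reference, the shared pool is set up along these lines (the pool sizes
here are illustrative, not my real values):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SharedHttpClient {

    // One pooling connection manager shared by all 8 parse threads.
    private static final PoolingHttpClientConnectionManager POOL =
            new PoolingHttpClientConnectionManager();

    static {
        POOL.setMaxTotal(16);          // total connections across all tika-server routes
        POOL.setDefaultMaxPerRoute(2); // each thread targets its own server/port (its own route)
    }

    public static final CloseableHttpClient CLIENT = HttpClients.custom()
            .setConnectionManager(POOL)
            .build();
}
```

If the per-route limit is too small for the request pattern, threads block
waiting for a connection, which can look like client overhead.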


I cross-posted the question here as well:
https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
Yes, tika-server is the way to go in the long run, as discussed in a recent
thread on the user list. I hope I will have time in the future to migrate to
it and finally get rid of the JAR hell problems...

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Posted by Nicholas DiPiazza <ni...@gmail.com>.
I created a Tika ForkParser example that I want to add to the documentation
as well: https://github.com/nddipiazza/tika-fork-parser-example

Once your fixes are submitted, we should update this example with
multi-threading.

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Posted by Nicholas DiPiazza <ni...@gmail.com>.
Hey Luis,

It is related, because after your fixes I might be able to gain a significant
performance advantage by switching to ForkParser. I would make great use of an
example from someone who has set up a multi-threaded ForkParser processing
program that can gracefully handle the huge onslaught that is my use case. But
at this point, I doubt I'll switch away from Tika Server anyway, because I
invested some time creating a wrapper around it and it is performing very
well.

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
Not what you asked but related :)

Luis

Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
I've made a few improvements to ForkParser performance in an internal fork.
I will try to contribute them upstream...
