Posted to user@tika.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2021/12/06 09:26:08 UTC

Re: XMLReaderUtils Contention

Hi Cristian, hi Tim,

>> org.apache.tika.utils.XMLReaderUtils Contention waiting for a
>> SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE

I regularly count these messages in the log files of a large and highly
concurrent web crawl with 160 threads fetching data and performing
content type detection (and parsing for some setups) in parallel.

I've upgraded to Tika 2.1.0 recently but haven't seen a significant
increase in the counts of these warnings.

In addition, all warnings are observed during the first 2-5 minutes of
fetching one batch (a.k.a. "segment" in Nutch's terminology). Later
the warnings disappear. I'm trying to get rid of them by starting
the threads step by step with a 0.5-second delay. First results
look promising.
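
A minimal sketch of such a staggered start, in plain Java. This is not
Nutch's fetcher code: the class name StaggeredStart, the method
startStaggered and the thread names are made up for illustration, and
the Runnable stands in for whatever each fetcher thread actually does.

```java
import java.util.ArrayList;
import java.util.List;

public class StaggeredStart {

    /**
     * Starts threadCount threads running task, sleeping delayMillis
     * between consecutive starts, and returns the started threads.
     */
    public static List<Thread> startStaggered(int threadCount, long delayMillis, Runnable task)
            throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(task, "fetcher-" + i); // hypothetical thread name
            t.start();
            threads.add(t);
            if (i < threadCount - 1) {
                Thread.sleep(delayMillis); // spread the starts out over time
            }
        }
        return threads;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder task; a real fetcher thread would loop over its queue here.
        Runnable fetchLoop = () ->
                System.out.println(Thread.currentThread().getName() + " started");
        for (Thread t : startStaggered(4, 500, fetchLoop)) {
            t.join();
        }
    }
}
```

With 160 threads and a 0.5-second delay the ramp-up takes roughly 80
seconds, which is short compared to the 2-5 minutes in which the
warnings show up.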

Given the setup:
- XMLReaderUtils.POOL_SIZE = 80
- 160 threads
- 2-4 fetcher tasks (processes) on each machine
- 4-8 vCPUs per machine
... theoretically, there shouldn't be any such issues, as the number of CPUs
is much smaller than the pool size. Nevertheless, during the first
minutes the SAX parser usage is somehow imbalanced. If all threads
start fetching at the same time, they continue with the content and
charset detection (possibly also XML parsing) at similar times. That's
also how it looks in the logs: pool overflows do not happen often,
but when they do, there can be up to 160 warnings in a single second.
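
To illustrate the burst effect, here is a toy model in plain Java. It
is not Tika's implementation: a Semaphore stands in for the parser
pool, the names PoolBurstDemo and countContended are made up, and the
assumption is the worst case, namely that every thread keeps its
borrowed parser until all other threads have asked for one.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolBurstDemo {

    /** Returns how many of the simultaneous borrowers find the pool empty. */
    public static int countContended(int poolSize, int threads) throws InterruptedException {
        Semaphore pool = new Semaphore(poolSize);             // stand-in for the parser pool
        CountDownLatch allTried = new CountDownLatch(threads);
        AtomicInteger contended = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                boolean borrowed = pool.tryAcquire();         // non-blocking borrow attempt
                if (!borrowed) {
                    contended.incrementAndGet();              // would be a logged warning
                }
                allTried.countDown();
                try {
                    allTried.await();                         // hold the parser until all have asked
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    if (borrowed) {
                        pool.release();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return contended.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countContended(80, 160)); // prints 80
    }
}
```

In this model the number of contended requests depends only on the
pool size and the number of threads asking at the same moment, not on
the number of CPUs, since a thread can hold a parser while it is not
running.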

Best,
Sebastian

On 11/29/21 21:25, Tim Allison wrote:
> Hi Cristian,
>   A couple of things come to mind...
> 
> 1) Are you seeing any parse exceptions that start with "Waited more
> than 5 minutes for a SAXParser"... that would be a bad sign.
> 2) I'm frankly not clear on how jax-rs does threading (if per request,
> etc).  The default pool size is 10 so if there are more than 10
> threads needing to parse XML at once, then you'll have contention.  You can
> bump up that number with something like the following in your
> tika-config.xml file
> 
> <properties>
>     <xml-reader-utils maxEntityExpansions="5" poolSize="33"/>
> </properties>
> 
> I don't think we made many changes to that area of the code between 1x
> and 2x so I'm surprised that this is new, but I can look into it a bit
> further.
> 
> Best,
> 
>           Tim
> 
> On Thu, Nov 25, 2021 at 11:11 AM Cristian Zamfir <cr...@cyberhaven.com> wrote:
>>
>> Hi,
>>
>> I am getting this error quite often with version 2.1.0-full:
>>
>> org.apache.tika.utils.XMLReaderUtils Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE
>>
>> I googled and it looks like it may be harmless, but I am not sure if that is still the case https://stackoverflow.com/questions/64333788/org-apache-tika-utils-xmlreaderutils-acquiresaxparser-warning-contention-waitin
>>
>> Thanks,
>> Cristi
>>
>>