Posted to dev@tika.apache.org by "Ethan Wilansky (Jira)" <ji...@apache.org> on 2022/10/14 15:44:00 UTC
[jira] [Created] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config
Ethan Wilansky created TIKA-3880:
------------------------------------
Summary: Tika not picking-up setByteArrayMaxOverride from tika-config
Key: TIKA-3880
URL: https://issues.apache.org/jira/browse/TIKA-3880
Project: Tika
Issue Type: Improvement
Components: app
Affects Versions: 2.5.0
Environment: We run Tika in Docker on a machine with plenty of memory allocated to Docker.
Docker config: 32 GB, 8 processors
Host machine: 64 GB, 32 processors
Our docker-compose configuration is derived from: [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
We are experienced with Docker and are confident that the issue isn't with Docker.
Reporter: Ethan Wilansky
I have specified this parser parameter in tika-config.xml:
<properties>
  <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
    <params>
      <param name="byteArrayMaxOverride" type="int">700000000</param>
    </params>
  </parser>
</properties>
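As a sanity check on the file itself, the parameter can be read back with the JDK's XML parser alone. The following is a minimal sketch, not Tika's own config loader: the class name ConfigParamCheck and the method readParams are hypothetical, and the XML is inlined as a string rather than read from /tika-config.xml. It only confirms that the fragment is well-formed and carries the intended value; it says nothing about whether Tika applies it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: reads <param> entries from a config fragment.
public class ConfigParamCheck {

    /** Returns name -> value for every <param> under the <parser> with the given class attribute. */
    static Map<String, String> readParams(String xml, String parserClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // a different parser entry
            }
            NodeList paramNodes = parser.getElementsByTagName("param");
            for (int j = 0; j < paramNodes.getLength(); j++) {
                Element param = (Element) paramNodes.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        // Same structure as the fragment in this report, inlined for the example.
        String xml = "<properties>"
                + "<parser class=\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\">"
                + "<params>"
                + "<param name=\"byteArrayMaxOverride\" type=\"int\">700000000</param>"
                + "</params>"
                + "</parser>"
                + "</properties>";
        Map<String, String> params =
                readParams(xml, "org.apache.tika.parser.microsoft.ooxml.OOXMLParser");
        System.out.println(params); // {byteArrayMaxOverride=700000000}

        // The configured value fits in an int and exceeds the length from the error below.
        int configured = Integer.parseInt(params.get("byteArrayMaxOverride"));
        System.out.println(configured > 686_679_089); // true
    }
}
```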
I've also verified that Tika picks up tika-config.xml on startup:
org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml
However, when Tika processes a very large docx file, the exception shows that the byteArrayMaxOverride setting from tika-config.xml is not being applied:
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 686,679,089, but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
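The numbers in this message are what indicate that the override never took effect: the limit reported is 100,000,000, not the configured 700,000,000. The sketch below is a simplified model of such a length check, not POI's actual code. The class ArrayLimitModel and its members are hypothetical, and it assumes that a positive override replaces the per-record maximum.

```java
// Hypothetical model of an allocation limit check; not the POI implementation.
public class ArrayLimitModel {

    /** Stand-in for the override value; 0 or less means "not set" (assumption). */
    static int maxOverride = 0;

    /** The limit that applies: the override when set, otherwise the record-type maximum. */
    static int effectiveLimit(int recordTypeMax) {
        return maxOverride > 0 ? maxOverride : recordTypeMax;
    }

    /** True when an array of the requested length would pass the check. */
    static boolean allowed(long length, int recordTypeMax) {
        return length >= 0 && length <= effectiveLimit(recordTypeMax);
    }

    public static void main(String[] args) {
        long requested = 686_679_089L;   // length from the exception message
        int recordTypeMax = 100_000_000; // maximum from the exception message

        // Override unset: the request exceeds the record-type maximum.
        System.out.println(allowed(requested, recordTypeMax)); // false

        // Override set to the configured value: the request fits.
        maxOverride = 700_000_000;
        System.out.println(allowed(requested, recordTypeMax)); // true
    }
}
```

Under this model, an error that cites 100,000,000 as the maximum can only occur while the override is unset, which matches the behavior described here.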
I understand that this is a very large docx file. However, we can handle this amount of text extraction and are fine with the time and memory that Tika needs to complete it.