Posted to dev@tika.apache.org by "Ethan Wilansky (Jira)" <ji...@apache.org> on 2022/10/17 13:41:00 UTC

[jira] [Resolved] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

     [ https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Wilansky resolved TIKA-3880.
----------------------------------
    Fix Version/s: 2.5.0
       Resolution: Resolved

Confirmed that the setByteArrayMaxOverride setting is being read and applied to the targeted parser.

> Tika not picking-up setByteArrayMaxOverride from tika-config
> ------------------------------------------------------------
>
>                 Key: TIKA-3880
>                 URL: https://issues.apache.org/jira/browse/TIKA-3880
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: We are running this through docker on a machine with plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with Docker.
>  
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>             Fix For: 2.5.0
>
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">700000000</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that Tika picks up the tika-config.xml on startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml
>  
> However, when Tika processes a very large docx file, I can clearly see that the setting in tika-config is not being applied:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 686,679,089, but the maximum length for this record type is 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this amount of text extraction, and we are fine with the time and memory Tika needs to complete it.
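
The values in the report can be sanity-checked outside Tika. The following is a minimal sketch, not Tika's own config loader: it parses a config string with the JDK's DOM parser, reads the byteArrayMaxOverride param for the OOXMLParser, and compares it with the 686,679,089-byte allocation from the stack trace. The class and method names (ByteArrayMaxOverrideCheck, readOverride, isSufficient) are hypothetical, and the sample config wraps the parser element in a <parsers> element, which is an assumption about the usual tika-config layout rather than something shown in the report.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; it does not use any Tika or POI API.
class ByteArrayMaxOverrideCheck {

    // Array length POI tried to allocate, copied from the stack trace in the report.
    public static final long REQUESTED_LENGTH = 686_679_089L;

    // Sample config for the sketch. The <parsers> wrapper is an assumption.
    public static final String SAMPLE_CONFIG =
          "<properties>"
        + "<parsers>"
        + "<parser class=\"org.apache.tika.parser.microsoft.ooxml.OOXMLParser\">"
        + "<params>"
        + "<param name=\"byteArrayMaxOverride\" type=\"int\">700000000</param>"
        + "</params>"
        + "</parser>"
        + "</parsers>"
        + "</properties>";

    /** Returns the byteArrayMaxOverride value set for parserClass, or -1 if none is found. */
    public static long readOverride(String xml, String parserClass) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // a different parser's settings
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                if ("byteArrayMaxOverride".equals(param.getAttribute("name"))) {
                    return Long.parseLong(param.getTextContent().trim());
                }
            }
        }
        return -1L;
    }

    /** True if the override covers the requested length and still fits in a Java int. */
    public static boolean isSufficient(long override, long requested) {
        return override >= requested && override <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) throws Exception {
        long override = readOverride(
            SAMPLE_CONFIG, "org.apache.tika.parser.microsoft.ooxml.OOXMLParser");
        System.out.println("override=" + override
            + " sufficient=" + isSufficient(override, REQUESTED_LENGTH));
    }
}
```

With the sample config, the configured 700,000,000 is above the 686,679,089 bytes requested and below Integer.MAX_VALUE, so the value itself is adequate once the parser reads it. The check only inspects the XML; it says nothing about whether a running Tika server applies the setting.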



--
This message was sent by Atlassian Jira
(v8.20.10#820010)