Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/10/14 16:29:00 UTC

[jira] [Commented] (TIKA-3880) Tika not picking-up setByteArrayMaxOverride from tika-config

    [ https://issues.apache.org/jira/browse/TIKA-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617859#comment-17617859 ] 

Tim Allison commented on TIKA-3880:
-----------------------------------

Can you share a bit more of the stacktrace?  It looks like you're parsing an OLE2 (doc/ppt/xls) file and not an OOXML file (docx/pptx/xlsx).  The setter on the OOXMLParser triggers a static config on POI, so that should work.

What I can't figure out from the above is that it looks like you are only using the OOXMLParser.  I don't see a default parser, and in fact, your <parser> should be wrapped in a <parsers/> element.

So, my guess is that Tika is silently ignoring this directive...
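
For reference, here is a minimal sketch of what a wrapped config might look like. It assumes you want the default parsers to handle everything else and reuses the OOXMLParser override from your original config. The DefaultParser entry and the parser-exclude are assumptions about your setup, not something taken from your file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: keep the default parsers, but exclude the OOXMLParser
         so the customized instance below is the one that gets used. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- The override from the original report, now inside <parsers>. -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">700000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If the file really is OLE2 rather than OOXML, this override on the OOXMLParser alone may not be the setting that matters, which is why the fuller stacktrace would help.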

> Tika not picking-up setByteArrayMaxOverride from tika-config
> ------------------------------------------------------------
>
>                 Key: TIKA-3880
>                 URL: https://issues.apache.org/jira/browse/TIKA-3880
>             Project: Tika
>          Issue Type: Improvement
>          Components: app
>    Affects Versions: 2.5.0
>         Environment: We are running this through docker on a machine with plenty of memory resources allocated to Docker.
> Docker config: 32 GB, 8 processors
> Host machine: 64 GB, 32 processors
> Our docker-compose configuration is derived from: [https://github.com/apache/tika-docker/blob/master/docker-compose-tika-customocr.yml]
> We are experienced with Docker and are confident that the issue isn't with Docker.
>  
>            Reporter: Ethan Wilansky
>            Priority: Blocker
>
> I have specified this parser parameter in tika-config.xml:
> <properties>
>   <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
>     <params>
>       <param name="byteArrayMaxOverride" type="int">700000000</param>
>     </params>
>   </parser>
> </properties>
>  
> I've also verified that the tika-config.xml is being picked up by Tika on startup:
>   org.apache.tika.server.core.TikaServerProcess Using custom config: /tika-config.xml
>  
> However, when I encounter a very large docx file, I can clearly see that the configuration in tika-config is not being picked up:
>  
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 686,679,089, but the maximum length for this record type is 100,000,000.
> If the file is not corrupt and not large, please open an issue on bugzilla to request 
> increasing the maximum allowable size for this record type.
> You can set a higher override value with IOUtils.setByteArrayMaxOverride()
>  
> I understand that this is a very large docx file. However, we can handle this amount of text extraction and are fine with the time it takes for Tika to perform this extraction and the amount of memory required to complete it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)