Posted to dev@tika.apache.org by "Ethan Wilansky (Jira)" <ji...@apache.org> on 2022/10/20 19:11:00 UTC
[jira] [Closed] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Wilansky closed TIKA-3890.
--------------------------------
Fix Version/s: 2.5.0
Resolution: Fixed
> Identifying an efficient approach for getting page count prior to running an extraction
> ---------------------------------------------------------------------------------------
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
> Issue Type: Improvement
> Components: app
> Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit
> Reporter: Ethan Wilansky
> Priority: Blocker
> Fix For: 2.5.0
>
>
> Tika is doing a great job with text extraction until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of text pages. Unfortunately, we don't have an efficient way to determine the page count before calling the /tika or /rmeta endpoints; we either get back an array allocation error, or we must set byteArrayMaxOverride to a large number so that the text or metadata containing the page count is returned. Returning a result other than the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" [http://localhost:9998/rmeta/ignore]}}
> {quote}{{with the configuration:}}
> {{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
> {{<properties>}}
> {{  <parsers>}}
> {{    <parser class="org.apache.tika.parser.DefaultParser">}}
> {{      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    </parser>}}
> {{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
> {{      <params>}}
> {{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
> {{      </params>}}
> {{    </parser>}}
> {{  </parsers>}}
> {{  <server>}}
> {{    <params>}}
> {{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
> {{      <forkedJvmArgs>}}
> {{        <arg>-Xms2000m</arg>}}
> {{        <arg>-Xmx5000m</arg>}}
> {{      </forkedJvmArgs>}}
> {{    </params>}}
> {{  </server>}}
> {{</properties>}}
> {quote}
> returns {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I don't configure {{byteArrayMaxOverride}}, I get this exception in just over a second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer these questions?
> 1. Will other extractable file types that don't use the OfficeParser also throw the same array allocation error for very large text extractions?
> 2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate the lines or pages of extractable content in a file before sending it for extraction? It doesn't appear that /rmeta with the /ignore path param significantly improves efficiency over calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.
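One cheap pre-check worth considering for question 3 (a sketch, not part of Tika's API): OOXML formats record a page-count hint in the docProps/app.xml part, so a .docx can be screened by decompressing that single zip entry instead of parsing the whole document body. The {{<Pages>}} value is written by the authoring application and can be stale or absent, so treat it as a heuristic for rejecting obviously oversized files, not an exact count. The function name below is hypothetical.

```python
import re
import zipfile


def docx_page_count(path):
    """Read the page-count hint from an OOXML file's docProps/app.xml.

    Only one zip entry is decompressed, so this is cheap even for a
    multi-megabyte .docx. Returns None when the part or the <Pages>
    element is missing; the value is a hint from the authoring app,
    not a guaranteed page count.
    """
    with zipfile.ZipFile(path) as z:
        try:
            xml = z.read("docProps/app.xml").decode("utf-8")
        except KeyError:
            return None  # no extended-properties part in this archive
    match = re.search(r"<Pages>(\d+)</Pages>", xml)
    return int(match.group(1)) if match else None
```

A caller could reject files above a page threshold (e.g. 1,000) before ever POSTing them to /tika or /rmeta, and fall back to the normal extraction path when the hint is absent.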
--
This message was sent by Atlassian Jira
(v8.20.10#820010)