Posted to dev@tika.apache.org by "Ethan Wilansky (Jira)" <ji...@apache.org> on 2022/10/20 19:11:00 UTC
[jira] [Closed] (TIKA-3890) Identifying an efficient approach for getting page count prior to running an extraction
[ https://issues.apache.org/jira/browse/TIKA-3890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Wilansky closed TIKA-3890.
--------------------------------
Fix Version/s: 2.5.0
Resolution: Fixed
> Identifying an efficient approach for getting page count prior to running an extraction
> ---------------------------------------------------------------------------------------
>
> Key: TIKA-3890
> URL: https://issues.apache.org/jira/browse/TIKA-3890
> Project: Tika
> Issue Type: Improvement
> Components: app
> Affects Versions: 2.5.0
> Environment: OS: OSX, Processor: M1 (ARM), RAM: 64GB, CPU: 8 cores
> Docker container with 5.5GB reserved memory, 6GB limit
> Tika config w/ 2GB reserved memory, 5GB limit
> Reporter: Ethan Wilansky
> Priority: Blocker
> Fix For: 2.5.0
>
>
> Tika is doing a great job with text extraction until we encounter an Office document with an unreasonably large number of pages of extractable text, for example a Word document containing thousands of text pages. Unfortunately, we don't have an efficient way to determine the page count before calling the /tika or /rmeta endpoints; we either get back an array allocation error, or we must set byteArrayMaxOverride to a large number so that the text or metadata containing the page count is returned. Returning a result other than the array allocation error can take significant time.
> For example, this call:
> {{curl -T ./8mb.docx -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" [http://localhost:9998/rmeta/ignore]}}
> {quote}{{with the configuration:}}
> {{<?xml version="1.0" encoding="UTF-8" standalone="no"?>}}
> {{<properties>}}
> {{  <parsers>}}
> {{    <parser class="org.apache.tika.parser.DefaultParser">}}
> {{      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>}}
> {{      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>}}
> {{    </parser>}}
> {{    <parser class="org.apache.tika.parser.microsoft.OfficeParser">}}
> {{      <params>}}
> {{        <param name="byteArrayMaxOverride" type="int">175000000</param>}}
> {{      </params>}}
> {{    </parser>}}
> {{  </parsers>}}
> {{  <server>}}
> {{    <params>}}
> {{      <taskTimeoutMillis>120000</taskTimeoutMillis>}}
> {{      <forkedJvmArgs>}}
> {{        <arg>-Xms2000m</arg>}}
> {{        <arg>-Xmx5000m</arg>}}
> {{      </forkedJvmArgs>}}
> {{    </params>}}
> {{  </server>}}
> {{</properties>}}
> {quote}
> returns {{"xmpTPg:NPages":"14625"}} in ~53 seconds.
> Yes, I know this is a huge docx file and I don't want to process it. If I don't configure {{byteArrayMaxOverride}}, I get this exception in just over a second:
> {{Tried to allocate an array of length 172,983,026, but the maximum length for this record type is 100,000,000.}}
> The exception is the preferred result. With that in mind, can you answer these questions?
> 1. Will other extractable file types that don't use the OfficeParser also throw the same array allocation error for very large text extractions?
> 2. Is there any way to correlate the array length returned to the number of lines or pages in the associated file to parse?
> 3. Is there an efficient way to calculate the lines or pages of extractable content in a file before sending it for extraction? It doesn't appear that /rmeta with the /ignore path param significantly improves efficiency over calling the /tika endpoint or /rmeta without /ignore.
> If it's useful, I can share the 8MB docx file containing 14k pages.
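One cheap pre-check worth considering for question 3 (a sketch, not part of Tika's API): OOXML formats record a page-count hint in the docProps/app.xml part, so a .docx can be screened by decompressing that single zip entry instead of parsing the whole document body. The {{<Pages>}} value is written by the authoring application and can be stale or absent, so treat it as a heuristic for rejecting obviously oversized files, not an exact count. The function name below is hypothetical.

```python
import re
import zipfile


def docx_page_count(path):
    """Read the page-count hint from an OOXML file's docProps/app.xml.

    Only one zip entry is decompressed, so this is cheap even for a
    multi-megabyte .docx. Returns None when the part or the <Pages>
    element is missing; the value is a hint from the authoring app,
    not a guaranteed page count.
    """
    with zipfile.ZipFile(path) as z:
        try:
            xml = z.read("docProps/app.xml").decode("utf-8")
        except KeyError:
            return None  # no extended-properties part in this archive
    match = re.search(r"<Pages>(\d+)</Pages>", xml)
    return int(match.group(1)) if match else None
```

A caller could reject files above a page threshold (e.g. 1,000) before ever POSTing them to /tika or /rmeta, and fall back to the normal extraction path when the hint is absent.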
--
This message was sent by Atlassian Jira
(v8.20.10#820010)