Posted to dev@tika.apache.org by "Narendran Solai Sridharan (Jira)" <ji...@apache.org> on 2022/11/07 17:00:00 UTC

[jira] [Comment Edited] (TIKA-3919) Out of Memory during file parsing in AutoDetectParser

    [ https://issues.apache.org/jira/browse/TIKA-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629914#comment-17629914 ] 

Narendran Solai Sridharan edited comment on TIKA-3919 at 11/7/22 4:59 PM:
--------------------------------------------------------------------------

Thanks for the quick reply. AutoDetectParser is being used to *detect* the content type and then to *parse* and {*}index{*} the file.

POIFSContainerDetector sets a hard limit of 134217728 bytes (128 MiB) for "markLimit". Is there a way to configure this limit through AutoDetectParser?
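One possible route, assuming the markLimit setter on POIFSContainerDetector is exposed as a detector parameter in your 2.x release (worth verifying against the Tika configuration docs; the 16 MiB value below is purely illustrative), is a tika-config.xml sketch like:

```
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <detector class="org.apache.tika.detect.microsoft.POIFSContainerDetector">
      <params>
        <!-- hypothetical override: lower the per-detection buffer ceiling -->
        <param name="markLimit" type="int">16777216</param>
      </params>
    </detector>
  </detectors>
</properties>
```

The resulting config could then be handed to the parser via the TikaConfig-taking constructor, e.g. `new AutoDetectParser(new TikaConfig(configPath))`, so the configured detector replaces the default one.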

We found that the issue does not occur if the files are not DRM-protected; if the files are DRM-protected, the out-of-memory error occurs. Parsing a single such document on its own shows no issue.
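This would explain why only parallel parsing fails: if each detection can buffer up to markLimit bytes of lookahead per thread, the worst-case heap demand scales with the thread count. A minimal back-of-the-envelope sketch (the thread count of 16 is a hypothetical example, not taken from the report):

```java
// Sketch: aggregate lookahead-buffer demand under parallel detection.
// Assumes each concurrent detect() may buffer up to markLimit bytes;
// the thread count below is illustrative.
public class MarkLimitMath {
    public static void main(String[] args) {
        long markLimit = 134_217_728L; // 128 MiB, the hard-coded default
        int threads = 16;              // hypothetical parallel parse threads
        long worstCaseBytes = markLimit * threads;
        System.out.println("Worst-case buffer demand: "
                + (worstCaseBytes >> 20) + " MiB");
    }
}
```

At 16 threads that is already 2 GiB of potential buffer space, which would exhaust a typical default heap regardless of how small the individual files are.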

 


> Out of Memory during file parsing in AutoDetectParser
> -----------------------------------------------------
>
>                 Key: TIKA-3919
>                 URL: https://issues.apache.org/jira/browse/TIKA-3919
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser, tika-core
>    Affects Versions: 2.4.1
>         Environment: OS : Windows 10,
> Software Platform : Java
>  
>  
>            Reporter: Narendran Solai Sridharan
>            Priority: Major
>         Attachments: Large Object-1.PNG, Model.xlsx, Thread dump.PNG
>
>
> An OutOfMemoryError occurs during file parsing in AutoDetectParser. The issue occurs for almost all newly created Microsoft documents when parsing documents in parallel across different threads; there seems to be a problem with parsing new documents :(
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.tika.io.LookaheadInputStream.<init>(LookaheadInputStream.java:66)
>     at org.apache.tika.io.TikaInputStream.getPath(TikaInputStream.java:683)
>     at org.apache.tika.detect.microsoft.POIFSContainerDetector.getTopLevelNames(POIFSContainerDetector.java:467)
>     at org.apache.tika.detect.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:530)
>     at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:85)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:142)
> The issue appeared while load-testing our existing environment, which had been upgraded from Tika 1.28.1 to 2.4.1.
> The following, almost empty file [^Model.xlsx] was parsed multiple times via a client program driven by JMeter. It seems we are getting the OutOfMemoryError because of the limit "markLimit = 134217728", but we are not sure.
>  
> !Thread dump.PNG!
>  
> !Large Object-1.PNG!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)