Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/01/19 13:00:29 UTC

[jira] [Commented] (TIKA-2244) excessive memory usage when parsing a large nested package file

    [ https://issues.apache.org/jira/browse/TIKA-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15829867#comment-15829867 ] 

Tim Allison commented on TIKA-2244:
-----------------------------------

PR committed.  Thank you!  I found a few other places to check {{markSupported}} before wrapping in a new {{BufferedInputStream}} in tika-parsers.

Should we also update {{AutoDetectReader}} to check for {{markSupported}} before wrapping?

> excessive memory usage when parsing a large nested package file
> ---------------------------------------------------------------
>
>                 Key: TIKA-2244
>                 URL: https://issues.apache.org/jira/browse/TIKA-2244
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 2.0
>            Reporter: Joshua Hight
>            Priority: Minor
>
> When parsing large nested files (good examples are Maven jars and Git objects), a large number of BufferedInputStreams are created, and their buffers take up a large amount of memory. Looking through the relevant code, I saw that many of these allocations come from TikaInputStream.get(InputStream, TemporaryResources), which checks whether the InputStream is a BufferedInputStream or ByteArrayInputStream in order to determine whether or not mark is supported.
> Unfortunately, it is common practice to wrap InputStreams in CloseShieldInputStreams, which causes that check to fail even if mark is in fact supported.
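
For illustration only, here is a minimal sketch of the difference between the type-based check described in the report and a check based on markSupported(). It is not the actual Tika code. MarkSupportDemo, ShieldingStream, wrapByType and wrapByCapability are hypothetical names, and ShieldingStream is only a stand-in for a close-shielding wrapper such as CloseShieldInputStream, so that the example needs nothing beyond the JDK.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

class MarkSupportDemo {

    /** Hypothetical stand-in for a close-shielding wrapper: delegates everything except close(). */
    static class ShieldingStream extends FilterInputStream {
        ShieldingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally leave the underlying stream open.
        }
    }

    /** Type-based check: any stream that is not one of two known classes gets a new buffer. */
    static InputStream wrapByType(InputStream in) {
        if (in instanceof BufferedInputStream || in instanceof ByteArrayInputStream) {
            return in;
        }
        return new BufferedInputStream(in);
    }

    /** Capability-based check: a new buffer is allocated only if mark/reset is unavailable. */
    static InputStream wrapByCapability(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        // The wrapped stream supports mark, because FilterInputStream delegates markSupported().
        InputStream shielded = new ShieldingStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));

        System.out.println("type check reuses stream:       " + (wrapByType(shielded) == shielded));
        System.out.println("capability check reuses stream: " + (wrapByCapability(shielded) == shielded));
    }
}
```

With deeply nested packages, each level that goes through the type-based path allocates another buffer, which is the growth pattern the report describes. The capability-based path reuses the stream whenever it already reports mark support.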



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)