Posted to dev@tika.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2018/03/21 09:05:00 UTC

[jira] [Commented] (TIKA-2585) TikaInputStream support for resetting via a factory of InputStreams

    [ https://issues.apache.org/jira/browse/TIKA-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407631#comment-16407631 ] 

Hudson commented on TIKA-2585:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1458 (See [https://builds.apache.org/job/Tika-trunk/1458/])
TIKA-2585 Support for creating a TikaInputStream from a Factory that (nick: [https://github.com/apache/tika/commit/682c38db038df7d3e55189623bdc8efb7eb0d0fd])
* (edit) tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java
* (add) tika-core/src/main/java/org/apache/tika/io/InputStreamFactory.java
* (edit) tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java


> TikaInputStream support for resetting via a factory of InputStreams
> -------------------------------------------------------------------
>
>                 Key: TIKA-2585
>                 URL: https://issues.apache.org/jira/browse/TIKA-2585
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.17
>            Reporter: Nick Burch
>            Priority: Major
>             Fix For: 1.18
>
>
> As raised in the 2.0 breaking changes thread, the only way Tika currently has to fully read an InputStream multiple times is {{TikaInputStream.getFile()}}, which spools to a temp file if the stream is not already file-based. (Reading a few KB is handled via buffering and mark/reset, but that doesn't scale to very large files.)
> In some cases, grabbing a fresh {{InputStream}} is actually cheaper than having Tika spool to a temp file, but callers have no way to express that.
> So, before we make too much extra use of re-processing the whole input several times (e.g. for the augmenting-parsers and fallback-parsers), we should provide a way for callers to supply new {{InputStream}} instances on demand instead.
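
For illustration only, the idea described in the issue (a caller hands over something that can open a fresh stream each time, rather than one stream that has to be spooled to a temp file) might look roughly like the sketch below. It uses only the JDK. The names FreshStreamFactory, ReReadDemo, countBytes and readRepeatedly are hypothetical and are not taken from the Tika commit; the sketch also assumes the underlying content does not change between passes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical callback (not Tika's API): returns a new stream over the
// same content every time it is called.
interface FreshStreamFactory {
    InputStream open() throws IOException;
}

class ReReadDemo {
    // Reads the stream to the end and returns the number of bytes seen.
    static long countBytes(InputStream in) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    // Makes `passes` full passes over the content. Each pass opens a new
    // stream from the factory, so nothing is buffered or spooled to disk.
    static long[] readRepeatedly(FreshStreamFactory factory, int passes) throws IOException {
        long[] sizes = new long[passes];
        for (int i = 0; i < passes; i++) {
            try (InputStream in = factory.open()) {
                sizes[i] = countBytes(in);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        // An in-memory source stands in for a cheap-to-reopen resource.
        FreshStreamFactory factory = () -> new ByteArrayInputStream(data);
        long[] sizes = readRepeatedly(factory, 2);
        System.out.println(sizes[0] + " " + sizes[1]);
    }
}
```

The trade-off in this sketch is that the caller, not the library, decides whether reopening the source is cheaper than spooling; a factory backed by a slow network fetch would likely be a poor fit.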



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)