Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2018/02/21 21:42:00 UTC

[jira] [Created] (TIKA-2585) TikaInputStream support for resetting via a factory of InputStreams

Nick Burch created TIKA-2585:
--------------------------------

             Summary: TikaInputStream support for resetting via a factory of InputStreams
                 Key: TIKA-2585
                 URL: https://issues.apache.org/jira/browse/TIKA-2585
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.17, 2.0
            Reporter: Nick Burch


As raised in the 2.0 breaking changes thread, the only way Tika currently has to fully read an InputStream multiple times is `TikaInputStream.getFile()`, which spools the stream to a temp file if it is not already file-based. (Reading a few KB is handled via buffering and mark/reset, but that doesn't scale to reading huge files in full.)

In some cases, grabbing a fresh `InputStream` is actually cheaper than having Tika spool to a temp file, but a caller currently has no way to express that.

So, before we make much more use of re-processing the whole input several times (e.g. for the augmenting-parsers and fallback-parsers), we should provide a way for callers to supply new InputStream instances on demand instead.
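
To make the idea concrete, here is a minimal sketch of what such a factory might look like from the caller's side. It uses only the JDK; the `InputStreamFactory` interface, its `open()` method, and the `readPasses` helper are names assumed for illustration and are not taken from Tika. It shows one way the shape could work, not a proposed final API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReopenableSourceSketch {

    /**
     * Hypothetical callback (not an existing Tika type in this sketch):
     * each call must return a fresh stream positioned at the start of the data.
     */
    @FunctionalInterface
    interface InputStreamFactory {
        InputStream open() throws IOException;
    }

    /** Reads the stream to the end and returns the byte count; stands in for one full parser pass. */
    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /** Runs several full passes over the input, asking the factory for a new stream each time. */
    static long[] readPasses(InputStreamFactory factory, int passes) throws IOException {
        long[] sizes = new long[passes];
        for (int i = 0; i < passes; i++) {
            // No temp file and no mark/reset: "resetting" is simply re-opening.
            try (InputStream in = factory.open()) {
                sizes[i] = countBytes(in);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example document".getBytes(StandardCharsets.UTF_8);
        // The caller knows that re-creating the stream is cheap, and says so by supplying a factory.
        InputStreamFactory factory = () -> new ByteArrayInputStream(data);
        long[] sizes = readPasses(factory, 2);
        System.out.println(sizes[0] + " " + sizes[1]);
    }
}
```

In this sketch the in-memory array stands in for any source that is cheap to re-open (a database blob, an object-store handle, and so on). A caller without such a source would simply not supply a factory, and the existing spool-to-temp-file behaviour would still apply.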



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)