Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2018/02/21 21:42:00 UTC
[jira] [Created] (TIKA-2585) TikaInputStream support for resetting via a factory of InputStreams
Nick Burch created TIKA-2585:
--------------------------------
Summary: TikaInputStream support for resetting via a factory of InputStreams
Key: TIKA-2585
URL: https://issues.apache.org/jira/browse/TIKA-2585
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.17, 2.0
Reporter: Nick Burch
As raised in the 2.0 breaking changes thread, the only way Tika currently has of handling the need to fully read an InputStream multiple times is `TikaInputStream.getFile()`, which spools to a temp file if the stream is not already file-based. (Reading the first few KB is handled via buffering and mark/reset, but that doesn't scale to re-reading huge files in full.)
In some cases, grabbing a fresh `InputStream` is actually cheaper than having Tika spool to a temp file, but a caller currently has no way to express that.
So, before we make much more use of re-processing the whole input several times (e.g. for the augmenting-parsers and fallback-parsers), we should provide a way for callers to supply new InputStream instances on demand instead.
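To make the idea concrete, here is a rough sketch of what "a factory of InputStreams" could look like from the caller's side. It is only an illustration: `StreamFactory`, `open()` and `MultiPass` are hypothetical names made up for this example, not existing or proposed Tika API, and the sketch does not touch `TikaInputStream` at all. Each pass over the data asks the factory for a brand-new stream rather than resetting or spooling the old one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical interface, for illustration only (not Tika API).
// The caller knows how to cheaply re-open its data and hands that knowledge over.
interface StreamFactory {
    InputStream open() throws IOException;
}

final class MultiPass {
    private MultiPass() {}

    // Fully reads the input 'passes' times, requesting a fresh stream from the
    // factory for each pass, and returns the total number of bytes read.
    static long totalBytes(StreamFactory factory, int passes) {
        long total = 0;
        byte[] buf = new byte[8192];
        for (int i = 0; i < passes; i++) {
            try (InputStream in = factory.open()) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n;
                }
            } catch (IOException e) {
                // Wrapped so that callers of this sketch need no checked-exception handling.
                throw new UncheckedIOException(e);
            }
        }
        return total;
    }
}
```

A caller with an in-memory or otherwise cheaply re-openable source could then pass something like `() -> new ByteArrayInputStream(bytes)`, and each parser pass would get its own stream without any temp file being written.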
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)