Posted to dev@tika.apache.org by "Luís Filipe Nassif (Jira)" <ji...@apache.org> on 2020/04/10 03:53:00 UTC

[jira] [Comment Edited] (TIKA-2849) TikaInputStream copies the input stream locally

    [ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080195#comment-17080195 ] 

Luís Filipe Nassif edited comment on TIKA-2849 at 4/10/20, 3:52 AM:
--------------------------------------------------------------------

Hi [~boris-petrov],

A number of Tika parsers need a java.io.File because Tika's underlying dependencies require one. Looking at the current sources, I found that File is needed by the parsers for rar, 7z, pst, mp4, jpg, tif, webp, sqlite, and maybe others... Currently there is no way to know whether a parser will spool the stream or not.

But my organization has a project with a hard requirement to run a search tool on computers/cellphones with very limited resources in the field, and we would rather receive an IOException("File size larger than max spool limit") from parsers than wait too long in dangerous places or exhaust the machine's resources and crash the app...

[~tallison], what do you think of a new TikaInputStream constructor that takes the spool limit, or a setMaxSpoolSize() method to set it? If the limit is reached in getPath(), TikaInputStream should throw the IOException above. The problem is that many parsers (including third-party parsers) use getPath(), not a new getPath(maxBytes); and if the latter were used, maxBytes would have to reach the parsers in some way (ParseContext?). I prefer the first approach...

If approved, I can code that; it is simple.
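For illustration, here is a minimal JDK-only sketch of the bounded spool behavior I have in mind. The names (spoolWithLimit, maxBytes) are hypothetical, not the actual Tika API; the real change would live inside TikaInputStream.getPath():

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SpoolLimitSketch {

    // Spools 'in' to a temp file, failing fast as soon as maxBytes is exceeded,
    // instead of copying a possibly huge stream to completion.
    static Path spoolWithLimit(InputStream in, long maxBytes) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
                if (total > maxBytes) {
                    throw new IOException("File size larger than max spool limit");
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            // try-with-resources has already closed 'out' here, so the
            // partial temp file can be deleted even on Windows
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }
}
```

The key point is that the caller gets a deterministic failure after at most maxBytes of I/O, rather than an unbounded copy.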



> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.21
>
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream", execution gets to "TikaInputStream#getPath", which does a "Files.copy(in, path, REPLACE_EXISTING);" — and that is very, very bad. This input stream could be, as in our case, a stream from a network file that is tens or hundreds of gigabytes large. Copying it locally is a huge waste of resources, to say the least. Why does it do that, and can I make it not do it? Or is this something that has to be fixed in Tika?
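As background for the complaint above: content-type detection generally only needs the first bytes of a stream, which can be read and then rewound via mark/reset rather than spooled to disk. A minimal JDK-only sketch (a hypothetical helper, not Tika's actual detector code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class HeaderSniff {

    // Reads up to 'len' header bytes for magic-number sniffing, then rewinds
    // so the stream is positioned at the start again for the caller.
    // Requires a mark-supporting stream (e.g. wrapped in BufferedInputStream).
    static byte[] peekHeader(InputStream in, int len) throws IOException {
        in.mark(len);
        byte[] buf = new byte[len];
        int total = 0, n;
        while (total < len && (n = in.read(buf, total, len - total)) != -1) {
            total += n;
        }
        in.reset();
        return Arrays.copyOf(buf, total);
    }
}
```

With this approach a multi-gigabyte network stream costs only a few kilobytes of reading for detection, which is why spooling the whole stream is so surprising to callers.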



--
This message was sent by Atlassian Jira
(v8.3.4#803005)