Posted to dev@tika.apache.org by "Boris Petrov (Jira)" <ji...@apache.org> on 2020/04/08 14:27:00 UTC

[jira] [Commented] (TIKA-2849) TikaInputStream copies the input stream locally

    [ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078338#comment-17078338 ] 

Boris Petrov commented on TIKA-2849:
------------------------------------

[~tallison] - We hit the same problem that the original issue was about, but this time during parsing. These two:

{noformat}
org.apache.tika.parser.mp4.MP4Parser.parse(MP4Parser.java:132)
org.apache.tika.parser.external.ExternalParser.parse(ExternalParser.java:222)
{noformat}

Both copy the file on the latest Tika (1.24). Could the same fix be applied to them? If not, my previous question remains very relevant: for us, copying the whole file is horrible, and we have to guard against it happening. So an option to tell Tika not to do it (just blow up, return an empty string, or something similar) is very important. Or at least a way of knowing in advance whether Tika will copy or not.

What do you think?
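
To illustrate the "just blow up" behavior asked for above, here is a minimal sketch of a caller-side guard that uses only the JDK and no Tika API. The class names {{ByteLimitInputStream}} and {{ByteLimitDemo}} and the 4-byte limit are hypothetical, chosen only for the example. The assumption is that any code that tries to spool the wrapped stream to disk (for example via {{Files.copy}}) reads through this wrapper, so the copy fails with an {{IOException}} once the limit is exceeded instead of running to completion. Whether a given Tika parser reads through such a wrapper rather than unwrapping it has not been verified here.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical guard: throws once more than maxBytes have been consumed. */
class ByteLimitInputStream extends FilterInputStream {
    private final long maxBytes;
    private long bytesRead;
    private long markedCount;

    ByteLimitInputStream(InputStream in, long maxBytes) {
        super(in);
        this.maxBytes = maxBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count(1);
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count(n);
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count(skipped); // skipped bytes are consumed from the source too
        return skipped;
    }

    @Override
    public synchronized void mark(int readLimit) {
        super.mark(readLimit);
        markedCount = bytesRead; // remember the count so reset() can restore it
    }

    @Override
    public synchronized void reset() throws IOException {
        super.reset();
        bytesRead = markedCount; // re-read bytes are not counted twice
    }

    private void count(long n) throws IOException {
        bytesRead += n;
        if (bytesRead > maxBytes) {
            throw new IOException("Refusing to read more than " + maxBytes + " bytes");
        }
    }
}

class ByteLimitDemo {
    public static void main(String[] args) throws IOException {
        // Stand-in for a large network stream: 10 bytes with a 4-byte limit.
        InputStream source = new ByteLimitInputStream(
                new ByteArrayInputStream(new byte[10]), 4);
        Path target = Files.createTempFile("spool", ".tmp");
        try {
            // The same kind of call the issue describes; it aborts at the limit.
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            System.out.println("copied without hitting the limit");
        } catch (IOException e) {
            System.out.println("copy aborted: " + e.getMessage());
        } finally {
            Files.deleteIfExists(target);
        }
    }
}
```

A guard like this only caps the damage. It does not tell the caller in advance whether a copy will happen, which is the other half of the request.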

> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.21
>
>
> When calling "tika.detect(stream, name)" with a stream that is a "TikaInputStream", execution reaches "TikaInputStream#getPath", which does a "Files.copy(in, path, REPLACE_EXISTING);". That is very, very bad. The input stream could be, as in our case, a stream from a network file that is tens or hundreds of gigabytes in size. Copying it locally is a huge waste of resources, to say the least. Why does it do that, and can I make it not do it? Or is this something that has to be fixed in Tika?


