Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/06/13 15:11:00 UTC

[jira] [Commented] (TIKA-3779) Temp file leftover in PDFParser.parse()

    [ https://issues.apache.org/jira/browse/TIKA-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553636#comment-17553636 ] 

Tim Allison commented on TIKA-3779:
-----------------------------------

Thank you, [~tilman]. I fixed the PDF issue and added the null check you recommended.  I'll push this update shortly.

The stream stuff has become messy because of the refactored rendering code.

I'm going to try to fix the grib stuff now.

> Temp file leftover in PDFParser.parse()
> ---------------------------------------
>
>                 Key: TIKA-3779
>                 URL: https://issues.apache.org/jira/browse/TIKA-3779
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.4.0
>            Reporter: Tilman Hausherr
>            Assignee: Tim Allison
>            Priority: Minor
>
> I've wondered where the many "apache-tika-" files in the temp directory came from. It turns out that they are all (or most) PDF files so I looked at the PDF parser module. After looking at the file sizes and getting a file name I focused on the test {{PDFParserTest.testSortByPosition()}} where the first 2 parse tests have a leftover file and the 3rd one doesn't.
> The difference is that in the third one, {{PDFParser.parse()}} gets a {{TikaInputStream}} as its parameter, and {{TikaInputStream.get()}} simply returns that parameter. In the first two, it creates a new object, which is never closed, so the resource cleanup is never done.
> Adding
> {code}
>             if (!(stream instanceof TikaInputStream)) {
>                 tstream.close();
>             }
> {code}
> fixes this, i.e. no leftover files after running PDFParserTest.
> There's a null check in that method, but later the object is used without a null check. So either the null check isn't needed, or there is an NPE risk.
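
For illustration, here is a minimal, self-contained sketch of the ownership pattern described above. It does not use Tika: SpoolingStream is a hypothetical stand-in for a wrapper that spools to a temp file and deletes it on close(), and parse() is a hypothetical method, not the actual PDFParser code. The idea is only that whoever creates the wrapper is the one who closes it.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TempFileCleanupSketch {

    /** Hypothetical stand-in for a stream wrapper that may spool to a temp file. */
    static class SpoolingStream extends FilterInputStream {
        private Path tempFile; // created lazily, deleted in close()

        private SpoolingStream(InputStream in) {
            super(in);
        }

        /** Returns the argument itself if it is already a SpoolingStream, else wraps it. */
        static SpoolingStream get(InputStream in) {
            if (in instanceof SpoolingStream) {
                return (SpoolingStream) in;
            }
            return new SpoolingStream(in);
        }

        /** Spools the wrapped stream to a temp file on first use. */
        Path getPath() throws IOException {
            if (tempFile == null) {
                tempFile = Files.createTempFile("sketch-", ".tmp");
                Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
            }
            return tempFile;
        }

        /** Closes the wrapped stream and removes the temp file, if any. */
        @Override
        public void close() throws IOException {
            try {
                super.close();
            } finally {
                if (tempFile != null) {
                    Files.deleteIfExists(tempFile);
                    tempFile = null;
                }
            }
        }
    }

    /** Hypothetical parse method; returns the temp path so callers can check cleanup. */
    static Path parse(InputStream stream) throws IOException {
        SpoolingStream tstream = SpoolingStream.get(stream);
        try {
            // Real parsing would read from this file.
            return tstream.getPath();
        } finally {
            // Close only a wrapper created here; a caller-supplied one
            // stays open and is cleaned up by the caller.
            if (!(stream instanceof SpoolingStream)) {
                tstream.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = parse(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println("leftover after parse(plain stream): " + Files.exists(p));
    }
}
```

If the caller passes in a SpoolingStream of its own, parse() leaves it open and the temp file lives until the caller closes that stream, for example with try-with-resources.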



--
This message was sent by Atlassian Jira
(v8.20.7#820007)