Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2014/11/12 14:01:33 UTC

[jira] [Comment Edited] (TIKA-1471) OOM with corrupt PDF file

    [ https://issues.apache.org/jira/browse/TIKA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208008#comment-14208008 ] 

Tim Allison edited comment on TIKA-1471 at 11/12/14 1:00 PM:
-------------------------------------------------------------

From the discussion on PDFBOX-2493, this looks to be solved by PDFBox 1.8.7, which we're now using in trunk.

Thank you, [~alanbur], for reporting this issue on both Tika and PDFBox.  We need to fix these serious errors as they are discovered.  

At this point, code that uses Tika needs to be able to handle regular exceptions, OOM errors, and permanent hangs. These catastrophic errors are rare, but they do happen.

Use of the ForkParser and tika server can help avoid some of these issues, and on TIKA-1330, we're working to develop a robust wrapper around Tika that can handle these types of problems so that every integrator doesn't have to reinvent the wheel.
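
As a rough illustration of the in-process handling described above, here is a minimal sketch in plain Java. It is not Tika's API and not the wrapper being developed on TIKA-1330: the names GuardedCall, Status, Outcome and run are hypothetical, and the Callable stands in for whatever parse call the integrator makes. It runs the task on a worker thread, enforces a timeout, and reports ordinary exceptions, OutOfMemoryError and timeouts separately.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika: runs a task with a time limit and
// classifies how it ended.
final class GuardedCall {

    enum Status { OK, FAILED, OUT_OF_MEMORY, TIMED_OUT }

    static final class Outcome<T> {
        final Status status;
        final T value;         // set only when status == OK
        final Throwable cause; // set for every status except OK

        Outcome(Status status, T value, Throwable cause) {
            this.status = status;
            this.value = value;
            this.cause = cause;
        }
    }

    static <T> Outcome<T> run(Callable<T> task, long timeoutMillis) {
        // Daemon worker: a task that never returns must not keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-call");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            T value = future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return new Outcome<>(Status.OK, value, null);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker, it cannot kill it.
            future.cancel(true);
            return new Outcome<>(Status.TIMED_OUT, null, e);
        } catch (ExecutionException e) {
            // The Future captures anything the task threw, including Errors.
            Throwable cause = e.getCause();
            Status status = (cause instanceof OutOfMemoryError)
                    ? Status.OUT_OF_MEMORY
                    : Status.FAILED;
            return new Outcome<>(status, null, cause);
        } catch (InterruptedException e) {
            // The calling thread was interrupted while waiting.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return new Outcome<>(Status.FAILED, null, e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would pass its own parse call as the Callable and branch on the returned status. Note the limits of this approach: a worker that ignores interruption keeps running after the timeout, and a JVM that has hit an OOM error may not be in a healthy state afterwards. Running the parse in a separate process, as with the ForkParser or tika server, is what actually contains those cases.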




was (Author: tallison@mitre.org):
From the discussion on PDFBOX-2493, this looks to be solved by PDFBox 1.8.8.  I'll leave this open until we upgrade.

Thank you, [~alanbur], for reporting this issue on both Tika and PDFBox.  We need to fix these serious errors as they are discovered.  

At this point, code that uses Tika needs to be able to handle regular exceptions, OOM errors, and permanent hangs. These catastrophic errors are rare, but they do happen.

Use of the ForkParser and tika server can help avoid some of these issues, and on TIKA-1330, we're working to develop a robust wrapper around Tika that can handle these types of problems so that every integrator doesn't have to reinvent the wheel.



> OOM with corrupt PDF file
> -------------------------
>
>                 Key: TIKA-1471
>                 URL: https://issues.apache.org/jira/browse/TIKA-1471
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.6
>         Environment: Linux, JVM 1.8.0_25-b17, 64-bit
>            Reporter: Alan Burlison
>            Priority: Blocker
>             Fix For: 1.7
>
>
> Use of PDFBox 1.8.6 by Tika 1.6 is causing OOM errors with corrupt PDF files, due to a bug in PDFBox, see PDFBOX-2493. This makes Tika 1.6 unusable from inside a long-running webapp and I've had to revert to Tika 1.5. Although 1.5 also throws errors with the corrupt file it does not cause OOM errors.


