Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2017/10/09 18:19:00 UTC

[jira] [Resolved] (PDFBOX-3955) new -- very slow processing on truncated PDF

     [ https://issues.apache.org/jira/browse/PDFBOX-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-3955.
----------------------------------------
    Resolution: Fixed

I've fixed the very slow performance: object streams were parsed multiple times when rebuilding the trailer dictionary. However, my fix doesn't "heal" the truncated PDF. The file is corrupt and can't be repaired because the root object is missing.
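
To illustrate the general idea of parsing each object stream only once, here is a minimal sketch. It is not the actual PDFBox change: the class ObjectStreamCache, its methods, and the parser callback are hypothetical names made up for this example, and the "parse" step is a stand-in for whatever work reading an object stream really involves.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongFunction;

/**
 * Hypothetical helper (not part of PDFBox): remembers the result of parsing
 * an object stream so that repeated lookups do not parse it again.
 */
final class ObjectStreamCache {
    // Maps an object stream's object number to the object numbers it contains.
    private final Map<Long, List<Long>> parsed = new HashMap<>();
    // Stand-in for the expensive parsing step (assumption for this sketch).
    private final LongFunction<List<Long>> parser;
    // Counts how often the expensive parser was actually invoked.
    private int parseCount = 0;

    ObjectStreamCache(LongFunction<List<Long>> parser) {
        this.parser = parser;
    }

    /** Returns the contents of the given object stream, parsing it only on the first request. */
    List<Long> objectsIn(long streamObjectNumber) {
        List<Long> cached = parsed.get(streamObjectNumber);
        if (cached == null) {
            parseCount++;
            cached = parser.apply(streamObjectNumber);
            parsed.put(streamObjectNumber, cached);
        }
        return cached;
    }

    int parseCount() {
        return parseCount;
    }
}
```

With a cache like this, code that rebuilds the trailer can ask for the same object stream any number of times while the expensive parse runs at most once per stream.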

[~tallison@mitre.org] Thanks for the finding.

> new -- very slow processing on truncated PDF
> --------------------------------------------
>
>                 Key: PDFBOX-3955
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3955
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>            Reporter: Tim Allison
>            Assignee: Andreas Lehmkühler
>             Fix For: 2.0.8, 3.0.0
>
>
> In the latest regression run with PDFBox's 2.x branch, we're now getting very slow processing on a truncated PDF with PDFBox app's {{ExtractText}}:
> http://162.242.228.174/docs/truncated_pdfs/commoncrawl2_likely_broken/7K/7KK53NK5PVKOUGDSQ4FK6542BNPC4SWB
> Turns out this is not an infinite loop.  After 4.5 minutes, {{ExtractText}} eventually ended with: 
> {noformat}
> Exception in thread "main" java.io.IOException: Missing root object specification in trailer.
>         at org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2508)
>         at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:193)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1012)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:950)
>         at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:192)
>         at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:82)
>         at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
> {noformat}
> .



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org