Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2014/10/23 18:53:34 UTC

[jira] [Commented] (PDFBOX-1151) StreamCorruptedException on bad PDF with -force

    [ https://issues.apache.org/jira/browse/PDFBOX-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181559#comment-14181559 ] 

Andreas Lehmkühler commented on PDFBOX-1151:
--------------------------------------------

Even the new self-repair doesn't work here. I get a different exception, but the result is the same: nothing is rendered.

{code}
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException
	at org.apache.pdfbox.filter.LZWFilter.doLZWDecode(LZWFilter.java:120)
	at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:95)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:386)
	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:327)
	at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:244)
	at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:109)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:216)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:198)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:152)
	at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:179)
	at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:215)
	at org.apache.pdfbox.rendering.PDFRenderer.renderPageToGraphics(PDFRenderer.java:177)
	at org.apache.pdfbox.rendering.PDFRenderer.renderPageToGraphics(PDFRenderer.java:161)
	at org.apache.pdfbox.tools.gui.PDFPagePanel.paint(PDFPagePanel.java:87)
{code}


> StreamCorruptedException on bad PDF with -force
> -----------------------------------------------
>
>                 Key: PDFBOX-1151
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1151
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0, 1.8.7, 2.0.0
>         Environment: Windows Vista
> Sun JDK 1.6.0_26
>            Reporter: Stas Shaposhnikov
>         Attachments: PDFStreamEngine.patch, test.pdf
>
>
> I am getting the StreamCorruptedException when trying to parse a possibly invalid PDF document even if the -force option is specified.
> Stack trace:
> java.io.StreamCorruptedException: Error: data is null
> 	at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
> 	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
> 	at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
> 	at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
> 	at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
> 	at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
> 	at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
> 	at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
> 	at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
> 	at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
> 	at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)
> My suggestion is to skip bad sub-streams in PDFStreamEngine.processSubStream() without throwing exceptions when forceParsing is true.
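
The suggestion quoted above could be sketched roughly as follows. This is only an illustration of the "skip bad sub-streams when forceParsing is true" idea, not the attached patch and not PDFBox code: the class ForceParsingSketch, the SubStream interface and the processAll method are hypothetical stand-ins. Catching RuntimeException as well as IOException is an assumption made here because the newer trace fails with a NullPointerException rather than a StreamCorruptedException.

```java
import java.io.IOException;
import java.io.StreamCorruptedException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ForceParsingSketch {

    /** Hypothetical stand-in for a content sub-stream; not a PDFBox type. */
    interface SubStream {
        String decode() throws IOException;
    }

    /**
     * Decodes each sub-stream in order. With forceParsing enabled, a sub-stream
     * that fails to decode is reported and skipped; otherwise the first failure
     * is rethrown to the caller.
     */
    static List<String> processAll(List<SubStream> streams, boolean forceParsing)
            throws IOException {
        List<String> decoded = new ArrayList<>();
        for (SubStream stream : streams) {
            try {
                decoded.add(stream.decode());
            } catch (IOException | RuntimeException e) {
                if (!forceParsing) {
                    throw e; // strict mode: propagate the original exception
                }
                // Lenient mode: log the problem and continue with the next sub-stream.
                System.err.println("Skipping bad sub-stream: " + e);
            }
        }
        return decoded;
    }

    public static void main(String[] args) throws IOException {
        List<SubStream> streams = Arrays.asList(
            () -> "page 1 text",
            () -> { throw new StreamCorruptedException("Error: data is null"); },
            () -> { throw new NullPointerException(); },
            () -> "page 2 text");
        // Prints [page 1 text, page 2 text]; the two bad sub-streams are skipped.
        System.out.println(processAll(streams, true));
    }
}
```

With forceParsing set to false the same input stops at the first bad sub-stream and the StreamCorruptedException reaches the caller, which matches the behaviour reported in the issue.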



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)