Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2018/05/18 13:07:00 UTC

[jira] [Closed] (PDFBOX-4201) Certain scanned pdfs do not render

     [ https://issues.apache.org/jira/browse/PDFBOX-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed PDFBOX-4201.
-----------------------------------
    Resolution: Won't Fix

> Certain scanned pdfs do not render
> ----------------------------------
>
>                 Key: PDFBOX-4201
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4201
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.8
>            Reporter: Antonio Contreras
>            Priority: Major
>         Attachments: PDFBOX-4201-content-stream.txt, testDoc2.pdf, testDoc2_unc-saved.pdf, testDoc2_unc.pdf
>
>
> I am using PDFBox version 2.0.8. I am trying to render scanned PDFs, but some of them fail to render and throw an error. Native PDFs render without trouble. Most of my scanned PDFs also render correctly, but a couple result in an error (one is attached).
> This is the code I use to render the PDF:
> {code:java}
> try (PDDocument document = load(file)) {
>     logger.debug("start generate image file " + pageNumber + " for " + name);
>     PDFRenderer pdfRenderer = new PDFRenderer(document);
>     return getPageImage(pdfRenderer, pageNumber, name, storageId);
> }{code}
> The call to getPageImage above runs the following code:
> {code:java}
> File imageFile = File.createTempFile(StringUtils.toFilename(storageId) + "_" + pageNumber, ".png");
> imageFile.deleteOnExit();
> final BufferedImage image = pdfRenderer.renderImageWithDPI(pageNumber - 1, dpi, ImageType.RGB);
> ImageIO.write(image, "png", imageFile);
> logger.debug("completed generate image file " + pageNumber + " for " + name);
> return imageFile;{code}
> The error occurs in the second code snippet, on this line:
> {code:java}
> final BufferedImage image = pdfRenderer.renderImageWithDPI(pageNumber - 1, dpi, ImageType.RGB);{code}
>  
> The stack trace is:
> {code:java}
> Caused by: java.io.IOException: Error: Expected operator 'ID' actual='In'
> at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:305) ~[pdfbox-2.0.8.jar:2.0.8]
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:502) ~[pdfbox-2.0.8.jar:2.0.8]
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469) ~[pdfbox-2.0.8.jar:2.0.8]
> at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150) ~[pdfbox-2.0.8.jar:2.0.8]
> at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:203) ~[pdfbox-2.0.8.jar:2.0.8]
> at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:145) ~[pdfbox-2.0.8.jar:2.0.8]
> at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:94) ~[pdfbox-2.0.8.jar:2.0.8]
> at com.sustain.document.PdfPageGenerator.getPageImage(PdfPageGenerator.java:70) ~[classes/:?]
> at com.sustain.document.PdfPageGenerator.getPageImage(PdfPageGenerator.java:59) ~[classes/:?]
> {code}
> Since native PDFs rendered without problems, I initially thought that scanned PDFs in general were the issue. But other scanned PDFs render fine, so I am uncertain what causes some to render and others to fail.
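
The exception message says the parser found the token 'In' where it expected the ID operator. In PDF content streams an inline image is written as BI, then key/value pairs, then ID, the image data, and EI, which suggests the failure sits in an inline image of the page's content stream. The following is a minimal sketch, not PDFBox code, of how a text dump of a content stream (such as the attached PDFBOX-4201-content-stream.txt) could be checked for that pattern. The class InlineImageCheck and the method firstUnexpectedToken are hypothetical names, and the whitespace tokenizer is a deliberate simplification that ignores string literals and binary image data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: a rough check of inline image
// structure (BI <key/value pairs> ID <data> EI) in a textual content stream.
class InlineImageCheck {

    /**
     * Returns the first token found where the ID operator was expected
     * after a BI operator, or null if every BI is followed by /Key value
     * pairs and then ID. Tokens are split on whitespace only (assumption).
     */
    static String firstUnexpectedToken(String contentStream) {
        List<String> tokens = Arrays.asList(contentStream.trim().split("\\s+"));
        int i = 0;
        while (i < tokens.size()) {
            if (!tokens.get(i).equals("BI")) {
                i++;
                continue;
            }
            i++; // step past BI
            // Consume /Key value pairs; a value may be an array like [1 0].
            while (i < tokens.size() && tokens.get(i).startsWith("/")) {
                i++; // key
                if (i < tokens.size() && tokens.get(i).startsWith("[")) {
                    while (i < tokens.size() && !tokens.get(i).endsWith("]")) {
                        i++;
                    }
                }
                i++; // single value, or the token that closes the array
            }
            if (i >= tokens.size()) {
                return "<end of stream>";
            }
            if (!tokens.get(i).equals("ID")) {
                return tokens.get(i); // something other than ID follows the pairs
            }
            // Skip the image data up to the EI operator.
            while (i < tokens.size() && !tokens.get(i).equals("EI")) {
                i++;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample streams for illustration, not taken from testDoc2.pdf.
        System.out.println(firstUnexpectedToken("BI /W 4 /H 4 /CS /G /BPC 8 ID data EI"));
        System.out.println(firstUnexpectedToken("q BI /W 4 /H 4 In 0 0 m Q"));
    }
}
```

A non-null result only narrows down where the stream deviates from the expected operator sequence; it does not say why the producing scanner software wrote it that way.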


