Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2012/05/18 19:05:07 UTC

[jira] [Closed] (PDFBOX-359) ClassCastException issue when extracting graphics

     [ https://issues.apache.org/jira/browse/PDFBOX-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-359.
-------------------------------------

    Resolution: Cannot Reproduce
      Assignee: Andreas Lehmkühler

Closing this issue because we don't have a sample PDF to reproduce it.
                
> ClassCastException issue when extracting graphics
> -------------------------------------------------
>
>                 Key: PDFBOX-359
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-359
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Jukka Zitting
>            Assignee: Andreas Lehmkühler
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1995807&group_id=78314&atid=552833
> Hello
> I am evaluating PDFBox 7.0.13 to extract images out of a bunch of PDF
> files. These PDF files are all scanned documents. The graphics will then be
> passed to an OCR program to extract the text.
> During the execution, about 15% of the documents fail with 2 types of
> errors:
> -------------------------------------------------
> java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
> at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:501)
> at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
> at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:354)
> at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:128)
> at PDFBox1.parseDocument(PDFBox1.java:237)
> at PDFBox1.processAll(PDFBox1.java:108)
> at PDFBox1.main(PDFBox1.java:468)
> Failed to process - reason: Failed to parse file
> -------------------------------------------------
> java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
> at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:154)
> at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:166)
> at PDFBox1.parseDocument(PDFBox1.java:237)
> at PDFBox1.processAll(PDFBox1.java:108)
> at PDFBox1.main(PDFBox1.java:468)
> -------------------------------------------------
> My problem is that these documents are classified, so I cannot submit a
> test case.
> Basically, I have 2 questions:
> 1. Since these problems always occur at the same code locations, can you
> identify the problem without a test case?
> 2. Does the CVS version (7.0.14) contain a fix for these problems?
> Best regards
> JP
> dev@softpark.ws
> [Comment on SourceForge]
> Date: 2008-06-24 07:43
> Sender: nobody
> Logged In: NO 
> I ran the same tests using the PDFBox-0.7.4-dev-20080223 version. The
> first error has been replaced by a new one:
> Processing java.lang.NullPointerException
> 	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:529)
> 	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:372)
> 	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
> 	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:137)
> 	at PDFBox1.parseDocument(PDFBox1.java:237)
> 	at PDFBox1.processAll(PDFBox1.java:108)
> 	at PDFBox1.main(PDFBox1.java:468)
> The second error still occurs:
> java.lang.ArrayIndexOutOfBoundsException
> 	at java.lang.System.arraycopy(Native Method)
> 	at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
> 	at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:173)
> 	at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:190)
> 	at PDFBox1.parseDocument(PDFBox1.java:237)
> 	at PDFBox1.processAll(PDFBox1.java:108)
> 	at PDFBox1.main(PDFBox1.java:468)
> I used 5000 files for the test, and about 10% fail with one of these two
> exceptions.
> Is there a solution, or should I use another library to extract graphics
> out of PDF files?
> Best regards
> JP
> dev@softpark.ws
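
A note on the first trace: the PDF specification allows a stream's DecodeParms entry to be either a single dictionary or an array of dictionaries (one per filter), so an unconditional cast to a dictionary could fail the way the ClassCastException above does. Whether that is what happened in these files is not confirmed in this report, since no sample PDF was available. The sketch below only illustrates a type check before the cast. It uses plain java.util Map and List as stand-ins for COSDictionary and COSArray, and the class DecodeParmsSketch and its methods are hypothetical, not part of PDFBox. The fallback values are the CCITTFaxDecode defaults from the PDF specification (K = 0, Columns = 1728).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of PDFBox. Map stands in for COSDictionary
// and List for COSArray so the sketch runs without the library.
final class DecodeParmsSketch {

    // Returns the first dictionary found in a DecodeParms-like value,
    // whether the value is a single dictionary or an array of entries.
    // Simplification: a real implementation would pick the array entry
    // whose index matches the CCITTFaxDecode filter.
    static Optional<Map<?, ?>> firstParams(Object decodeParms) {
        if (decodeParms instanceof Map) {
            return Optional.<Map<?, ?>>of((Map<?, ?>) decodeParms);
        }
        if (decodeParms instanceof List) {
            for (Object entry : (List<?>) decodeParms) {
                if (entry instanceof Map) {
                    return Optional.<Map<?, ?>>of((Map<?, ?>) entry);
                }
            }
        }
        // Missing entry, null, or an unexpected type: no parameters.
        return Optional.empty();
    }

    // Reads an integer parameter, falling back to a default when the
    // dictionary or the key is absent, instead of dereferencing null.
    static int intOrDefault(Map<?, ?> params, String key, int fallback) {
        if (params == null) {
            return fallback;
        }
        Object value = params.get(key);
        return (value instanceof Number) ? ((Number) value).intValue() : fallback;
    }

    public static void main(String[] args) {
        Object single = Map.of("K", -1, "Columns", 2480);
        Object array = Arrays.asList(null, Map.of("K", -1));
        for (Object candidate : new Object[] {single, array, null}) {
            Map<?, ?> params = firstParams(candidate).orElse(null);
            System.out.println("K=" + intOrDefault(params, "K", 0)
                    + " Columns=" + intOrDefault(params, "Columns", 1728));
        }
    }
}
```

Reading every parameter through a defaulting accessor also avoids the kind of NullPointerException shown in the follow-up comment when the entry is missing altogether, although the report does not establish that a missing entry was the cause.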
