Posted to dev@pdfbox.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2008/08/04 20:22:44 UTC

[jira] Created: (PDFBOX-359) ClassCastException issue when extracting graphics

ClassCastException issue when extracting graphics
-------------------------------------------------

                 Key: PDFBOX-359
                 URL: https://issues.apache.org/jira/browse/PDFBOX-359
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: Jukka Zitting


[Issue from SourceForge]
http://sourceforge.net/tracker/index.php?func=detail&aid=1995807&group_id=78314&atid=552833

Hello

I am evaluating PDFBox 7.0.13 to extract images out of a bunch of PDF
files. These PDF files are all scanned documents. The graphics will then be
passed to an OCR program to extract the text.
During the execution, about 15% of the documents fail with 2 types of
errors:
-------------------------------------------------
java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:501)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:354)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:128)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
Failed to process - reason: Failed to parse file
-------------------------------------------------
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:154)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:166)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
-------------------------------------------------
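A note on the first trace: the PDF specification allows a stream's /DecodeParms entry to be either a single dictionary or an array of dictionaries (one per filter, with null for filters that take no parameters). The trace does not show which value is being cast, so it is only an assumption that /DecodeParms is the COSArray in question. Under that assumption, a defensive lookup might look like the sketch below. It uses plain java.util.Map and java.util.List as stand-ins for COSDictionary and COSArray so that it runs without PDFBox; the class and method names are hypothetical and are not part of the PDFBox API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: Map stands in for COSDictionary, List for COSArray.
final class DecodeParmsSketch {

    // Returns the parameter dictionary for the filter at filterIndex,
    // or null if none is present. Accepts both legal shapes of /DecodeParms.
    static Map<?, ?> parmsForFilter(Object decodeParms, int filterIndex) {
        if (decodeParms instanceof Map) {
            // Single-dictionary form: applies to the stream's only filter.
            return (Map<?, ?>) decodeParms;
        }
        if (decodeParms instanceof List) {
            List<?> items = (List<?>) decodeParms;
            if (filterIndex >= 0 && filterIndex < items.size()
                    && items.get(filterIndex) instanceof Map) {
                return (Map<?, ?>) items.get(filterIndex);
            }
        }
        return null; // absent, null entry, or unexpected type
    }

    // Reads an integer entry, falling back to a default when the
    // dictionary or the key is missing (avoids a NullPointerException).
    static int intEntry(Map<?, ?> dict, String key, int fallback) {
        if (dict == null) {
            return fallback;
        }
        Object value = dict.get(key);
        return (value instanceof Number) ? ((Number) value).intValue() : fallback;
    }
}
```

With this shape, a caller could read the CCITT parameters as intEntry(parms, "Columns", 1728) and intEntry(parms, "K", 0), which are the defaults the PDF specification gives for CCITTFaxDecode.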
My problem is that these documents are classified, so I cannot submit a
test case.
Basically, I have 2 questions:
1. Since these problems always occur at the same location in the code,
can you identify the cause without a test case?
2. Does the CVS version (7.0.14) contain a fix for these problems?

Best regards

JP
dev@softpark.ws

[Comment on SourceForge]
Date: 2008-06-24 07:43
Sender: nobody
Logged In: NO 

I ran the same tests using the PDFBox-0.7.4-dev-20080223 version. The
first error has been replaced by a new one:
Processing java.lang.NullPointerException
	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:529)
	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:372)
	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
	at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:137)
	at PDFBox1.parseDocument(PDFBox1.java:237)
	at PDFBox1.processAll(PDFBox1.java:108)
	at PDFBox1.main(PDFBox1.java:468)
The second error still occurs:
java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
	at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:173)
	at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:190)
	at PDFBox1.parseDocument(PDFBox1.java:237)
	at PDFBox1.processAll(PDFBox1.java:108)
	at PDFBox1.main(PDFBox1.java:468)
I used 5000 files for the test, and about 10% failed with one of these two
exceptions.
Is there a solution, or should I use another library to extract graphics
from PDF files?
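
A note on the second trace: System.arraycopy throws when the requested range runs past the end of either array. The trace only shows that this happens inside None.decode; the actual cause is not visible, and it is an assumption here that the image stream holds fewer bytes than the declared dimensions imply. Under that assumption, a bounds-checked copy such as the sketch below would clamp the length instead of throwing. The class and method names are hypothetical and do not come from PDFBox.

```java
// Hypothetical sketch of a clamped row copy; not PDFBox code.
final class SafeRowCopy {

    // Copies at most `length` bytes from src to dest, clamped to what both
    // arrays can hold from the given offsets. Returns the number of bytes
    // actually copied, so the caller can detect a short row.
    static int copyRow(byte[] src, int srcOffset,
                       byte[] dest, int destOffset, int length) {
        if (srcOffset < 0 || destOffset < 0 || length <= 0) {
            return 0;
        }
        if (srcOffset >= src.length || destOffset >= dest.length) {
            return 0;
        }
        int available = Math.min(src.length - srcOffset, dest.length - destOffset);
        int count = Math.min(length, available);
        System.arraycopy(src, srcOffset, dest, destOffset, count);
        return count;
    }
}
```

A return value smaller than the requested length signals truncated image data, which the caller could log per file rather than aborting the whole batch.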

Best regards

JP
dev@softpark.ws

