Posted to users@pdfbox.apache.org by "luo_123@msn.com" <lu...@msn.com> on 2015/01/04 15:54:03 UTC
Errors using extractRegions method
Hello,
I am using PDFBox to extract annotations and remarks from PDF files, but I ran into something odd. I found a snippet on the web, and it works for my test PDF file. But when I run it over hundreds of PDFs, some of them trigger an ArrayIndexOutOfBoundsException. The details are listed below.
org.apache.pdfbox.util.PDFStreamEngine processOperator
Warning: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
    at java.util.Vector.get(Unknown Source)
    at org.apache.pdfbox.util.PDFTextStripper.processTextPosition(PDFTextStripper.java:1033)
    at org.apache.pdfbox.util.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:171)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:499)
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
    at PDFBox.main(PDFBox.java:45)
----------And my source code is----------------
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class Test {
    public static void main(String[] args) {
        try {
            PDDocument pddDocument = PDDocument.load(new File(args[0]));
            List allPages = pddDocument.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                int pageNum = i + 1;
                PDPage page = (PDPage) allPages.get(i);
                List<PDAnnotation> la = page.getAnnotations();
                if (la.size() < 1) {
                    continue;
                }
                PDAnnotation pdfAnnot;
                PDFTextStripperByArea stripper;
                PDRectangle rect;
                for (int k = 0; k < la.size(); k++) {
                    pdfAnnot = la.get(k);
                    if (pdfAnnot.getSubtype().equals("Highlight")) {
                        stripper = new PDFTextStripperByArea();
                        //stripper.setSortByPosition(true);
                        rect = pdfAnnot.getRectangle();
                        float x = rect.getLowerLeftX() - 1;
                        float y = rect.getUpperRightY() - 1;
                        float width = rect.getWidth() + 2;
                        float height = rect.getHeight() + rect.getHeight() / 4;
                        int rotation = page.findRotation();
                        if (rotation == 0) {
                            PDRectangle pageSize = page.findMediaBox();
                            y = pageSize.getHeight() - y;
                        }
                        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                        stripper.addRegion(Integer.toString(0), awtRect);
                        stripper.extractRegions(page);
                        System.out.println(stripper.getTextForRegion(Integer.toString(0)));
                        System.out.println(pdfAnnot.getContents());
                        System.out.println(Integer.toString(pageNum));
                        System.out.println(pdfAnnot.getSubtype());
                    } else if (pdfAnnot.getSubtype().equals("Text")) {
                        System.out.println(pdfAnnot.getContents());
                        System.out.println(Integer.toString(pageNum));
                        System.out.println(pdfAnnot.getSubtype());
                    }
                }
            }
            pddDocument.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
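
For reference, the region arithmetic in the code above boils down to flipping the annotation rectangle from PDF coordinates (origin at the bottom-left, y growing upwards) to java.awt coordinates (origin at the top-left, y growing downwards). Below is a minimal, standalone sketch of that flip. RegionMath and toTopLeftRegion are made-up names, not PDFBox API, and the sketch assumes an unrotated page whose media box starts at (0, 0). It pads evenly on all sides, which is not exactly what the code above does, and it is not meant as a fix for the exception.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class RegionMath {
    private RegionMath() {}

    /**
     * Converts a rectangle in PDF user space (bottom-left origin) into a
     * java.awt rectangle (top-left origin), padded by the same amount on
     * every side. Assumes rotation 0 and a media box starting at (0, 0).
     */
    static Rectangle2D.Float toTopLeftRegion(float llx, float lly,
                                             float urx, float ury,
                                             float pageHeight, float padding) {
        float x = llx - padding;
        // Distance from the top of the page down to the padded upper edge.
        float yTop = pageHeight - ury - padding;
        float width = (urx - llx) + 2 * padding;
        float height = (ury - lly) + 2 * padding;
        return new Rectangle2D.Float(x, yTop, width, height);
    }
}
```

For a highlight spanning (100, 700) to (200, 712) on a page 792 points high, with a padding of 1, this gives x = 99, y = 79, width = 102, height = 14.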
This error occurs only with certain PDF files; or, I should say, my code works for only some PDFs.
Thank you for your time. Your help is highly appreciated.
Luo
Re: Errors using extractRegions method
Posted by "A.M. Sabuncu" <am...@gmail.com>.
> And mention what version you use (the answer should be "1.8.8" :-))
Now that was funny! :-)
Re: Errors using extractRegions method
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
Please upload the PDF file to a public place. And mention what version
you use (the answer should be "1.8.8" :-))
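
In case it helps with the version question, here is a rough sketch of one way to print which version is actually on the classpath. It relies on the assumption that the PDFBox jar's manifest carries an Implementation-Version entry, which may not hold for every build. LibraryVersion and describe are made-up names, and the class is looked up by name so that the sketch compiles without PDFBox.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class LibraryVersion {
    private LibraryVersion() {}

    /**
     * Returns the Implementation-Version recorded for the package of the
     * given class, or a short explanation if it cannot be determined.
     */
    static String describe(String className) {
        try {
            // Look the class up by name, without running its static initializers.
            Class<?> c = Class.forName(className, false,
                    LibraryVersion.class.getClassLoader());
            Package p = c.getPackage();
            String version = (p == null) ? null : p.getImplementationVersion();
            return (version == null)
                    ? "unknown (no Implementation-Version in manifest)"
                    : version;
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }
}
```

Calling LibraryVersion.describe("org.apache.pdfbox.pdmodel.PDDocument") from the same program that shows the error should report the version of the jar that is really being loaded, if the manifest assumption holds.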
Tilman
On 04.01.2015 at 15:54, luo_123@msn.com wrote:
> org.apache.pdfbox.util.PDFStreamEngine processOperator
> Warning: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
>     at java.util.Vector.get(Unknown Source)
>     at org.apache.pdfbox.util.PDFTextStripper.processTextPosition(PDFTextStripper.java:1033)
>     at org.apache.pdfbox.util.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:171)
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:499)
>     at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
>     at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
>     at PDFBox.main(PDFBox.java:45)