Posted to users@pdfbox.apache.org by "luo_123@msn.com" <lu...@msn.com> on 2015/01/04 15:54:03 UTC

Errors using extractRegions method

Hello,

I am using PDFBox to extract annotations and remarks from PDF files, but I have run into something weird. I found a snippet on the web, and it works for my test PDF file. But when I run it on hundreds of PDFs, some of them cause an ArrayIndexOutOfBoundsException. Details are listed below.

org.apache.pdfbox.util.PDFStreamEngine processOperator
Warning: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
    at java.util.Vector.get(Unknown Source)
    at org.apache.pdfbox.util.PDFTextStripper.processTextPosition(PDFTextStripper.java:1033)
    at org.apache.pdfbox.util.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:171)
    at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:499)
    at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
    at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
    at PDFBox.main(PDFBox.java:45)

----------And my source code is----------------
import java.awt.geom.Rectangle2D; 
import java.io.File; 
import java.util.List; 
import org.apache.pdfbox.pdmodel.PDDocument; 
import org.apache.pdfbox.pdmodel.PDPage; 
import org.apache.pdfbox.pdmodel.common.PDRectangle; 
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation; 
import org.apache.pdfbox.util.PDFTextStripperByArea; 

public class Test {
    public static void main(String args[]) {
        try {
            PDDocument pddDocument = PDDocument.load(new File(args[0]));
            List allPages = pddDocument.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                int pageNum = i + 1;
                PDPage page = (PDPage) allPages.get(i);
                List<PDAnnotation> la = page.getAnnotations();
                if (la.size() < 1) {
                    continue;
                }

                PDAnnotation pdfAnnot;
                PDFTextStripperByArea stripper;
                PDRectangle rect;
                for (int k = 0; k < la.size(); k++) {
                    pdfAnnot = la.get(k);
                    if (pdfAnnot.getSubtype().equals("Highlight")) {
                        stripper = new PDFTextStripperByArea();
                        //stripper.setSortByPosition(true);

                        rect = pdfAnnot.getRectangle();
                        float x = rect.getLowerLeftX() - 1;
                        float y = rect.getUpperRightY() - 1;
                        float width = rect.getWidth() + 2;
                        float height = rect.getHeight() + rect.getHeight() / 4;

                        int rotation = page.findRotation();
                        if (rotation == 0) {
                            PDRectangle pageSize = page.findMediaBox();
                            y = pageSize.getHeight() - y;
                        }

                        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                        stripper.addRegion(Integer.toString(0), awtRect);
                        stripper.extractRegions(page);

                        System.out.println(stripper.getTextForRegion(Integer.toString(0)));
                        System.out.println(pdfAnnot.getContents());
                        System.out.println(Integer.toString(pageNum));
                        System.out.println(pdfAnnot.getSubtype());
                    } else if (pdfAnnot.getSubtype().equals("Text")) {
                        System.out.println(pdfAnnot.getContents());
                        System.out.println(Integer.toString(pageNum));
                        System.out.println(pdfAnnot.getSubtype());
                    }
                }
            }
            pddDocument.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
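
For reference, the rectangle arithmetic in the code above can be looked at in isolation. The following is only a minimal sketch of the coordinate flip, under the assumption that the annotation rectangle is given in PDF user space (origin at the bottom-left, y pointing up) and that the region rectangle is expected with a top-left origin (y pointing down). The class name RegionSketch, the helper toTopLeftOrigin and the pad parameter are made up for illustration and are not part of PDFBox; only java.awt.geom is used. Under that assumption, subtracting 1 from the upper-right y before flipping moves the top edge of the region down by one unit, so the sketch applies the padding after the flip instead. It is not a diagnosis of the exception.

```java
import java.awt.geom.Rectangle2D;

class RegionSketch {
    // Hypothetical helper, not a PDFBox API.
    // Input: rectangle corners in PDF user space (origin bottom-left, y up).
    // Output: the same rectangle with a top-left origin (y down),
    // grown by 'pad' units on every side.
    static Rectangle2D.Float toTopLeftOrigin(float llx, float lly,
                                             float urx, float ury,
                                             float pageHeight, float pad) {
        float x = llx - pad;
        // The top edge in PDF space is ury; measured from the top of the page
        // that is pageHeight - ury. Padding upwards means a smaller y.
        float y = pageHeight - ury - pad;
        float width = (urx - llx) + 2 * pad;
        float height = (ury - lly) + 2 * pad;
        return new Rectangle2D.Float(x, y, width, height);
    }

    public static void main(String[] args) {
        // Assumed example values: a highlight on a page that is 792 units high.
        Rectangle2D.Float r = toTopLeftOrigin(100f, 700f, 300f, 720f, 792f, 1f);
        System.out.println(r.x + " " + r.y + " " + r.width + " " + r.height);
        // prints "99.0 71.0 202.0 22.0"
    }
}
```

The flip in this sketch is applied unconditionally; the code above applies it only when the page rotation is 0, and rotated pages would need their own handling.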

This error occurs only with certain PDF files; in other words, my code works for only some PDFs.
Thank you for your time. Your help is highly appreciated.

Luo

Re: Errors using extractRegions method

Posted by "A.M. Sabuncu" <am...@gmail.com>.
> And mention what version you use (the answer should be "1.8.8" :-))

Now that was funny! :-)

Re: Errors using extractRegions method

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

Please upload the PDF file to a public place. And mention what version 
you use (the answer should be "1.8.8" :-))

Tilman

On 04.01.2015 at 15:54, luo_123@msn.com wrote:
> org.apache.pdfbox.util.PDFStreamEngine processOperator
> Warning: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1
>     at java.util.Vector.get(Unknown Source)
>     at org.apache.pdfbox.util.PDFTextStripper.processTextPosition(PDFTextStripper.java:1033)
>     at org.apache.pdfbox.util.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:171)
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:499)
>     at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:62)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
>     at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
>     at PDFBox.main(PDFBox.java:45)