Posted to users@pdfbox.apache.org by Andrew Munn <an...@nmedia.net> on 2015/04/25 07:24:16 UTC

pdfbox gives ArrayIndexOutOfBounds in PDFTextStripper

Processing this doc:

http://www.oracle.com/technetwork/java/jaf-1-150219.pdf

I am getting this:

x=33
y=159
w=216
h=43
page=1
getting text from page #1 of 21 in doc.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 3
	at java.util.Vector.get(Vector.java:748)
	at org.apache.pdfbox.text.PDFTextStripper.processTextPosition(PDFTextStripper.java:903)
	at org.apache.pdfbox.text.PDFTextStripperByArea.processTextPosition(PDFTextStripperByArea.java:132)
	at org.apache.pdfbox.text.PDFTextStreamEngine.showGlyph(PDFTextStreamEngine.java:229)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.showText(PDFStreamEngine.java:690)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.showTextStrings(PDFStreamEngine.java:600)
	at org.apache.pdfbox.contentstream.operator.text.ShowTextAdjusted.process(ShowTextAdjusted.java:38)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:802)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:464)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
	at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
	at org.apache.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:117)
	at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:347)
	at org.apache.pdfbox.text.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:113)

Code is:

    String textFromBox(PDDocument doc, int x, int y, int w, int h, int page)
            throws IOException {
        System.out.println("x=" + x);
        System.out.println("y=" + y);
        System.out.println("w=" + w);
        System.out.println("h=" + h);
        System.out.println("page=" + page);
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        Rectangle rect = new Rectangle(x, y - h, w, h);
        stripper.addRegion("region", rect);
        int pageCount = doc.getDocumentCatalog().getPages().getCount();
        System.out.println("getting text from page #" + page + " of "
                + pageCount + " in doc.");
        if (page <= pageCount) {
            PDPage pp = (PDPage) doc.getDocumentCatalog().getPages().get(page - 1);
            stripper.extractRegions(pp);
            String text = stripper.getTextForRegion("region");
            System.out.println("text=" + text);
            return text;
        } else {
            return "No page #" + page;
        }
    }
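
For reference, the region handed to addRegion above is built as new Rectangle(x, y - h, w, h). The following is only a small sketch, using nothing but java.awt.Rectangle, of what that expression works out to for the logged values (x=33, y=159, w=216, h=43). The class name RegionBox, the helper regionFor and the sample points are hypothetical and not part of the original code.

```java
import java.awt.Rectangle;

public class RegionBox {

    // Mirrors the expression used in textFromBox: new Rectangle(x, y - h, w, h).
    // (Hypothetical helper, introduced only for this illustration.)
    static Rectangle regionFor(int x, int y, int w, int h) {
        return new Rectangle(x, y - h, w, h);
    }

    public static void main(String[] args) {
        // Values taken from the log output in this message.
        Rectangle rect = regionFor(33, 159, 216, 43);

        // The rectangle starts at (33, 116) and is 216 wide and 43 high.
        System.out.println(rect);

        // The top-left corner is inside the region.
        System.out.println(rect.contains(33, 116));   // true

        // The far corner (x + w, y) is outside: the right and bottom edges
        // are exclusive in Rectangle.contains.
        System.out.println(rect.contains(249, 159));  // false
    }
}
```

Whether y - h is the right origin depends on which coordinate system the caller's y value comes from, which this message does not state.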


Thanks!



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: pdfbox gives ArrayIndexOutOfBounds in PDFTextStripper

Posted by Tilman Hausherr <TH...@t-online.de>.
I have opened a new issue:
https://issues.apache.org/jira/browse/PDFBOX-2775

This will be tricky... there are almost never problems with text 
extraction (apart from fonts).

Tilman





Re: pdfbox gives ArrayIndexOutOfBounds in PDFTextStripper

Posted by Andrew Munn <an...@nmedia.net>.
Cool.  I will pull down a new snapshot soon.
Thanks




Re: pdfbox gives ArrayIndexOutOfBounds in PDFTextStripper

Posted by Tilman Hausherr <TH...@t-online.de>.
On 27.04.2015 at 00:33, Andrew Munn wrote:
> On Mon, 27 Apr 2015, Tilman Hausherr wrote:
>> try
>> stripper.setShouldSeparateByBeads(false)
>> do you get what you need?
> Thanks.  I will check it out.  That was just one of several PDFs I was
> doing some testing with and that one happened to generate that out of
> bounds exception.

I have researched this a bit more and disabled ShouldSeparateByBeads in 
the area stripping class (PDFTextStripperByArea).
https://issues.apache.org/jira/browse/PDFBOX-2775
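
For builds that predate this change, the workaround suggested earlier in the thread can be applied by hand. The following is only a sketch, assuming a PDFBox 2.0.x jar on the classpath (PDDocument.load(File) is the 2.0.x loading call). The class name RegionTextWorkaround, the helper textFromRegion and the command-line argument are hypothetical; the region values are the ones from the original report.

```java
import java.awt.Rectangle;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class RegionTextWorkaround {

    // Hypothetical helper: returns the text inside rect on a 1-based page.
    static String textFromRegion(PDDocument doc, Rectangle rect, int page)
            throws IOException {
        int pageCount = doc.getNumberOfPages();
        if (page < 1 || page > pageCount) {
            throw new IllegalArgumentException("No page #" + page);
        }
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        // The workaround from this thread: turn off bead (article thread)
        // separation before extracting the regions.
        stripper.setShouldSeparateByBeads(false);
        stripper.addRegion("region", rect);
        PDPage pp = doc.getDocumentCatalog().getPages().get(page - 1);
        stripper.extractRegions(pp);
        return stripper.getTextForRegion("region");
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to a local copy of the PDF.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            // x=33, y=159, w=216, h=43 as logged in the original report.
            Rectangle rect = new Rectangle(33, 159 - 43, 216, 43);
            System.out.println(textFromRegion(doc, rect, 1));
        }
    }
}
```

With a snapshot that already contains the PDFBOX-2775 change, the explicit setShouldSeparateByBeads(false) call should be redundant but harmless.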




Re: pdfbox gives ArrayIndexOutOfBounds in PDFTextStripper

Posted by Andrew Munn <an...@nmedia.net>.
On Mon, 27 Apr 2015, Tilman Hausherr wrote:
> try
> stripper.setShouldSeparateByBeads(false)
> do you get what you need?

Thanks.  I will check it out.  That was just one of several PDFs I was 
doing some testing with and that one happened to generate that out of 
bounds exception.

-Andrew




Re: pdfbox gives ArrayIndexOutOfBounds in PDFTextStripper

Posted by Tilman Hausherr <TH...@t-online.de>.
Try

stripper.setShouldSeparateByBeads(false);

Do you get what you need?

Tilman


