Posted to dev@pdfbox.apache.org by CM Reddy <ma...@netisoftware.com> on 2015/09/16 13:57:25 UTC

Not able to read the exact text highlighted across the lines.

Hi All,
I am working on reading highlighted text from a PDF document using
PDFBox. I can read highlighted text that sits on a single line, whether
it is one word or several. However, I cannot read highlighted text that
spans multiple lines. Below is the sample code I use to read the
highlighted text.

<code>
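// Imports for PDFBox 1.8.x (this snippet uses the 1.8 API:
// getAllPages(), findRotation()).
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
List allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.size() < 1) {
        continue;
    }
    System.out.println("Page number : " + pageNum);
    for (PDAnnotation pdfAnnot : la) {
        // Popup annotations hold the comment window, not page text.
        if ("Popup".equals(pdfAnnot.getSubtype())) {
            continue;
        }

        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);

        // Bounding rectangle of the whole annotation, in PDF
        // coordinates (origin at the bottom-left of the page).
        PDRectangle rect = pdfAnnot.getRectangle();
        float x = rect.getLowerLeftX() - 1;
        float y = rect.getUpperRightY() - 1;
        float width = rect.getWidth();
        float height = rect.getHeight() + rect.getHeight() / 4;

        // The stripper expects a top-left origin, so flip the y axis.
        int rotation = page.findRotation();
        if (rotation == 0) {
            PDRectangle pageSize = page.getMediaBox();
            y = pageSize.getHeight() - y;
        }

        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
        stripper.addRegion(Integer.toString(0), awtRect);
        stripper.extractRegions(page);
        System.out.println("------------------------------------------------------------------");
        System.out.println("Annot type = " + pdfAnnot.getSubtype());
        System.out.println("Getting text from region = "
                + stripper.getTextForRegion(Integer.toString(0)) + "\n");
        System.out.println("Getting text from comment = "
                + pdfAnnot.getContents());
    }
}
pddDocument.close();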
</code>

When the highlight spans multiple lines, "pdfAnnot.getRectangle()"
returns the smallest rectangle that encloses all of the highlighted
lines. Extracting text from that rectangle returns more text than was
actually highlighted. I could not find any API to extract exactly the
highlighted text.
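
To make the problem concrete, here is a small standalone sketch of the
geometry. It does not use PDFBox. The class name HighlightBoundsSketch,
its helper methods, and the coordinates are made up for illustration
and do not come from my PDF. It assumes a highlight that starts in the
middle of one line and ends in the middle of the next, and compares the
area of the two highlighted pieces with the area of the single
rectangle that encloses both.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

public class HighlightBoundsSketch {

    // Smallest rectangle that encloses every per-line piece of a highlight.
    static Rectangle2D boundingBox(List<Rectangle2D> pieces) {
        Rectangle2D box = pieces.get(0);
        for (int i = 1; i < pieces.size(); i++) {
            box = box.createUnion(pieces.get(i));
        }
        return box;
    }

    // Sum of the areas of the individual pieces.
    static double totalArea(List<Rectangle2D> pieces) {
        double sum = 0;
        for (Rectangle2D piece : pieces) {
            sum += piece.getWidth() * piece.getHeight();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical coordinates in points, top-left origin:
        // the highlight covers the tail of line 1 and the head of line 2.
        List<Rectangle2D> pieces = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Float(300, 100, 200, 12),
                new Rectangle2D.Float(100, 114, 150, 12));

        Rectangle2D box = boundingBox(pieces);
        double boxArea = box.getWidth() * box.getHeight();

        System.out.println("Bounding box     = " + box);
        System.out.println("Highlighted area = " + totalArea(pieces));
        System.out.println("Bounding area    = " + boxArea);
    }
}
```

The enclosing rectangle spans the full width of both lines, so a region
built from it also picks up the unhighlighted start of the first line
and the unhighlighted end of the second. That matches the extra text I
get from the stripper.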

Any help will be highly appreciated.
- Thanks
CM Reddy
