Posted to dev@pdfbox.apache.org by CM Reddy <ma...@netisoftware.com> on 2015/09/16 13:57:25 UTC
Not able to read the exact text highlighted across the lines.
Hi All,
I am working on reading highlighted text from a PDF document using PDFBox. I
am able to read highlighted text that sits on a single line, whether it is
one word or several. However, I cannot read highlighted text that spans
multiple lines. Below is the sample code I use to read the highlighted
text.
<code>
// Imports for PDFBox 1.8.x, which this snippet targets (getAllPages, findRotation).
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.util.PDFTextStripperByArea;

PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
List<?> allPages = pddDocument.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++) {
    int pageNum = i + 1;
    PDPage page = (PDPage) allPages.get(i);
    List<PDAnnotation> la = page.getAnnotations();
    if (la.isEmpty()) {
        continue;
    }
    System.out.println("Page number : " + pageNum);
    for (PDAnnotation pdfAnnot : la) {
        // Popup annotations only hold the comment window, not a text region.
        if ("Popup".equals(pdfAnnot.getSubtype())) {
            continue;
        }
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);

        // Bounding rectangle of the whole annotation, slightly enlarged.
        PDRectangle rect = pdfAnnot.getRectangle();
        float x = rect.getLowerLeftX() - 1;
        float y = rect.getUpperRightY() - 1;
        float width = rect.getWidth();
        float height = rect.getHeight() + rect.getHeight() / 4;

        // PDF coordinates start at the bottom-left; the stripper region starts at the top-left.
        int rotation = page.findRotation();
        if (rotation == 0) {
            PDRectangle pageSize = page.getMediaBox();
            y = pageSize.getHeight() - y;
        }

        Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
        stripper.addRegion(Integer.toString(0), awtRect);
        stripper.extractRegions(page);

        System.out.println("------------------------------------------------------------------");
        System.out.println("Annot type = " + pdfAnnot.getSubtype());
        System.out.println("Getting text from region = "
                + stripper.getTextForRegion(Integer.toString(0)) + "\n");
        System.out.println("Getting text from comment = " + pdfAnnot.getContents());
    }
}
pddDocument.close();
</code>
When the highlight spans several lines, "pdfAnnot.getRectangle()" returns
the smallest rectangle that encloses the whole highlight, so the region
passed to the stripper yields more text than was actually highlighted. I
could not find any API to extract exactly the highlighted text.
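
For reference, a highlight annotation in a PDF normally also carries a
QuadPoints array with one quadrilateral (8 floats) per highlighted line, and
PDFBox exposes it through PDAnnotationTextMarkup.getQuadPoints(). The sketch
below is only an illustration of how such an array might be split into one
region per line instead of one bounding rectangle. QuadPointRegions and
toRegions are hypothetical names, not part of PDFBox, and the sketch assumes a
page rotation of 0 and a media box whose origin is (0, 0), as the code above
does.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
final class QuadPointRegions {

    private QuadPointRegions() {
    }

    /**
     * Turns a QuadPoints array (8 floats per quadrilateral, PDF user space,
     * origin at the bottom-left) into one rectangle per quadrilateral, with
     * the y axis flipped so the origin is at the top-left.
     * Assumes rotation 0 and a media box starting at (0, 0).
     */
    static List<Rectangle2D.Float> toRegions(float[] quadPoints, float pageHeight) {
        List<Rectangle2D.Float> regions = new ArrayList<>();
        if (quadPoints == null) {
            return regions;
        }
        // Trailing values that do not form a full quadrilateral are ignored.
        for (int i = 0; i + 8 <= quadPoints.length; i += 8) {
            float minX = Float.MAX_VALUE;
            float minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE;
            float maxY = -Float.MAX_VALUE;
            // Use min/max so the result does not depend on the corner order.
            for (int j = 0; j < 8; j += 2) {
                float px = quadPoints[i + j];
                float py = quadPoints[i + j + 1];
                minX = Math.min(minX, px);
                maxX = Math.max(maxX, px);
                minY = Math.min(minY, py);
                maxY = Math.max(maxY, py);
            }
            regions.add(new Rectangle2D.Float(
                    minX, pageHeight - maxY, maxX - minX, maxY - minY));
        }
        return regions;
    }
}
```

Each returned rectangle could then be registered under its own name with
stripper.addRegion(...), and the per-region results of
stripper.getTextForRegion(...) joined in order. Whether the rectangles need
the same small padding as in the code above would have to be checked against
real documents.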
Any help will be highly appreciated.
- Thanks
CM Reddy