Posted to dev@pdfbox.apache.org by Navendu Garg <ga...@gmail.com> on 2009/09/16 17:06:43 UTC

Extracting text wrapped by PDAnnotationLink

Hi,

I am having trouble extracting the text covered by a
PDAnnotationLink. First, a little background:

I am using the PDFTextStripper class to extract the bounding box of
each character on the page. Then I get the rectangle from the
PDAnnotationLink instance. Finally, I traverse the list of characters
and check which of them lie inside the link's bounding rectangle. This
works in most cases, but it fails in two scenarios:

a) The link text breaks at the end of one line and continues on the
next. The bounding rectangle then covers the entire text of both
lines, so my algorithm selects text that is not part of the link.
b) Sometimes a character's bounding rectangle lies outside the link's
bounding rectangle, even though the character visibly appears to be
inside the link. As a result, I am unable to select those characters.
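
To make the matching step concrete, here is a minimal sketch of one
possible variation on it, not a confirmed fix. It tests each
character's centre point rather than its whole box, with an optional
tolerance, which may help with (b). It also accepts a list of regions
instead of a single rectangle, which may help with (a) if per-line
rectangles are available (for example from the annotation's QuadPoints
entry, when the PDF supplies one). The names Glyph, hits and
textInRegions are hypothetical and not part of PDFBox. The sketch
assumes the character boxes and the link regions are already expressed
in the same coordinate space.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

final class LinkTextMatcher {

    /** Hypothetical holder for one extracted character and its bounding box. */
    static final class Glyph {
        final String text;
        final Rectangle2D box;

        Glyph(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /**
     * Returns true if the centre of glyphBox lies inside region after the
     * region has been grown by tolerance on every side.
     */
    static boolean hits(Rectangle2D region, Rectangle2D glyphBox, double tolerance) {
        Rectangle2D grown = new Rectangle2D.Double(
                region.getX() - tolerance,
                region.getY() - tolerance,
                region.getWidth() + 2 * tolerance,
                region.getHeight() + 2 * tolerance);
        return grown.contains(glyphBox.getCenterX(), glyphBox.getCenterY());
    }

    /**
     * Concatenates, in input order, the text of every glyph whose centre
     * falls inside at least one of the given regions.
     */
    static String textInRegions(List<Glyph> glyphs, List<Rectangle2D> regions,
                                double tolerance) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            for (Rectangle2D r : regions) {
                if (hits(r, g.box, tolerance)) {
                    sb.append(g.text);
                    break; // count each glyph once, even if regions overlap
                }
            }
        }
        return sb.toString();
    }
}
```

With a single region this behaves like the original rectangle test,
except that a character whose box pokes slightly outside the link
rectangle is still selected as long as its centre is inside. With one
region per line, a character that sits inside the overall bounding
rectangle but outside every per-line region is skipped.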

Does anyone have a better idea about how to approach this problem?

thanks,

Navendu