Posted to users@pdfbox.apache.org by Leandro de Oliveira <le...@yahoo.com.br> on 2010/02/03 20:11:54 UTC

Conversion to display units

Hi,

I'm using PDFTextStripper to get text from a PDF document, but I need to get text only from some regions in the PDF. I know these regions are drawn using the "re" operator, which draws a rectangle using x, y, width and height as arguments. How do I convert these four arguments to display units so I can compare them with TextPosition.getX()?

Thank you



Re: Conversion to display units

Posted by Leandro de Oliveira <le...@yahoo.com.br>.
I'm doing as you said: first I find the rectangular areas, converting their coordinates to display units, and then I get the text from them.

Thank you

--- On Thu, 4 Feb 2010, Villu Ruusmann <vi...@gmail.com> wrote:

> From: Villu Ruusmann <vi...@gmail.com>
> Subject: Re: Conversion to display units
> To: users@pdfbox.apache.org
> Date: Thursday, 4 February 2010, 9:07
>
> Hello there,
>
> >
> > I'm using PDFTextStripper to get text from a PDF document but I need
> > to get text only from some regions in the PDF. I know these regions
> > are being drawn using the "re" operator which draws a rectangle using
> > x,y,width,height as arguments. How do I convert these four arguments
> > to display units so I can compare them with the TextPosition.getX()?
> >
>
> The PDF "re" operator is handled by class
> org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath. As
> the package name indicates, this class is meant to be used from within
> the PageDrawer utility, not from within the PDFTextStripper utility.
> If you take a look at this class, you will see that the actual
> transformation is implemented in the method
> org.apache.pdfbox.pdfviewer.PageDrawer#transformedPoint(double,
> double).
>
> If I were given a similar task, I would make two passes over the PDF
> document. First, I would use the PageDrawer utility to capture the
> rectangular areas (simply override #fillPath(int) and/or #strokePath,
> and grab #getLinePath there). Then I would use PDFTextStripper (or
> better yet, PDFTextStripperByArea) to extract text from the
> previously captured rectangular areas.
>
>
> VR
>



Re: Conversion to display units

Posted by Villu Ruusmann <vi...@gmail.com>.
Hello there,

>
> I'm using PDFTextStripper to get text from a PDF document but I need to get text only from some regions in the PDF. I know these regions are being drawn using the "re" operator which draws a rectangle using x,y,width,height as arguments. How do I convert these four arguments to display units so I can compare them with the TextPosition.getX()?
>

The PDF "re" operator is handled by class
org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath. As
the package name indicates, this class is meant to be used from within
the PageDrawer utility, not from within the PDFTextStripper utility.
If you take a look at this class, you will see that the actual
transformation is implemented in the method
org.apache.pdfbox.pdfviewer.PageDrawer#transformedPoint(double,
double).
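
To make the idea of that transformation concrete, here is a minimal
sketch using only java.awt.geom. It is not the PageDrawer code itself.
It assumes that the text positions you compare against are measured
from the top-left corner of the page with y growing downward, and that
you know the page height and the current transformation matrix (CTM)
in effect when "re" was executed. The class name ReToDisplay and the
method toDisplay are hypothetical names chosen for this example.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

public class ReToDisplay {

    /**
     * Maps the four "re" operands (PDF user space: origin at the bottom-left,
     * y growing upward) to a top-left-origin space with y growing downward.
     *
     * Assumptions of this sketch: ctm is the transformation matrix in effect
     * when "re" was executed, and pageHeight is the page height in the same
     * units. Page rotation and crop box offsets are ignored.
     */
    static Rectangle2D toDisplay(double x, double y, double width, double height,
                                 AffineTransform ctm, double pageHeight) {
        // "re" may carry a negative width or height, so normalise first.
        Rectangle2D user = new Rectangle2D.Double(
                Math.min(x, x + width), Math.min(y, y + height),
                Math.abs(width), Math.abs(height));

        // Flip the y axis: y' = pageHeight - y.
        AffineTransform toTopLeft = new AffineTransform(1, 0, 0, -1, 0, pageHeight);
        // concatenate() makes the CTM apply first and the flip second.
        toTopLeft.concatenate(ctm);

        // Bounding box of the transformed rectangle.
        return toTopLeft.createTransformedShape(user).getBounds2D();
    }

    public static void main(String[] args) {
        // A rectangle drawn with "100 700 200 50 re" on a 792 point high page,
        // with an identity CTM (an assumption made for this example).
        Rectangle2D r = toDisplay(100, 700, 200, 50, new AffineTransform(), 792);
        System.out.println(r);
    }
}
```

With the rectangle in that space, a check such as
r.contains(textX, textY) becomes meaningful, provided the text
coordinates really use the same origin and direction.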

If I were given a similar task, I would make two passes over the PDF
document. First, I would use the PageDrawer utility to capture the
rectangular areas (simply override #fillPath(int) and/or #strokePath,
and grab #getLinePath there). Then I would use PDFTextStripper (or
better yet, PDFTextStripperByArea) to extract text from the
previously captured rectangular areas.
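
The second pass boils down to a containment test between text
positions and the captured rectangles. The sketch below illustrates
only that idea with plain Java. It does not use PDFBox, and the Glyph
class is a hypothetical stand-in for whatever x, y and character data
you collect from the text positions. It also assumes the rectangles
and the text coordinates are already in the same coordinate space.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RegionTextFilter {

    /** Hypothetical stand-in for one text position: coordinates plus its text. */
    static final class Glyph {
        final double x;
        final double y;
        final String text;

        Glyph(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Collects, for every named region, the text of the glyphs whose position
     * lies inside it. Glyphs are appended in the order they are given.
     */
    static Map<String, String> textByRegion(Map<String, Rectangle2D> regions,
                                            List<Glyph> glyphs) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rectangle2D> region : regions.entrySet()) {
            StringBuilder text = new StringBuilder();
            for (Glyph glyph : glyphs) {
                if (region.getValue().contains(glyph.x, glyph.y)) {
                    text.append(glyph.text);
                }
            }
            result.put(region.getKey(), text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("box", new Rectangle2D.Double(100, 42, 200, 50));

        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph(110, 60, "H"));   // inside the region
        glyphs.add(new Glyph(120, 60, "i"));   // inside the region
        glyphs.add(new Glyph(10, 10, "x"));    // outside the region

        System.out.println(textByRegion(regions, glyphs));
    }
}
```

PDFTextStripperByArea takes care of this matching for you once the
regions are registered, so hand-written filtering like this is mainly
useful for checking that the captured rectangles and the text really
share one coordinate space.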


VR