Posted to users@pdfbox.apache.org by Andreas Lehmkuehler <an...@lehmi.de> on 2010/05/01 13:45:50 UTC

Re: Coordinate system for text

Hi,

Michael Howard wrote:
> I have a question about the coordinate system orientation for text.
> 
> Using PrintTextLocations and the ExtractTextByArea example, I have
> observed that the coordinate system for the position of the text has
> the Y coordinate running down the page.
> 
> I was surprised by this because ExtractImageLocations reports images
> with the origin being at the lower left.
I'm not sure whether the PrintImageLocations example works correctly; see
PDFBOX-585 for further details. [1]

> My quick browsing of the pdf spec at
> http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf shows
> examples of text with the origin being defined from the lower left
> corner of the page.
That's correct.

> I tried it on multiple .pdf documents from different sources to ensure
> that there wasn't something strange with my .pdf files.
> 
> I didn't find any discussion of this in the email archives.
> 
> Any comments or explanation about why the Y coordinate system runs
> down the page would be helpful.
Both PrintTextLocations and ExtractTextByArea use the PDFTextStripper
class. It uses the rendering code to extract the text, and the renderer itself
uses Java2D to show each page. Because Java2D takes the upper left corner as
the (0,0) origin, the Y coordinate runs down the page.
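
For readers who need to move between the two orientations, here is a minimal
sketch of the arithmetic. It is not PDFBox code: the class YAxisFlip and its
methods are hypothetical names, and it assumes an unrotated page whose lower
left corner sits at (0,0), with the page height supplied by the caller in the
same units.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class YAxisFlip {
    private YAxisFlip() {}

    /**
     * Flips a Y value between a top-left origin (Y down) and a
     * bottom-left origin (Y up). The same formula works in both directions.
     * Assumes an unrotated page with its lower left corner at (0,0).
     */
    static float flipY(float y, float pageHeight) {
        return pageHeight - y;
    }

    /**
     * Converts a Y-down rectangle, whose (x, y) is its upper left corner,
     * into a Y-up rectangle, whose (x, y) is its lower left corner.
     */
    static Rectangle2D.Float toPdfRect(Rectangle2D.Float r, float pageHeight) {
        return new Rectangle2D.Float(r.x, pageHeight - r.y - r.height, r.width, r.height);
    }
}
```

For example, on a page 792 units high, a text position reported at Y = 100
(measured down from the top) corresponds to Y = 692 measured up from the
bottom. Pages with a rotation or a shifted media box would need more than
this single subtraction.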

We should probably just improve the examples mentioned above so that they
calculate/process both coordinate systems (PDF and Java2D).

WDYT?

BR
Andreas Lehmkühler


[1] https://issues.apache.org/jira/browse/PDFBOX-585

Re: Coordinate system for text

Posted by Michael Howard <mi...@uforlife.com>.
On Sat, May 1, 2010 at 7:45 AM, Andreas Lehmkuehler <an...@lehmi.de> wrote:
> Hi,
>
> Michael Howard wrote:
>>
>> I have a question about the coordinate system orientation for text.
>>
>> Using PrintTextLocations and the ExtractTextByArea example, I have
>> observed that the coordinate system for the position of the text has
>> the Y coordinate running down the page.
>>
>> I was surprised by this because ExtractImageLocations reports images
>> with the origin being at the lower left.
>
> I'm not sure whether the PrintImageLocations example works correctly; see
> PDFBOX-585 for further details. [1]

It is true that PrintImageLocations has problems with the width and
height, but it correctly reports the x,y coordinates of the lower left
corner of the embedded images.

I have become familiar with PrintImageLocations and have some
understanding of the errors in the image width and height
calculations. We can discuss that in a separate thread if you would
like.

<snip>

>> Any comments or explanation about why the Y coordinate system runs
>> down the page would be helpful.
>
> Both PrintTextLocations and ExtractTextByArea are using the PDFTextStripper
> class. It uses the rendering code to extract the text and the renderer
> itself uses Java2D to show each page. As Java2D uses the upper left corner
> as 0,0 reference the Y coordinate runs down the page.

OK, that is a good explanation as to why the text extraction routines
have the coordinate system running down the page.

I observe that the units are still PDF units (72 per inch); only the Y
axis runs down the page.
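
To make the unit point concrete, here is a small sketch of converting the
reported coordinates to physical lengths. It assumes the default PDF user
space unit of 1/72 inch; the class and method names are made up for this
illustration and are not PDFBox API.

```java
/** Hypothetical helper for illustration only; not part of PDFBox. */
final class PdfUnits {
    /** Default PDF user space: 72 units per inch. */
    static final float UNITS_PER_INCH = 72f;

    private PdfUnits() {}

    /** Converts a length in default PDF units to inches. */
    static float toInches(float units) {
        return units / UNITS_PER_INCH;
    }

    /** Converts a length in default PDF units to millimetres (25.4 mm per inch). */
    static float toMillimetres(float units) {
        return units / UNITS_PER_INCH * 25.4f;
    }
}
```

A US Letter page of 612 x 792 units therefore works out to 8.5 x 11 inches,
whichever way the Y axis runs.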

> Probably we should just improve the mentioned examples to
> calculate/process both coordinate systems (PDF and Java2D).
>
> WDYT?

Yes, I think it would be best if the text extraction routines could
work with the PDF coordinate system orientation.

I think that if the coordinate unit is 1/72 inch, then it would be
clearer to users if we consistently maintained the PDF coordinate
system for both text and images.

At this point pdfbox needs to support the existing users who have
built text extraction code with the Y-down orientation, so we should
support both.

I am not yet familiar enough with the pdfbox code base to recommend
whether supporting both Y-axis orientations should be done through
different methods or with a flag/setting that changes the behavior of
the existing methods.
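
Purely to illustrate the flag/setting variant, here is a rough sketch. None
of these names exist in pdfbox; TextPositionView and YAxis are hypothetical,
and the sketch assumes an unrotated page with a known height. The default
stays Y-down so that existing callers would see no change.

```java
/** Hypothetical sketch of a flag-based approach; not part of pdfbox. */
final class TextPositionView {
    /** Which way the reported Y axis runs. */
    enum YAxis { DOWN, UP }

    private final float pageHeight;
    private YAxis orientation = YAxis.DOWN; // keeps today's behavior by default

    TextPositionView(float pageHeight) {
        this.pageHeight = pageHeight;
    }

    /** The setting that switches the orientation of reported coordinates. */
    void setOrientation(YAxis orientation) {
        this.orientation = orientation;
    }

    /** Takes a Y value measured down from the top and reports it per the setting. */
    float reportY(float yDown) {
        return orientation == YAxis.UP ? pageHeight - yDown : yDown;
    }
}
```

The alternative of separate methods would expose both values side by side
instead of switching one method's meaning; which of the two fits the code
base better is exactly the open question.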

Another related factor is the coordinate system that is used when pdf
documents are generated using pdfbox. Thus far I have only been
reading pdf documents, not generating them. Therefore I do not know
which coordinate system is used when pdf documents are generated.
However it seems to me that the generation and retrieval sides should
use a consistent coordinate system orientation.



Thanks for all of your work. Let me know how I can help.

Michael

> BR
> Andreas Lehmkühler
>
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-585