Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/10 23:22:33 UTC

[jira] [Closed] (PDFBOX-577) TextPosition should expose its bounding box

     [ https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-577.
------------------------------
    Resolution: Invalid

The Ascent and Descent values in the PDF dictionary are **not** used when computing glyph positions, and it is common for these values to be missing or invalid. What is actually wanted is the BBox value, but that is just as often missing or invalid.

If somebody wants to tackle this problem in the future, it can be done fairly easily in 2.0 with the new APIs provided by PDFont, which can extract the BBox from the embedded or substituted font, or even compute exact bounds from the glyph outlines. A new issue or patch addressing this is welcome.
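
For anyone picking this up, here is a rough sketch of the arithmetic involved once a font-level BBox is available. It is not PDFBox code: the class and method names are made up for illustration, the four BBox numbers are assumed to have been read from the font already, and the usual 1/1000 glyph-space scale is assumed (Type 3 fonts and any text or transformation matrix are ignored). Coordinates are y-up, as in PDF user space.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
public class GlyphBoundsSketch {

    /**
     * Scales a font BBox given in glyph space (assumed 1/1000 of the font
     * size) and places it at a glyph origin on the baseline.
     * The returned rectangle's (x, y) is the lower-left corner in a
     * y-up coordinate system.
     */
    public static Rectangle2D.Double fontBoxAt(double originX, double baselineY,
                                               double llx, double lly,
                                               double urx, double ury,
                                               double fontSize) {
        double scale = fontSize / 1000.0;
        return new Rectangle2D.Double(
                originX + llx * scale,   // left edge (llx is often negative)
                baselineY + lly * scale, // bottom edge, below the baseline when lly < 0
                (urx - llx) * scale,     // width
                (ury - lly) * scale);    // height, descender region included
    }

    public static void main(String[] args) {
        // Illustrative BBox values only; real values come from the font.
        Rectangle2D.Double box = fontBoxAt(100, 700, -166, -225, 1000, 931, 10);
        System.out.println(box);
    }
}
```

Note that a font-level BBox covers every glyph in the font, so the result is a conservative box rather than the exact bounds of one glyph; exact bounds would need the glyph outlines, as mentioned above.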

> TextPosition should expose its bounding box
> -------------------------------------------
>
>                 Key: PDFBOX-577
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-577
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>            Reporter: Villu Ruusmann
>         Attachments: 0001-PDFont.java-Add-methods-to-retreive-the-Ascent-and-D.patch, AFM-getHeight.png, AFM-getUpperRightY.png, textposition-randombg.zip
>
>
> It does not seem to be possible to calculate the bounding box of a TextPosition.
> IIUC, TextPosition#getY is the baseline of the text and TextPosition#getHeight is the absolute height of the text. When I subtract the latter from the former I get a top line, but this is only correct if the text does not contain descender characters.
> Below is a screenshot (AFM-getHeight.png) which shows the bounding boxes of TextPositions calculated as {#getX(), #getY() - #getHeight(), #getWidth(), #getHeight()} painted in random colors. For example, the bounding boxes of parentheses are severely misplaced, which makes line-by-line text extraction impossible.
> Right now I've solved the problem by tweaking AFM FontMetrics code so that it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight when queried via PDSimpleFont#getFontHeight(byte[], int, int). Another screenshot (AFM-getUpperRightY.png) shows how this restores the previously broken text extraction ability.
> It seems like a good idea to rework TextPosition so that it would be aware of its bounding box:
> *) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and PDSimpleFont#getFontHeight(byte[], int, int) with a single method PDSimpleFont#getFontBoundingBox(byte[], int, int)
> *) Replace the constructor TextPosition(Matrix, Matrix) with TextPosition(Matrix, BoundingBox)
> *) Add new methods TextPosition#getBoundingBox, TextPosition#getBoundingBoxDir. This shouldn't affect existing application clients, because TextPosition#getY and TextPosition#getHeight remain in place.
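
To make the geometry in the report concrete, here is a minimal sketch comparing the box described there with one that accounts for descenders. It uses plain numbers rather than PDFBox types; the class name, method names, and the separate ascent/descent parameters are assumptions for illustration only. Coordinates are y-down (top of page is y = 0), matching the report's "baseline minus height gives the top line".

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
public class TextBoxSketch {

    /** The box from the report: it ends exactly at the baseline. */
    public static Rectangle2D.Double naiveBox(double x, double baselineY,
                                              double width, double height) {
        return new Rectangle2D.Double(x, baselineY - height, width, height);
    }

    /**
     * A box that starts 'ascent' above the baseline and extends 'descent'
     * below it. Both values are assumed to be non-negative distances
     * already scaled to the same units as x and baselineY.
     */
    public static Rectangle2D.Double descenderAwareBox(double x, double baselineY,
                                                       double width,
                                                       double ascent, double descent) {
        return new Rectangle2D.Double(x, baselineY - ascent, width, ascent + descent);
    }

    public static void main(String[] args) {
        System.out.println(naiveBox(10, 100, 20, 12));
        System.out.println(descenderAwareBox(10, 100, 20, 9, 3));
    }
}
```

With the same total height of 12, the first box is shifted up by the descent and stops at the baseline, while the second reaches 3 units below it, which is where characters such as parentheses extend.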



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)