Posted to dev@pdfbox.apache.org by "Villu Ruusmann (JIRA)" <ji...@apache.org> on 2009/12/04 20:58:20 UTC

[jira] Created: (PDFBOX-577) TextPosition should expose its bounding box

TextPosition should expose its bounding box
-------------------------------------------

                 Key: PDFBOX-577
                 URL: https://issues.apache.org/jira/browse/PDFBOX-577
             Project: PDFBox
          Issue Type: Improvement
            Reporter: Villu Ruusmann


It does not seem to be possible to calculate the bounding box of a TextPosition.

IIUC, TextPosition#getY is the baseline of the text and TextPosition#getHeight is the absolute height of the text. When I subtract the latter from the former I get a top line, but this is only correct if the text contains no descender characters: the height also covers the part of a glyph below the baseline, so the box ends up shifted upward.

Below is a screenshot (AFM-getHeight.png) showing the bounding boxes of TextPositions, calculated as {#getX(), #getY() - #getHeight(), #getWidth(), #getHeight()} and painted in random colors. For example, the bounding boxes of parentheses are severely misplaced, which makes line-by-line text extraction impossible.

Right now I've worked around the problem by tweaking the AFM FontMetrics code so that it returns BoundingBox#getUpperRightY instead of BoundingBox#getHeight when queried via PDSimpleFont#getFontHeight(byte[], int, int). Another screenshot (AFM-getUpperRightY.png) shows how this restores the previously broken text extraction.
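
To make the difference concrete, here is a minimal, self-contained sketch that does not use PDFBox at all. The class and method names (DescenderBoxDemo, GlyphMetrics, naiveSpan, ascentSpan) are hypothetical, the glyph metrics are illustrative values in 1/1000 em units rather than data read from a font file, and a page space where y grows downward is assumed. It compares the span obtained from the full glyph height with the span obtained from the upper-right Y, extended below the baseline by the descender.

```java
public class DescenderBoxDemo {

    // Vertical glyph extents relative to the baseline, in 1/1000 em units.
    // lowerLeftY is negative when the glyph has a descender.
    static final class GlyphMetrics {
        final double lowerLeftY;
        final double upperRightY;

        GlyphMetrics(double lowerLeftY, double upperRightY) {
            this.lowerLeftY = lowerLeftY;
            this.upperRightY = upperRightY;
        }

        double height() {
            return upperRightY - lowerLeftY;
        }
    }

    // Naive span: {baseline - height, baseline}. Returns {top, bottom}.
    static double[] naiveSpan(double baselineY, GlyphMetrics m, double fontSize) {
        double h = m.height() * fontSize / 1000.0;
        return new double[] { baselineY - h, baselineY };
    }

    // Span based on the real extents: the top comes from upperRightY and the
    // bottom reaches below the baseline by the descender. Returns {top, bottom}.
    static double[] ascentSpan(double baselineY, GlyphMetrics m, double fontSize) {
        double top = baselineY - m.upperRightY * fontSize / 1000.0;
        double bottom = baselineY - m.lowerLeftY * fontSize / 1000.0;
        return new double[] { top, bottom };
    }

    public static void main(String[] args) {
        // Illustrative metrics for a parenthesis-like glyph with a descender.
        GlyphMetrics paren = new GlyphMetrics(-207, 733);
        double[] naive = naiveSpan(100, paren, 10);
        double[] fixed = ascentSpan(100, paren, 10);
        System.out.printf("naive: top=%.2f bottom=%.2f%n", naive[0], naive[1]);
        System.out.printf("fixed: top=%.2f bottom=%.2f%n", fixed[0], fixed[1]);
    }
}
```

Both spans have the same height; the naive one is simply shifted upward by the descender depth, which is consistent with the misplaced parentheses in the first screenshot.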

It seems like a good idea to rework TextPosition so that it would be aware of its bounding box:
*) Replace methods PDSimpleFont#getFontWidth(byte[], int, int) and PDSimpleFont#getFontHeight(byte[], int, int) with a single method PDSimpleFont#getFontBoundingBox(byte[], int, int)
*) Replace the constructor TextPosition(Matrix, Matrix) with TextPosition(Matrix, BoundingBox)
*) Add new methods TextPosition#getBoundingBox, TextPosition#getBoundingBoxDir. This shouldn't affect existing application clients, because TextPosition#getY and TextPosition#getHeight remain in place.
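
A rough sketch of how the last point could look, purely as an illustration: Box and Position below are hypothetical stand-ins, not the PDFBox BoundingBox and TextPosition classes, and the ascent/descent parameters are assumed to be already scaled to page units with y growing downward. The existing-style accessors keep their meaning while a new accessor returns the full box.

```java
public class TextBoxSketch {

    // Axis-aligned box in a y-down page space (hypothetical type).
    static final class Box {
        final double x;
        final double top;
        final double width;
        final double height;

        Box(double x, double top, double width, double height) {
            this.x = x;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    // Hypothetical text position that knows its extents around the baseline.
    static final class Position {
        private final double x;
        private final double baselineY;
        private final double width;
        private final double ascent;   // distance from baseline up to the glyph top
        private final double descent;  // distance from baseline down to the glyph bottom

        Position(double x, double baselineY, double width, double ascent, double descent) {
            this.x = x;
            this.baselineY = baselineY;
            this.width = width;
            this.ascent = ascent;
            this.descent = descent;
        }

        // Unchanged accessors: baseline and absolute height.
        double getY() {
            return baselineY;
        }

        double getHeight() {
            return ascent + descent;
        }

        // New accessor: the box starts at the glyph top, not at baseline - height.
        Box getBoundingBox() {
            return new Box(x, baselineY - ascent, width, ascent + descent);
        }
    }

    public static void main(String[] args) {
        Position p = new Position(50, 100, 3.33, 7.33, 2.07);
        Box b = p.getBoundingBox();
        System.out.printf("x=%.2f top=%.2f w=%.2f h=%.2f%n", b.x, b.top, b.width, b.height);
    }
}
```

Callers that only use getY and getHeight would see no change under this sketch, while callers that need correct placement would switch to getBoundingBox.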

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-577) TextPosition should expose its bounding box

Posted by "Villu Ruusmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Villu Ruusmann updated PDFBOX-577:
----------------------------------

    Attachment: AFM-getUpperRightY.png




[jira] Updated: (PDFBOX-577) TextPosition should expose its bounding box

Posted by "Villu Ruusmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Villu Ruusmann updated PDFBOX-577:
----------------------------------

    Attachment: AFM-getHeight.png

