Posted to dev@pdfbox.apache.org by "Peter Costello (JIRA)" <ji...@apache.org> on 2010/02/07 07:24:28 UTC

[jira] Created: (PDFBOX-611) PDSimpleFont. Font height reported as zero.

PDSimpleFont.  Font height reported as zero.
--------------------------------------------

                 Key: PDFBOX-611
                 URL: https://issues.apache.org/jira/browse/PDFBOX-611
             Project: PDFBox
          Issue Type: Bug
          Components: PDModel
    Affects Versions: 0.8.0-incubator
         Environment: Win and Linux
            Reporter: Peter Costello
             Fix For: 0.8.0-incubator


The logic for PDSimpleFont.getFontHeight() can return a value of zero.
This will corrupt or compromise text extraction and layout.
To reproduce, test with page 12 of 'http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf'.

When a PDFontDescriptor is used, the current logic tries the following, in order:
   1) an average of xHeight and capHeight.
          xHeight is the height from the baseline to the top of a lower-case letter like 'x'.
          capHeight is the height from the baseline to the top of an upper-case Latin character.
   2) xHeight
   3) capHeight
   4) ascent
   5) zero
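
As a rough illustration of the fallback order listed above (not the actual PDFBox source), the sketch below uses plain floats in place of a PDFontDescriptor. The class and method names are hypothetical, and it assumes a missing parameter is reported as zero.

```java
public class CurrentHeightSketch {
    // Hypothetical stand-in for the fallback order described in the list;
    // a missing parameter is assumed to be reported as 0.
    static float currentHeight(float xHeight, float capHeight, float ascent) {
        if (xHeight != 0 && capHeight != 0) {
            return (xHeight + capHeight) / 2; // 1) average of xHeight and capHeight
        }
        if (xHeight != 0) {
            return xHeight;                   // 2) xHeight
        }
        if (capHeight != 0) {
            return capHeight;                 // 3) capHeight
        }
        return ascent;                        // 4) ascent, or 5) zero if that is missing too
    }

    public static void main(String[] args) {
        System.out.println(currentHeight(518f, 715f, 715f)); // all parameters present
        System.out.println(currentHeight(0f, 715f, 715f));   // xHeight missing
        System.out.println(currentHeight(0f, 0f, 0f));       // nothing present: zero
    }
}
```

With these sample numbers the result moves from 616.5 to 715 as soon as xHeight goes missing, and drops to zero when nothing but the bounding box is available.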

This is really bizarre.  'xHeight' is an optional parameter, and 'capHeight' is often missing.

The font bounding box is a required parameter, and its height is what Acrobat Reader uses when you select a line of text.
The bounding box is not perfect, because it often overlaps the line above, but it is a consistent value.  The problem with the
current logic is that the reported height varies way too much between fonts, and a zero value can be reported.

I have modified the logic as follows. The goal was to make the nominal values the same as the current logic,
but return a very similar number when parameters go missing.

         PDFontDescriptor desc = getFontDescriptor();
         if (desc != null) {
             // Top of capital letters to baseline (eg 715)
             float height = desc.getCapHeight();
             if (height == 0) {
                 // Max height above the baseline (eg 715)
                 height = desc.getAscent();
             }
             if (height == 0) {
                 // Half of max height less max depth (eg (1006-(-325))=1331, /2 = about 665)
                 PDRectangle bbox = desc.getFontBoundingBox();
                 if (bbox != null) {
                     // Guard against a descriptor that has no bounding box
                     height = bbox.getHeight() / 2;
                 }
             }
             if (height == 0) {
                 // Top of lower-case to baseline (eg 518) less the descent,
                 // which is negative (eg -209), to get a total of 727
                 height = desc.getXHeight() - desc.getDescent();
             }
             retval = height;
         }
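
To try the proposed order outside PDFBox, here is a minimal self-contained sketch. It replaces the PDFontDescriptor with plain float arguments, assumes missing parameters are reported as zero, and uses hypothetical class and method names; the sample numbers are the ones from the comments above.

```java
public class ProposedHeightSketch {
    // Same fallback order as the proposed fragment: capHeight, ascent,
    // half the bounding box height, then xHeight less the (negative) descent.
    static float proposedHeight(float capHeight, float ascent, float bboxHeight,
                                float xHeight, float descent) {
        float height = capHeight;
        if (height == 0) {
            height = ascent;
        }
        if (height == 0) {
            height = bboxHeight / 2;
        }
        if (height == 0) {
            height = xHeight - descent;
        }
        return height;
    }

    public static void main(String[] args) {
        // All parameters present: capHeight is used.
        System.out.println(proposedHeight(715f, 715f, 1331f, 518f, -209f));
        // capHeight and ascent missing: half the bounding box height.
        System.out.println(proposedHeight(0f, 0f, 1331f, 518f, -209f));
        // Only xHeight and descent present.
        System.out.println(proposedHeight(0f, 0f, 0f, 518f, -209f));
    }
}
```

For these inputs the three results are 715, 665.5 and 727, so the reported height stays in a narrow band as parameters go missing.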


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-611) PDSimpleFont. Font height reported as zero.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-611:
--------------------------------------

    Fix Version/s:     (was: 0.8.0-incubator)

> PDSimpleFont.  Font height reported as zero.
> --------------------------------------------
>
>                 Key: PDFBOX-611
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-611
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 0.8.0-incubator
>         Environment: Win and Linux
>            Reporter: Peter Costello
>   Original Estimate: 1h
>  Remaining Estimate: 1h



[jira] Assigned: (PDFBOX-611) PDSimpleFont. Font height reported as zero.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler reassigned PDFBOX-611:
-----------------------------------------

    Assignee: Andreas Lehmkühler

