Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2014/07/17 08:26:05 UTC

[jira] [Comment Edited] (PDFBOX-2023) Text extraction gets nothing / zero font height

    [ https://issues.apache.org/jira/browse/PDFBOX-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064632#comment-14064632 ] 

Tilman Hausherr edited comment on PDFBOX-2023 at 7/17/14 6:25 AM:
------------------------------------------------------------------

I can confirm this (0 height with org.apache.pdfbox.examples.util.PrintTextLocations) for the trunk for both attached files.


was (Author: tilman):
I can confirm this for the trunk for both attached files.

> Text extraction gets nothing / zero font height
> -----------------------------------------------
>
>                 Key: PDFBOX-2023
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2023
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>         Attachments: PDFBOX-2023.pdf, zero_height.pdf
>
>
> Fred Andrews posted this to the user list and I can confirm that text extraction gets nothing:
> I am using PDFTextStripper on some PDF statements from Bank of America, and everything is coming through as zero height. I traced it down to getFontHeight in org.apache.pdfbox.pdmodel.font.PDSimpleFont, which is indeed returning zero.  The font is a Type 3 font and I'm not sure how it should work, but getFontHeight is calling getAFM(), and that is returning null because it's not a Type 1 font.  Then in the next section of getFontHeight there is no font descriptor, so the zero just flows all the way through getFontHeight.
> I searched for anything I could key on to calculate the font height but couldn't find it.  The font size is claimed to be 20 by getFontSize(), although it appears to be more like 8. I did trace to where it got a font size command of twenty, but somehow I'm assuming that would need to be scaled, and I can't see where that might come from.
> The font width, on the other hand, looks accurate, and I would think something similar to that would be needed, but I would really appreciate some guidance on how it should work.  If I have a clue on how it should work, I can see what I can do to implement it.
> This file displays fine in Acrobat and edits fine in Nitro, so it can't be that invalid.
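
The scaling question raised in the description can be illustrated with a minimal sketch. It assumes the height of a Type 3 font could be derived from the font's FontBBox (in glyph space) scaled by its FontMatrix and the font size. This is not PDFBox code and not necessarily how the issue should be fixed. The class name, method name, and all numeric values are hypothetical and chosen only for illustration; the real values in the attached files may differ.

```java
// Illustrative sketch only; names and values are hypothetical, not PDFBox API.
final class Type3FontHeight {

    /**
     * Scales the height of a glyph-space bounding box by a font matrix and font size.
     *
     * @param fontBBox   [llx, lly, urx, ury] in glyph space
     * @param fontMatrix [a, b, c, d, e, f], mapping glyph space to text space
     * @param fontSize   the size operand of the Tf operator
     */
    static double fontHeight(double[] fontBBox, double[] fontMatrix, double fontSize) {
        double glyphHeight = fontBBox[3] - fontBBox[1];
        // A vertical vector (0, h) maps to (c*h, d*h); its length is h * hypot(c, d).
        // For an unrotated matrix this reduces to h * |d|.
        double verticalScale = Math.hypot(fontMatrix[2], fontMatrix[3]);
        return glyphHeight * verticalScale * fontSize;
    }

    public static void main(String[] args) {
        double[] bbox = {0, -100, 500, 300};           // assumed: 400 glyph units tall
        double[] matrix = {0.001, 0, 0, 0.001, 0, 0};  // assumed: 1/1000 scaling
        System.out.println(fontHeight(bbox, matrix, 20));
    }
}
```

With these assumed values, a nominal font size of 20 yields a height of roughly 8 units, which shows how a scaling step of this kind could account for a gap between the reported size and the apparent size. Whether that is what happens in these files is not established here.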



--
This message was sent by Atlassian JIRA
(v6.2#6252)