Posted to users@pdfbox.apache.org by Joel Hirsh <jo...@gmail.com> on 2015/10/26 06:36:04 UTC

Problems getting the height of text in v2?

I am trying to get the size of text (i.e. the font size).  In version 1.8,
the height of text was somewhat inconsistent, and missing for Type 3 fonts,
but I thought that was supposed to be all sorted out in v2.0.  Instead,
version 2 seems to be even more inconsistent than version 1.8.

I am using PDFTextStripper and reading the TextPosition array that comes
with each String.  I have tried getHeight(), getFontSize(),
getFontSizeInPt(), and getYScale(), and none of them gives a dependable
answer.  They are consistent within a file, but useless for checking
whether a particular string contains text of a readable size.

Which one of these TextPosition values should be used for this purpose?
And should I then report bugs on all the files that don't give correct
results?

FYI: I ran a test with version 2 against 100+ PDF files that come from
different sources and use a mixture of TrueType, Type 0, Type 1, and Type 3
fonts.  All of these have text with a font size of 8-12pt, as reported by
Acrobat.  I dumped the size values returned for digit strings in the files
(e.g. 12345), so that every string should be full height.

The reported height of text mostly ranged from 2.3 to 7.5 (although one
very readable file reported a height of 0).  I examined a few files with
Acrobat, and the files with reported text heights of 2.3 and 7.5 both had
9pt fonts.  The other values from TextPosition were worse.  The fontsize
was a plausible value for only about half of these files, and seemed
particularly bad for TrueType fonts.  The fontsize values ranged from 1 to
200.  The fontsizeinpt values seemed mostly to be a multiple of fontsize,
but even that was inconsistent: often it seemed to be the square of the
fontsize (like a fontsize of 58 and a fontsizeinpt of 3364), but sometimes
simply a multiple of 10.

The most accurate value I could find in the TextPosition was getYScale(),
which had a plausible value about 90% of the time.  But on Type 3 fonts it
too was inconsistent, often returning values of 1, but also values up to 27.

So how should I be finding out the height of text?

Re: Problems getting the height of text in v2?

Posted by John Hewson <jo...@jahewson.com>.

You’re right that these methods are inconsistent. You might expect that
PDFBox would be returning the dimensions of a given glyph or string’s
bounding box from those methods, but that’s not the case. What’s
actually returned from getWidth() is the *logical* width of the glyph, i.e.
its advance width, not its visual width. That’s pretty normal and is fine
for most use cases. What’s not normal is that there’s a getHeight()
method, as there’s no such thing as the logical height of a glyph, because
it’s always equal to the font size, regardless of the glyph.
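
To make the "logical height equals font size" point concrete, here is a
minimal sketch in plain Java (no PDFBox calls). It assumes, based on the
general PDF text model rather than anything stated in this thread, that the
size at which text is drawn is the Tf font size multiplied by the vertical
scale of the combined text matrix and CTM. The class and method names are
hypothetical.

```java
// Hypothetical illustration only; none of these names come from PDFBox.
class EffectiveFontSize {

    // For a PDF matrix [a b c d e f], the unit y-vector (0, 1) maps to (c, d),
    // so the vertical scale is the length of that vector.
    static double verticalScale(double c, double d) {
        return Math.hypot(c, d);
    }

    // Assumed model: rendered size = Tf font size * vertical scale of the
    // combined text matrix and CTM.
    static double effectiveSize(double tfSize, double c, double d) {
        return tfSize * verticalScale(c, d);
    }

    // Readability check against a caller-chosen minimum size in points.
    static boolean isReadable(double tfSize, double c, double d, double minPt) {
        return effectiveSize(tfSize, c, d) >= minPt;
    }

    public static void main(String[] args) {
        // Font size 9 with an unscaled matrix.
        System.out.println(effectiveSize(9, 0, 1));
        // Font size 1 with a matrix that scales by 9: same rendered size.
        System.out.println(effectiveSize(1, 0, 9));
        // Font size 10 rotated by 90 degrees (c = -1, d = 0).
        System.out.println(effectiveSize(10, -1, 0));
    }
}
```

Under this model a file can carry a raw font size of 1 and still contain
9pt text, so neither the raw font size nor the matrix scale would be
meaningful in isolation.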

So what does PDFont.getHeight() do? Well, it’s not pretty: sometimes it
returns the visual height of the glyph, other times it returns the y-advance
(even though that’s zero unless it’s a vertical font). Sometimes it returns
values in text space, other times in glyph space. We should probably just
remove this method as it really serves no purpose, but somewhere in the
2000-odd lines of PDFTextStripper are some assumptions which depend on it,
and I for one have no intention of entering that labyrinth.

Actually it gets worse: PDFTextStripper depends on several incorrect
calculations of the text rendering matrix and other values, and fixing them
caused PDFTextStripper to break. As a workaround, PDFTextStreamEngine
was created, which overrides showGlyph and replaces the perfect calculations
of PDFStreamEngine with the incorrect calculations on which PDFTextStripper
depends. There are some fun assumptions in there, such as using 1/2 the
font’s (yes font, not glyph) bounding box as the current glyph’s height, which
is quite meaningless.
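
The half-bounding-box assumption can be illustrated with a little
arithmetic. The sketch below is plain Java, not PDFBox code; it assumes the
usual 1000-units-per-em glyph space and that the halved font bounding box
height is simply scaled by the font size. The helper name and the bounding
box heights are made up for illustration.

```java
// Hypothetical illustration only; this is not the PDFBox implementation.
class HalfBBoxHeight {

    // Assumed model: take half of the font-wide bounding box height (in
    // 1/1000 em glyph units) and scale it by the font size in points.
    static double legacyHeight(double fontBBoxHeightUnits, double fontSizePt) {
        return (fontBBoxHeightUnits / 2.0) / 1000.0 * fontSizePt;
    }

    public static void main(String[] args) {
        double fontSizePt = 9.0; // same text size for every font below
        double[] bboxHeights = {500, 1000, 1600}; // made-up font bounding boxes
        for (double bbox : bboxHeights) {
            System.out.println("bbox height " + bbox + " units -> height "
                    + legacyHeight(bbox, fontSizePt));
        }
    }
}
```

Under these assumptions the result depends only on the font's bounding box
and not on the glyph, which is one possible reason why two files with 9pt
text could report quite different heights.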

Those interested in fixing PDFTextStripper may want to start by removing
the legacy calculations from PDFTextStreamEngine and removing the
PDFont.getHeight() method entirely. They may then want to consider
whether to use visual bounds or logical bounds when computing
glyph properties. (Logical is simpler, faster, and probably fine.) I wish
those people good luck!

The good news is that PDFont.getWidth() and PDFStreamEngine perform
their calculations correctly. Hence we get correct text rendering, even if
text extraction is incorrect. So the problems are contained solely in 
PDFTextStripper, PDFTextStreamEngine, and TextPosition.

— John



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Problems getting the height of text in v2?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

> On 26.10.2015 at 06:36, Joel Hirsh <jo...@gmail.com> wrote:
> 
> I am trying to get the size of text (i.e fontsize).  In version 1.8, the
> height of text was somewhat inconsistent, and not there for type 3 fonts,
> but I thought that was supposed to be all sorted out in v2.0.  But version
> 2 seems to be even more inconsistent than version 1.8.
> 
> I am using PDFTextStripper and reading the TextPosition array that comes
> with each String.  I have tried getHeight(), getFontSize(),
> getFontSizeInPt(), getYScale, and none of them are dependable for a useful
> answer.  They are consistent within a file, but useless for checking if a
> particular string contains readable size text.

Maybe take a look at PrintTextLocations.java in the examples package. This should allow you to compare the output of the 1.8.x version with that of the 2.0.0 version.
> 
> Which one of these TextPosition values should be used for this purpose
> And then do I report bugs on all the files that don't give correct results?

If there are differences between 1.8.x and 2.0.0, yes, please open an issue at https://issues.apache.org/jira/browse/PDFBOX/.

Please check whether there are already similar issues which you could add to. We are currently working together with Apache Tika to look at potential regressions in 2.0.0 compared to 1.8.x, and some issues have already been created and fixed; you can follow them at https://issues.apache.org/jira/browse/PDFBOX-3058.

BR
Maruan



Re: Problems getting the height of text in v2?

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

If you tried with RC1: yes, there were many issues with the font
height and size, and we had a Type 3 font bug that affected many
files. So it may have been fixed already.

If not, as Maruan said, please open an issue. If you have many problems,
start with the single file that seems to be the worst.

Tilman
