Posted to users@pdfbox.apache.org by Sebastian Freuck <fr...@gmx.de> on 2010/06/23 13:56:16 UTC
PDF text extraction question
Hello. I am trying to extract information about mathematical formulas
from PDF documents and have run into a problem. As an example, see
http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf
page 15, the formula in the middle of the page: P(A) = ...
My idea was to get the TextPosition objects and use the encoding of the
font to look up the glyph name of each character. This works only
partially. For example, for the character '=' I get the name 'equal' and
for '(' the name 'parenleft'. For the absolute value bars, however, I
get the name 'j'. Why is this? Does the font not have a separate name
for that glyph? The same happens with the omega character, which gets
the name 'W', and with the infinity character at the end of the page,
which gets the name 'yen'.
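
To make the lookup concrete, here is a minimal sketch in plain Java
that does not use PDFBox at all. The class GlyphNameLookup and its
method glyphName are hypothetical names chosen for illustration, and
the table is a hand-written excerpt of a standard Latin text encoding.
It is also an assumption that the symbols in the PDF sit at the
character codes 0x6A, 0x57 and 0xA5; this message only reports the
resulting names, not the codes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not PDFBox code: maps a character code to the
// glyph name that a Latin text encoding assigns to that code.
final class GlyphNameLookup {

    // Hand-written excerpt of code -> glyph name assignments.
    private static final Map<Integer, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put(0x28, "parenleft");
        NAMES.put(0x3D, "equal");
        NAMES.put(0x57, "W");    // assumed code of the omega glyph
        NAMES.put(0x6A, "j");    // assumed code of the absolute value bar
        NAMES.put(0xA5, "yen");  // assumed code of the infinity glyph
    }

    // Returns the glyph name for a code, or ".notdef" if the excerpt has none.
    static String glyphName(int code) {
        return NAMES.getOrDefault(code, ".notdef");
    }

    public static void main(String[] args) {
        int[] codes = {0x3D, 0x28, 0x6A, 0x57, 0xA5};
        for (int code : codes) {
            System.out.printf("0x%02X -> %s%n", code, glyphName(code));
        }
    }
}
```

A lookup like this only knows the code and the name table, never the
shape the font actually draws, so it returns a text name for any code
where the table carries one.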
When I try to extract the information with Adobe Reader instead, I get
the same results. The document was created with pdfTeX. Does this
problem occur in every mathematical PDF? Is there no way to find out
which character a glyph really displays? Also, is it an error in the
font that a glyph has a completely different name than the character it
displays?
Yours sincerely
Sebastian