Posted to users@pdfbox.apache.org by Sebastian Freuck <fr...@gmx.de> on 2010/06/23 13:56:16 UTC

PDF text extraction question

Hello. I am trying to extract information about mathematical formulas 
from PDF documents and have run into a problem. As an example, see the 
formula in the middle of page 15 (P(A) = ...) of 
http://upload.wikimedia.org/wikibooks/de/f/f6/Mathematik_Stochastik.pdf
My idea was to take the TextPosition objects and use the encoding of 
the font to look up the glyph name of each character. This works only 
partially. For example, for the character '=' I get the name 'equal' 
and for '(' the name 'parenleft'. For the absolute value bars, however, 
I get the name 'j'. Why is this? Does the font not have a separate name 
for that glyph? The same happens with the omega character, which comes 
back as 'W', and with the infinity character at the end of the page, 
which comes back as 'yen'.
When I try to extract the information using Adobe Reader, I get the 
same results. The document was created using pdfTeX. Does this problem 
affect every mathematical PDF? Is there no way to find out which 
character is really displayed? Also, is it an error in the font that a 
glyph has a completely different name from the character it displays?
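
One possible reading of these results, sketched here as an assumption 
and not checked against the file itself: the names 'j', 'W' and 'yen' 
are what a Latin text encoding assigns to the character codes 0x6A, 
0x57 and 0xA5, while math fonts reuse those same codes for other 
glyphs. The small Java sketch below uses only hand-copied excerpts of 
three code tables. The class GlyphNameDemo, its table names and the 
choice of tables are hypothetical and not part of PDFBox, and which 
fonts the document really uses is a guess.

```java
import java.util.Map;

// Illustrative sketch only: tiny excerpts of three code-to-glyph-name tables.
// Which fonts the PDF in question actually uses is an assumption.
class GlyphNameDemo {
    // Excerpt of a Latin text encoding (these codes have the same names in
    // StandardEncoding and WinAnsiEncoding).
    static final Map<Integer, String> LATIN_TEXT = Map.of(
            0x28, "parenleft", 0x3D, "equal",
            0x57, "W", 0x6A, "j", 0xA5, "yen");

    // Excerpt of the built-in encoding of the Symbol font.
    static final Map<Integer, String> SYMBOL = Map.of(
            0x28, "parenleft", 0x3D, "equal",
            0x57, "Omega", 0xA5, "infinity");

    // Excerpt of the Computer Modern math symbol font (cmsy) layout.
    static final Map<Integer, String> CMSY = Map.of(0x6A, "bar");

    // Look up a character code in one table; unknown codes give ".notdef".
    static String glyphName(Map<Integer, String> table, int code) {
        return table.getOrDefault(code, ".notdef");
    }

    public static void main(String[] args) {
        // The same code yields different names depending on the table used.
        System.out.printf("0x57: text=%s, symbol=%s%n",
                glyphName(LATIN_TEXT, 0x57), glyphName(SYMBOL, 0x57));
        System.out.printf("0xA5: text=%s, symbol=%s%n",
                glyphName(LATIN_TEXT, 0xA5), glyphName(SYMBOL, 0xA5));
        System.out.printf("0x6A: text=%s, cmsy=%s%n",
                glyphName(LATIN_TEXT, 0x6A), glyphName(CMSY, 0x6A));
    }
}
```

If this reading applies, '=' and '(' come out right only because they 
sit at the same codes in all of these tables.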

Yours sincerely

Sebastian