Posted to users@pdfbox.apache.org by James T Whelan <jt...@us.ibm.com> on 2009/08/18 23:14:15 UTC

Question on PDFBOX and Extracting Text with font and language metrics intact.

Hello, I am investigating possible solutions to a task I have been
assigned and have been experimenting with some of the PDFBox examples.
Below is a high-level synopsis; I would appreciate it if someone could
tell me whether PDFBox is a suitable solution.

I need to create a Table of Contents PDF for product documentation that
links to each PDF document on the CD/DVD.  What I have in mind is to
extract the title/subject from the first page of each document and place
it in the TOC PDF as a hot link to that document.  The TOC needs to keep
the language and font metrics of the extracted text intact, so the
solution must support any language and font used in the existing product
PDFs.

I have looked at the ExtractTextByArea example, but I am not sure the font
and language metrics are available for replication.  It seems to work fine
for ASCII and extended ASCII characters, but others such as Chinese
(Simplified), Korean, and Japanese are shown as "?????".
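
One possible explanation for the "?????" output, not confirmed for this
case, is that the extracted text is correct in memory but is written out
in a character encoding that cannot represent CJK characters.  The sketch
below uses only the Java standard library, not PDFBox; the class name
EncodingDemo and the helper roundTrip are hypothetical names made up for
this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: encode the text to bytes in the given charset and
    // decode it again, mimicking text written to a file or console that uses
    // that charset. Characters the charset cannot map are replaced with '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file
        // encoding does not matter.
        String title = "\u65e5\u672c\u8a9e";

        // US-ASCII cannot represent these characters, so each becomes '?'.
        System.out.println(roundTrip(title, StandardCharsets.US_ASCII).equals("???"));

        // UTF-8 can represent them, so the text survives unchanged.
        System.out.println(roundTrip(title, StandardCharsets.UTF_8).equals(title));
    }
}
```

If this is the cause, writing the extracted text through a UTF-8 writer,
rather than the platform default encoding, should preserve the characters.
Whether the fonts and metrics can also be carried over is a separate
question that this sketch does not address.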

I have also investigated the PDFPagePanel example and would like to know
why it only processes the annotations in the loaded document; my PDFs
contain none.

I would very much appreciate any input before I invest too much more time 
in the feasibility phase of this project.

Thank you in advance and best regards,

Jim Whelan
eCare, HAM, MServer, and Media Teams 
Dept: GJZA, ISC Manufacturing Support
Poughkeepsie, New York