Posted to users@pdfbox.apache.org by James T Whelan <jt...@us.ibm.com> on 2009/08/18 23:14:15 UTC
Question on PDFBOX and Extracting Text with font and language metrics intact
Hello, I am investigating possible solutions to a task I have been
assigned and have been trying out some of the PDFBox examples. Below is a
high-level synopsis; I would like to know whether PDFBox is a suitable
solution.
I need to create a Table of Contents (TOC) PDF for product documentation
that provides a link/anchor to each of the PDF documents on the CD/DVD.
What I have in mind is to extract the title/subject from the first page of
each document and place it in the TOC PDF as a hot link to that document.
The solution needs to keep the language and font metrics intact in the
TOC, so it must support any language and font used in the existing
product PDFs.
I have looked at the ExtractTextByArea example, but I am not sure the font
and language metrics are available for replication. It seems to work fine
for ASCII and extended ASCII characters, but others such as Chinese
(Simplified), Korean, and Japanese are shown as "?????".
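
One possible cause of the "?????" output (an assumption, not something
confirmed for this setup) is that the text is extracted correctly but is
then written out in a character encoding that cannot represent CJK
characters. The sketch below uses only the Java standard library, not
PDFBox, and a hypothetical stand-in string in place of real extracted
text. It shows that encoding to ISO-8859-1 replaces every unmappable
character with '?', while UTF-8 preserves the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encode the text to bytes with the given charset, then decode it again.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for any character the charset cannot represent.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for text extracted from a PDF (four CJK characters).
        String extracted = "\u4e2d\u6587\u6587\u6863";

        // ISO-8859-1 cannot encode these characters, so each becomes '?'.
        String latin1 = roundTrip(extracted, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 preserved text: " + latin1.equals(extracted));

        // UTF-8 can encode them, so the text survives unchanged.
        String utf8 = roundTrip(extracted, StandardCharsets.UTF_8);
        System.out.println("UTF-8 preserved text: " + utf8.equals(extracted));
    }
}
```

If this is the cause, writing the extracted text through a UTF-8 writer
and viewing it with a font that covers the characters should show the
original text. If the "?????" persists, the problem is more likely in the
extraction step itself.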
I have also investigated the PDFPagePanel example and would like to know
why it only processes annotations in the loaded document; my PDFs contain
none.
I would very much appreciate any input before I invest too much more time
in the feasibility phase of this project.
Thank you in advance and best regards,
Jim Whelan
eCare, HAM, MServer, and Media Teams
Dept: GJZA, ISC Manufacturing Support
Poughkeepsie, New York