Posted to users@pdfbox.apache.org by Tony Scerri <to...@gmail.com> on 2009/08/14 13:21:24 UTC
Font Subsets and Merging

Hi

I have only recently started using PDFBox, primarily for extracting text
from PDFs, so apologies if I use the wrong terminology or am unclear on
some points.

It works pretty well, giving as close to a perfect text stream as I could
expect, with one minor problem: it occasionally jumbles up letters, even
with sort by position enabled. I searched the mailing lists and found a few
references to what I believe may be similar or related problems. I have been
able to fix my problem with some minor patching of the code. The patch was
only an experiment to test what I thought might be the cause, but I wanted
to let you know what I found, and if there is interest in the changes I can
send them on.

It looks like the PDFs I had trouble with include the same font (judging by
the descriptor) multiple times, but each instance has a different set of
characters and different attributes such as the widths array, first char,
last char, bounding box, cap height, stemv etc. Having read the PDF spec, it
seems such fonts should be given a prefix on the font name to indicate they
are subsets; these, however, were not. I first tried a simple fix of getting
the font lookup to be based on the font resource rather than on the name
from the font cache. This didn't work out. So I then opted for merging a
font into the cached one whenever it was already found in the cache. Using
the attributes mentioned above, the merge combines the widths arrays,
preserving what was already there and adding any non-zero values to it,
while aligning them on the first and last character values to create the
superset. It also adjusts the first and last char accordingly. I added code
to keep the maximum cap height and stemv values as well (this didn't appear
to make any difference to my output).
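
As a rough illustration of the merge described above, here is a minimal Java
sketch. It is not the actual patch and does not use any PDFBox classes:
SubsetMetrics and its fields are hypothetical names made up for this example.
It also assumes that "preserving what was already there" means a cached
non-zero width is never overwritten, and that an incoming width is only
copied into a slot that is still zero.

```java
/**
 * Hypothetical holder for the per-subset font attributes discussed above.
 * This is an illustrative type, not a PDFBox class.
 */
final class SubsetMetrics {
    final int firstChar;
    final int lastChar;
    final float[] widths;   // widths[i] belongs to character code firstChar + i
    final float capHeight;
    final float stemV;

    SubsetMetrics(int firstChar, int lastChar, float[] widths,
                  float capHeight, float stemV) {
        if (widths.length != lastChar - firstChar + 1) {
            throw new IllegalArgumentException(
                "widths length must equal lastChar - firstChar + 1");
        }
        this.firstChar = firstChar;
        this.lastChar = lastChar;
        this.widths = widths.clone();
        this.capHeight = capHeight;
        this.stemV = stemV;
    }

    /**
     * Merges an incoming subset into the cached one. Assumption: cached
     * non-zero widths win; incoming non-zero widths only fill empty slots.
     */
    static SubsetMetrics merge(SubsetMetrics cached, SubsetMetrics incoming) {
        // The superset spans the lowest first char to the highest last char.
        int first = Math.min(cached.firstChar, incoming.firstChar);
        int last = Math.max(cached.lastChar, incoming.lastChar);
        float[] merged = new float[last - first + 1];

        // Keep everything the cached font already had, shifted into place.
        System.arraycopy(cached.widths, 0, merged,
                         cached.firstChar - first, cached.widths.length);

        // Add non-zero widths from the incoming subset where nothing is set.
        for (int i = 0; i < incoming.widths.length; i++) {
            int target = incoming.firstChar - first + i;
            if (merged[target] == 0f && incoming.widths[i] != 0f) {
                merged[target] = incoming.widths[i];
            }
        }

        // Keep the larger cap height and stemv of the two instances.
        return new SubsetMetrics(first, last, merged,
                                 Math.max(cached.capHeight, incoming.capHeight),
                                 Math.max(cached.stemV, incoming.stemV));
    }
}
```

For example, merging a cached subset covering codes 65 to 67 with an incoming
one covering 66 to 69 yields a single entry covering 65 to 69. Whether cached
values should win when both subsets give a non-zero width for the same code
is a design choice; this sketch keeps the cached one.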

This change corrected the text output in several places without introducing
any additional errors. I figured I'd let you know that in some cases subsets
are declared incorrectly but can, it seems, be combined nonetheless to give
better results.
So if this proves useful and you'd like to see the code, let me know; I may
get a chance to clean it up before sending it on.

Tony