Posted to dev@pdfbox.apache.org by Stéphane Zampelli <sz...@gmail.com> on 2010/07/20 11:41:25 UTC

parseCOSString bug?

Hi all,

First of all, thanks for the great work on pdfbox.

I am trying to use ReplaceString on a PDF that contains accented characters
(French accents).
For instance, I have a PDF file containing "Bienvenue à l'école" ("Welcome
to school"), and I try to replace every 'o' with 'A' using ReplaceString. I
would expect to see "Bienvenue à l'écAle" in the new PDF. Instead, the PDF
displays the following text: "Bienvenue  l'cle" (A). If I copy and paste
from the PDF, I get this: "Bienvenue ? l'?cAle" (B).

I am trying to fix this in the PDFBox code. Here is my view of the problem:
it looks like accented characters are not parsed properly.
My guess is that COSString parses strings incorrectly.
BaseParser.parseCOSString() calls COSString.createFromHexString() to convert
a string of the form <...>, but
shouldn't COSString.createFromHexString() decode it using either the CMap or
the predefined encoding of the font (as explained in the PDF specification,
section 5.9.1)? The code of COSString.createFromHexString() performs a
direct conversion, without using any information from the associated font.
In my particular case, for my PDF, the MacRomanEncoding encoding should
be used.
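
To illustrate what I mean, here is a minimal Java sketch. It is not PDFBox
code: the class name, the method names, and the two-entry encoding table are
made up for illustration, and I am only assuming that the "direct
conversion" amounts to turning each byte into the character with the same
code. It decodes the hex string for "à l'école" once directly and once
through a MacRomanEncoding lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class HexStringDecodeSketch {

    // Hypothetical excerpt of MacRomanEncoding: only the two codes used below.
    private static final Map<Integer, Character> MAC_ROMAN_EXCERPT = new HashMap<>();
    static {
        MAC_ROMAN_EXCERPT.put(0x88, '\u00E0'); // a with grave accent
        MAC_ROMAN_EXCERPT.put(0x8E, '\u00E9'); // e with acute accent
    }

    // Turn the contents of a <...> hex string into character codes.
    // Assumes an even number of hex digits and no embedded whitespace.
    public static int[] hexToCodes(String hex) {
        int[] codes = new int[hex.length() / 2];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return codes;
    }

    // Direct conversion: each code becomes the char with the same value.
    public static String decodeDirect(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int c : codes) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Font-aware conversion: codes found in the encoding table are mapped
    // through it; all other codes fall back to the direct conversion.
    public static String decodeWithEncoding(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int c : codes) {
            Character mapped = MAC_ROMAN_EXCERPT.get(c);
            sb.append(mapped != null ? mapped : (char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "à l'école" as MacRomanEncoding character codes.
        int[] codes = hexToCodes("88206C278E636F6C65");
        System.out.println(decodeDirect(codes));       // 0x88 and 0x8E stay as raw codes
        System.out.println(decodeWithEncoding(codes)); // accents are recovered
    }
}
```

With the direct conversion, the codes 0x88 and 0x8E end up as characters
U+0088 and U+008E, which are control characters rather than 'à' and 'é'.
That would be consistent with the accents vanishing in (A), although I have
not verified that this is the actual cause.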

I do not yet understand why the 'A' is not displayed.

Would you agree that COSString.createFromHexString() performs an incorrect
conversion, and that it should follow the scheme described in the PDF
specification for proper conversion?

Thanks,

Stéphane.