Posted to users@pdfbox.apache.org by Cornelis Hoeflake <c....@postex.com> on 2014/04/02 14:55:44 UTC

Weird behaviour PDF from Word for Mac

Hi,

When I execute PDFStreamEngine.processStream(PDPage, PDResources,
COSStream) I see very weird behaviour in the TextPositions.
Every TextPosition that should be a single 'space' consists of multiple
characters (TextPosition.getCharacter), with these code points:
9, 13, 32, 160

When I step through the code that fills the font's cmap (via the
debugger), I see a byte array of:
[0, 9, 0, 13, 0, 32, 0, -96] which is interpreted as a String with UTF-16BE
encoding. Huh? -96?
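
The -96 is less odd than it looks: Java bytes are signed, so -96 is the same
bit pattern as 0xA0 (160), and the UTF-16BE pair [0, -96] decodes to U+00A0,
the no-break space. A minimal sketch that checks this using only the JDK (the
class and method names are made up for illustration and are not part of
PDFBox):

```java
import java.nio.charset.StandardCharsets;

class CMapBytesDemo {
    // Decode the raw bytes as UTF-16BE and return the code point of each character.
    public static int[] decodeUtf16Be(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_16BE);
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // The byte array seen in the debugger; -96 is 0xA0 as a signed byte.
        byte[] raw = {0, 9, 0, 13, 0, 32, 0, -96};
        for (int cp : decodeUtf16Be(raw)) {
            System.out.printf("U+%04X (%d)%n", cp, cp);
        }
        // prints:
        // U+0009 (9)
        // U+000D (13)
        // U+0020 (32)
        // U+00A0 (160)
    }
}
```

So the four values are TAB, CR, SPACE and NO-BREAK SPACE, which matches the
9, 13, 32, 160 reported by TextPosition.getCharacter.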

Copying the text from Adobe Reader on Windows and pasting it into Notepad
'adds' a newline at every space.

To reproduce:
Create a simple document in Word for Mac (newest version) using the font
Cambria. The document contains only 'a a'. Save the document as PDF (via
Save-As).

When using the font Verdana instead of Cambria, the problem does NOT occur.
Doing the same in Word for Windows, the problem does NOT occur either.

So my conclusion is that it is an issue in Word for Mac with the Cambria
font. Can anyone confirm that?

Either way, my PDFBox code has to handle it correctly. What is a safe
assumption? Can I safely assume that when multiple characters are returned
from TextPosition.getCharacter, the extra ones can be ignored? Or should I
look for a specific byte sequence ending with the -96?
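
One possible approach, sketched below under stated assumptions rather than as
a confirmed fix: ignoring every multi-character result is probably too broad,
because a single glyph can legitimately map to several characters (a ligature
such as "fi", for example). A narrower rule is to collapse the string to one
space only when every character in it is whitespace-like. The helper takes
the String returned by TextPosition.getCharacter; the class and method names
are hypothetical and not part of PDFBox.

```java
final class SpaceNormalizer {
    // Whitespace-like: anything Character.isWhitespace accepts (TAB, CR, LF,
    // SPACE, ...) plus the no-break space, which isWhitespace rejects.
    public static boolean isSpaceLike(char c) {
        return c == '\u00A0' || Character.isWhitespace(c);
    }

    // Returns a single space if the text consists only of whitespace-like
    // characters; otherwise returns the text unchanged.
    public static String normalize(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        for (int i = 0; i < text.length(); i++) {
            if (!isSpaceLike(text.charAt(i))) {
                return text; // real content, e.g. a ligature: leave it alone
            }
        }
        return " ";
    }
}
```

With this rule the four-character string from the Cambria document becomes a
single space, while ordinary text and multi-character ligature mappings pass
through untouched. It does not depend on the exact order of the characters or
on the -96 appearing last.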

Kind regards,
Cornelis Hoeflake