Posted to dev@pdfbox.apache.org by Joel Hirsh <jo...@gmail.com> on 2014/05/04 20:15:55 UTC

Is this a bug in PDFStreamEngine?

I am using PDFTextStripper and getting odd results on some strings that I
tracked down to something that I think may be a bug in PDFStreamEngine.

The PDF file has some text that looks like "1234" in Acrobat, but comes
through as "1 2 3 4" from PDFTextStripper.  PDFTextStripper inserts the
spaces because it sees a large inter-character spacing.

Tracing it down, the PDF file has a 'Tc' (character spacing operator)
followed by a 'Tm' (text matrix operator) with a scale of 8.  The other PDF
files I could find that use 'Tc' had the 'Tc' after the 'Tm'.

What strikes me as incorrect is that PDFStreamEngine does not distinguish
between a 'Tc' followed by a 'Tm' and a 'Tm' followed by a 'Tc'.  In either
case the spacing from the 'Tc' is multiplied by the scale factor in the
matrix.  I could not find anything in the Adobe PDF spec that specifically
addresses the order of these two operators, but mathematically the order
in which a scale is applied makes a big difference.  In the case that looks
incorrect, the spacing is multiplied by the scale in the matrix, and the
result would be closer to Acrobat's if it were not.
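
To make the difference concrete, here is a minimal sketch of the arithmetic
only.  It is not PDFBox code: the class name, the method names, and the 'Tc'
value of 0.5 are made up for illustration.  Only the scale of 8 comes from
the file described above.  The second method is the order-sensitive reading
I am asking about, not something the spec or PDFBox is known to do.

```java
public class TcOrderSketch {

    // Reading A (what I believe PDFStreamEngine does today): the character
    // spacing is multiplied by the matrix scale regardless of operator order.
    static double alwaysScaled(double tc, double scale) {
        return tc * scale;
    }

    // Reading B (hypothetical, order-sensitive): a 'Tc' that appears before
    // the 'Tm' is left unscaled; a 'Tc' after the 'Tm' is scaled.
    static double orderSensitive(double tc, double scale, boolean tcBeforeTm) {
        return tcBeforeTm ? tc : tc * scale;
    }

    public static void main(String[] args) {
        double tc = 0.5;    // hypothetical 'Tc' operand
        double scale = 8.0; // scale of the 'Tm' in the problem file

        System.out.println("always scaled:           "
                + alwaysScaled(tc, scale));
        System.out.println("order-sensitive, Tc->Tm: "
                + orderSensitive(tc, scale, true));
        System.out.println("order-sensitive, Tm->Tc: "
                + orderSensitive(tc, scale, false));
    }
}
```

With a scale of 8 the two readings differ by a factor of 8 for the
Tc-before-Tm case, which is easily enough to push a gap-based heuristic into
inserting spaces between characters.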

Can someone with more knowledge of PDFStreamEngine/PDFTextStripper comment
on this?  The code that does the multiplication is in
PDFStreamEngine.processEncodedText, where it operates on the value in
characterSpacingText.

Thanks
JH