Posted to dev@pdfbox.apache.org by "Franken (Jira)" <ji...@apache.org> on 2022/04/23 13:32:00 UTC

[jira] [Created] (PDFBOX-5420) PDFTextStripper does not use cm to infer correct font size

Franken created PDFBOX-5420:
-------------------------------

             Summary: PDFTextStripper does not use cm to infer correct font size
                 Key: PDFBOX-5420
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5420
             Project: PDFBox
          Issue Type: Bug
            Reporter: Franken
         Attachments: TextStripperTest.kt, TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf, image-2022-04-23-14-46-34-929.png

*Given*

The attached PDF uses the cm operator to scale the current transformation matrix (CTM) by a factor of 0.03. The font size is then set to 282 with the Tf operator.

!image-2022-04-23-14-46-34-929.png|width=389,height=84!

 

*Error Description*

When PDFTextStripper is used to extract the text from that PDF, the resulting TextPosition objects report a font size of 282pt. The correct font size would be 10pt. The cause of this miscalculation is that PDFTextStripper does not scale the font size by the current transformation matrix.
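
To make the arithmetic concrete, here is a minimal, self-contained sketch of how the effective size follows from the Tf size and the two scale factors. It does not use PDFBox; the class `EffectiveFontSize` and its methods are hypothetical names for illustration only, and an identity text matrix is assumed.

```java
public class EffectiveFontSize {

    /**
     * Horizontal scale of an affine matrix whose first row is (a, b).
     * Hypothetical helper; uses the vector length so that a rotated
     * matrix still yields a sensible (non-negative) scale.
     */
    static float scalingFactorX(float a, float b) {
        return (float) Math.sqrt(a * a + b * b);
    }

    /** Nominal Tf size multiplied by the text-matrix and CTM scales. */
    static float effectiveSize(float tfSize, float textMatrixScaleX, float ctmScaleX) {
        return tfSize * textMatrixScaleX * ctmScaleX;
    }

    public static void main(String[] args) {
        float tfSize = 282f;                        // operand of Tf in the report
        float textScale = scalingFactorX(1f, 0f);   // assumption: identity text matrix
        float ctmScale = scalingFactorX(0.03f, 0f); // rounded cm factor quoted above

        System.out.println("nominal size:   " + tfSize);
        System.out.println("effective size: " + effectiveSize(tfSize, textScale, ctmScale));
    }
}
```

With the rounded factor 0.03 this gives roughly 8.46 rather than exactly 10. The 10pt figure presumably reflects the unrounded cm value or an additional text-matrix scale in the attached file, neither of which is reproduced here.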

 

*Proposed fix*

In LegacyPDFStreamEngine.java, the bug can be fixed in the showGlyph method. There, the fontSizeInPt argument passed to the TextPosition constructor should also take the CTM scale into account:
{code:java}
processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
        pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
        Math.abs(dyDisplay), dxDisplay,
        Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
        fontSize,
        // scale the nominal Tf size by both the text matrix and the CTM
        (int)(fontSize * textMatrix.getScalingFactorX()
                * getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));
{code}
*Further remarks*

To make the error easy to triage, I attached a unit test and a sample file. The sample was manually edited to remove all unnecessary data and then repaired with qpdf. However, I reduced only the content stream; the other objects in the PDF are still present, so the file is fairly large. As I mainly program in Kotlin, I attached the original version of the test I used to debug the issue. A Java version is attached as well.


