Posted to dev@pdfbox.apache.org by "Franken (Jira)" <ji...@apache.org> on 2022/04/23 13:32:00 UTC
[jira] [Created] (PDFBOX-5420) PDFTextStripper does not use cm to infer correct font size
Franken created PDFBOX-5420:
-------------------------------
Summary: PDFTextStripper does not use cm to infer correct font size
Key: PDFBOX-5420
URL: https://issues.apache.org/jira/browse/PDFBOX-5420
Project: PDFBox
Issue Type: Bug
Reporter: Franken
Attachments: TextStripperTest.kt, TextStripperUsesTransformationMatrix.java, ec_2.fixed.pdf, image-2022-04-23-14-46-34-929.png
*Given*
The attached PDF uses the cm operator to scale the current transformation matrix (CTM) by a factor of 0.03. The font size is then set to 282 using the Tf operator.
!image-2022-04-23-14-46-34-929.png|width=389,height=84!
*Error Description*
When PDFTextStripper is used to extract the text from that PDF, the resulting TextPosition objects carry the wrong font size of 282pt. The correct font size would be 10pt. The miscalculation occurs because PDFTextStripper does not scale the font size by the current transformation matrix.
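To illustrate the arithmetic, here is a minimal standalone sketch in plain Java that does not use PDFBox. The class and method names (EffectiveFontSize, scalingFactorX, effectiveFontSize) are hypothetical, and the text matrix scale of 1 is an assumption, since the text matrix of the sample file is not reproduced in this report. Any scaling in the text matrix would change the result accordingly.

```java
public class EffectiveFontSize {

    /**
     * Horizontal scaling of a matrix [a b c d e f]: the length of the
     * transformed x unit vector. For an unrotated matrix this is |a|.
     */
    public static float scalingFactorX(float a, float b) {
        return (float) Math.sqrt(a * a + b * b);
    }

    /** Font size after applying the x scaling of the text matrix and the CTM. */
    public static float effectiveFontSize(float fontSize, float textScaleX, float ctmScaleX) {
        return fontSize * textScaleX * ctmScaleX;
    }

    public static void main(String[] args) {
        float fontSize = 282f;                        // Tf operand from the report
        float ctmScaleX = scalingFactorX(0.03f, 0f);  // cm scales by 0.03
        float textScaleX = 1f;                        // assumption: unscaled text matrix

        // Text matrix only: the CTM is ignored, so the raw Tf size is kept.
        float withoutCtm = fontSize * textScaleX;
        // Text matrix and CTM: the size is reduced by the cm scale as well.
        float withCtm = effectiveFontSize(fontSize, textScaleX, ctmScaleX);

        System.out.println("without CTM: " + withoutCtm);
        System.out.println("with CTM:    " + withCtm);
    }
}
```

Under these assumptions the value computed without the CTM stays at the raw Tf size of 282, while multiplying by the CTM scale brings it down to the order of magnitude of the rendered text.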
*Proposed fix*
The bug can be fixed in the showGlyph function in LegacyPDFStreamEngine.java. There, fontSizeInPt must be calculated using the following code:
{code:java}
processTextPosition(new TextPosition(pageRotation, pageSize.getWidth(),
        pageSize.getHeight(), translatedTextRenderingMatrix, nextX, nextY,
        Math.abs(dyDisplay), dxDisplay,
        Math.abs(spaceWidthDisplay), unicodeMapping, new int[] { code }, font,
        fontSize,
        // fontSizeInPt: scale by the CTM as well as by the text matrix
        (int) (fontSize * textMatrix.getScalingFactorX()
                * getGraphicsState().getCurrentTransformationMatrix().getScalingFactorX())));
{code}
*Further remarks*
To make the error easy to triage, I attached a unit test and a sample file. The sample was manually edited to remove all unnecessary data and then repaired with qpdf. However, I redacted only the content stream; other objects in the PDF are still present, so the file is fairly large. As I mainly program in Kotlin, I attached the original version of the test I used to debug the issue. A Java version is attached as well.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)