Posted to dev@pdfbox.apache.org by "dariusz dusberger (JIRA)" <ji...@apache.org> on 2015/09/22 03:27:04 UTC

[jira] [Updated] (PDFBOX-2984) PDFTextStripper adds extra word/line delimiters when PDF page orientation is 180 degrees

     [ https://issues.apache.org/jira/browse/PDFBOX-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dariusz dusberger updated PDFBOX-2984:
--------------------------------------
    Summary: PDFTextStripper adds extra word/line delimiters when PDF page orientation is 180 degrees  (was: PDFTextStripper adds word/line delimiters when PDF page orientation is 180 degrees)

> PDFTextStripper adds extra word/line delimiters when PDF page orientation is 180 degrees
> ----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2984
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2984
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.10
>         Environment: Windows/Linux, JDK 1.7
>            Reporter: dariusz dusberger
>         Attachments: 1760_001.pdf
>
>
> PDFTextStripper inserts a word delimiter between every character and a newline after each word when the page orientation is 180 degrees.
> This happens because PDFStreamEngine takes the raw scaling factor Matrix.getXScale() from the transformation matrix to scale the width and font size that are used to calculate the spacing between characters. For a 180-degree rotation that factor is negative.
> =========================================================
> Output of the PDFTextStripper.getText(pdDoc);
> T h i s  i s  
> a  t e s t  1  ! ! !
> T h i s  
> i s  
> a  t e s t  
> 2  
> ! ! !
> T h i s  i s  
> a  
> t e s t  3  
> ! ! !
> T h i s  i s  
> a  t e s t  4 ! ! !
> =========================================================
> Example: the following results in a negative spaceWidthDisp and a negative font size in PDFTextStripper.
> 180 degrees = [-1, 0, 0; 0, -1, 0; w, h, 1]; therefore textMatrix.getXScale() == -1
> float spaceWidthDisp = spaceWidthText * fontSizeText * horizontalScalingText * textMatrix.getXScale() * ctm.getXScale();
> Font size: fontSizeText * textMatrix.getXScale()
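
The arithmetic in the example can be reproduced without PDFBox. The sketch below is only an illustration under stated assumptions: the class RotatedScaleSketch, its helper methods, and the sample values (space width 0.278, font size 12, identity CTM) are hypothetical and are not taken from the PDFBox source or from the attached PDF. It contrasts the raw x-scale entry of a 180-degree text matrix with a rotation-independent magnitude, which is one possible way to keep the computed space width positive, not necessarily the fix the project would adopt.

```java
public class RotatedScaleSketch {

    // Hypothetical stand-in for the raw scale described in the report:
    // the top-left matrix entry 'a' is used as-is, sign included.
    static float rawXScale(float a) {
        return a;
    }

    // Hypothetical rotation-independent alternative: the length of the
    // first matrix row (a, b), which is never negative.
    static float magnitudeXScale(float a, float b) {
        return (float) Math.sqrt(a * a + b * b);
    }

    // Same product as the spaceWidthDisp expression quoted in the report.
    static float spaceWidthDisp(float spaceWidthText, float fontSizeText,
                                float horizontalScalingText,
                                float textXScale, float ctmXScale) {
        return spaceWidthText * fontSizeText * horizontalScalingText
                * textXScale * ctmXScale;
    }

    public static void main(String[] args) {
        // Text matrix for a 180-degree rotation: a = cos(180) = -1, b = sin(180) = 0.
        float a = -1f;
        float b = 0f;

        // Illustrative values only (assumptions, not read from 1760_001.pdf).
        float spaceWidthText = 0.278f;
        float fontSizeText = 12f;
        float horizontalScalingText = 1f;
        float ctmXScale = 1f;

        float withRaw = spaceWidthDisp(spaceWidthText, fontSizeText,
                horizontalScalingText, rawXScale(a), ctmXScale);
        float withMagnitude = spaceWidthDisp(spaceWidthText, fontSizeText,
                horizontalScalingText, magnitudeXScale(a, b), ctmXScale);

        System.out.println("raw scale       -> spaceWidthDisp = " + withRaw);
        System.out.println("magnitude scale -> spaceWidthDisp = " + withMagnitude);
    }
}
```

With the raw scale the space width comes out negative, so any "gap larger than a space" comparison in the stripper would succeed for every character, which is consistent with the output shown above. With the magnitude the width stays positive.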



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
