Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2015/11/24 20:47:11 UTC

[jira] [Commented] (PDFBOX-3130) Recent regression in PDFTextStripper, text getting garbled

    [ https://issues.apache.org/jira/browse/PDFBOX-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15025206#comment-15025206 ] 

Tilman Hausherr commented on PDFBOX-3130:
-----------------------------------------

You have used the sort option. Without it, the text would all appear on one line (which is still wrong).

The root cause is that your file has an invalid font BBox. See {{Root/Pages/Kids/\[0]/Resources/Font/F0/FontDescriptor/FontBBox}}. I remember seeing a similarly weird BBox before, in PDFBOX-2158.

It is not really a regression, although it only showed up recently, because we now use the BBox from the PDF and not the one from the font.
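
To illustrate the kind of sanity check that would flag a bad FontBBox, here is a minimal sketch in plain Java. The comment above does not say what exactly is wrong with this file's BBox, so the criterion used here (four finite numbers in the order llx, lly, urx, ury, with positive width and height) is an assumption. The class FontBBoxCheck and its methods are hypothetical and not part of PDFBox.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of PDFBox): sanity-checks a /FontBBox array. */
final class FontBBoxCheck {

    private FontBBoxCheck() {
    }

    /**
     * Returns true if bbox looks like a usable rectangle [llx, lly, urx, ury].
     * Assumption: "usable" means exactly four finite numbers that span a
     * positive width and a positive height.
     */
    static boolean isPlausible(float[] bbox) {
        if (bbox == null || bbox.length != 4) {
            return false;
        }
        for (float v : bbox) {
            if (Float.isNaN(v) || Float.isInfinite(v)) {
                return false;
            }
        }
        float width = bbox[2] - bbox[0];
        float height = bbox[3] - bbox[1];
        return width > 0 && height > 0;
    }

    /** Formats the array together with the verdict, for logging. */
    static String describe(float[] bbox) {
        return Arrays.toString(bbox)
                + (isPlausible(bbox) ? " looks plausible" : " looks invalid");
    }
}
```

A caller could fall back to the bounding box of the embedded font program whenever isPlausible returns false for the values taken from the FontDescriptor. That fallback is a suggestion only, not a description of what PDFBox does.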

> Recent regression in PDFTextStripper, text getting garbled
> ----------------------------------------------------------
>
>                 Key: PDFBOX-3130
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3130
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Fred Andrews
>         Attachments: garbled text.pdf
>
>
> Text extraction using PrintTextLocations is getting garbled characters in the attached snippet. 
> For this file it is getting one string of "2O(Er4env vqeheurosriAurseirueeass ss/Ct:7:rh adaliaargynse csr eadc+cit6e l1ipc te+2en 6d9c1)9e 91 2933"
> This test case is about as small as I could make it and still show the problem; when I reduced the file to just one line of text, the text came through correctly.
> This problem shows up in RC2 and the latest development build.  I believe it was OK in the development build from Nov 4.


