Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2015/11/01 20:47:27 UTC
[jira] [Comment Edited] (PDFBOX-3066) Text getting garbled in this file, was Ok in 1.8

    [ https://issues.apache.org/jira/browse/PDFBOX-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984491#comment-14984491 ] 

John Hewson edited comment on PDFBOX-3066 at 11/1/15 7:46 PM:
--------------------------------------------------------------

OS X Preview and Foxit extract the text as ")*+,-./012)456", but Acrobat extracts the correct text. The encoding in this CFF font is definitely corrupt. Acrobat is doing some black magic to correct it, so Adobe must know something that we don't, because I can't see anything in the file that would tell us how to detect and fix the problem.

Some interesting observations:

- the Font dictionary in the PDF has no Encoding or Flags entry. Note that Flags are required and affect encoding.
- the CFF font does not specify a Charset, so the default is used. This behaviour is described in the CFF spec, so it's normal, but still worth noting.
- the CFF font contains a valid format 0 encoding, but it doesn't match what we expect.
- the issue with the encoding isn't a simple constant-offset problem: e.g. adding 7 to each SID yields "01234567890;<=", which is still incorrect. Rendering is perfect, so this isn't an encoding or charset bug in PDFBox; it only affects text extraction.
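
The offset check in the last point is easy to reproduce. This is a minimal sketch, not PDFBox code: the class and method names are made up for illustration, and it shifts the character codes of the extracted string rather than the SIDs inside the CFF font, assuming the two move together for this sample.

```java
public class OffsetCheck {
    // Shift every character of the extracted string by a fixed offset.
    // Hypothetical helper for illustration; it does not read the CFF font.
    public static String shift(String s, int offset) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append((char) (s.charAt(i) + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string that OS X Preview and Foxit extract.
        String extracted = ")*+,-./012)456";
        System.out.println(shift(extracted, 7)); // prints 01234567890;<=
    }
}
```

The shifted result starts out looking like a digit run, but the tail (";<=") shows that a single offset does not map every code to something sensible.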

I don't see how we can detect that such encodings are invalid without raising false positives. If we do find a fix, it can't go in the Encoding or CFF layers, because those layers already provide the correct Encoding for rendering. We would have to add an extra layer that "corrects" the text during the text extraction process itself, perhaps in PDFTextStreamEngine.
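
To make the layering idea concrete, here is a rough sketch of what "correct only on the extraction path" could look like. Everything in it is hypothetical: ExtractionCorrectionSketch and its methods are not PDFBox classes or APIs, and the corrector itself is left as a pluggable function because we don't yet know what the correction rule would be.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch, not part of PDFBox: keeps rendering and extraction separate.
public class ExtractionCorrectionSketch {
    // Pluggable correction rule; the real rule is unknown, so callers supply one.
    private final UnaryOperator<String> corrector;

    public ExtractionCorrectionSketch(UnaryOperator<String> corrector) {
        this.corrector = corrector;
    }

    // Rendering path: the code is passed through unchanged, so glyph
    // selection keeps using the font's own (already working) encoding.
    public int codeForRendering(int code) {
        return code;
    }

    // Extraction path: only the decoded text is passed through the corrector.
    public String textForExtraction(String decoded) {
        return corrector.apply(decoded);
    }
}
```

With UnaryOperator.identity() as the corrector this behaves exactly like today, which is the safe default for fonts where nothing is known to be wrong.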


was (Author: jahewson):
OS X Preview and Foxit extract the text as ")*+,-./012)456" but Acrobat extracts the correct text.

> Text getting garbled in this file, was Ok in 1.8
> ------------------------------------------------
>
>                 Key: PDFBOX-3066
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3066
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Joel Hirsh
>            Assignee: John Hewson
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-3066-reduced.pdf, garbled.pdf
>
>
> In the attached file, PrintTextLocations shows garbled text, like *,%-))’))
> Acrobat copy/paste gives the correct text, and extraction was also fine in 1.8.


