Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/12/08 20:53:12 UTC
[jira] [Commented] (PDFBOX-2548) Problems with character extraction (fi ligature)

    [ https://issues.apache.org/jira/browse/PDFBOX-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238369#comment-14238369 ] 

John Hewson commented on PDFBOX-2548:
-------------------------------------

The embedded text in this PDF really does contain spaces after some of the ligatures, e.g. "Speziﬁ zierung", and Adobe Acrobat extracts the text with those spaces, exactly as PDFBox does. Foxit does the same, but OS X Preview strips the space, which gives the correct result: "Speziﬁzierung".

Here are the text drawing commands for "Speziﬁ zierung" as shown in Adobe Preflight's PDF structure viewer:
!preflight.png!

These commands have the meaning:

0: Draw text "Speziﬁ"
1: Subtract 305.505 units from the x-position (move _backwards_ approx 0.3em, roughly the width of a space)
2: Draw text " " (space)
3: Subtract -20.3063 units from the x-position (move _forwards_ approx 0.02em, this is a kern)
4: Draw text "zierung des logisch-historischen"

So the space is overlaid on top of the "fi" ligature. Needless to say, this is a very unusual technique which does not result in proper text embedding.
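
To make the arithmetic concrete, here is a minimal Java sketch of how the two numeric adjustments move the text position. It assumes the usual PDF rule that a number in a text-showing array is subtracted from the current position in thousandths of the font size, with no horizontal scaling. The class and method names are made up for illustration, and the 0.25em space width is a hypothetical value, not one measured from test.pdf.

```java
import java.util.Locale;

final class TjAdjustmentSketch {

    /** Horizontal move in em caused by one numeric adjustment (positive values move left). */
    static double displacementEm(double adjustment) {
        return -adjustment / 1000.0;
    }

    /**
     * Start of the text following the space, relative to the end of the
     * ligature (0.0), for a given space glyph width in em.
     */
    static double nextTextStartEm(double spaceWidthEm) {
        double spaceStart = displacementEm(305.505);      // command 1: move backwards
        double spaceEnd = spaceStart + spaceWidthEm;      // command 2: draw the space
        return spaceEnd + displacementEm(-20.3063);       // command 3: small forward kern
    }

    public static void main(String[] args) {
        double spaceWidthEm = 0.25; // hypothetical width, not taken from the font
        System.out.printf(Locale.ROOT, "space starts at %.4f em%n", displacementEm(305.505));
        System.out.printf(Locale.ROOT, "next text starts at %.4f em%n", nextTextStartEm(spaceWidthEm));
    }
}
```

Under these assumptions the space starts about 0.3em to the left of where the ligature ended, and the following text starts close to the ligature's right edge, so no visible gap appears on the page even though a space character is present in the text.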

Given that Acrobat produces the same result, and I don't see any simple way to fix this (one could imagine some complex solution), I'm going to close this issue as "not a problem".
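
For anyone who needs a workaround on their own side, one option is to post-process the extracted string. The following is only a rough sketch of such a heuristic, not something PDFBox does: it drops a single space that directly follows a Latin ligature character when a lowercase letter comes next, and can then expand the ligatures using Unicode NFKC normalization. The class and method names are hypothetical. Note that it will also join words where the space after a ligature was genuine, and that NFKC changes other compatibility characters besides ligatures.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class LigatureSpaceCleanup {

    // A Latin ligature (U+FB00..U+FB06), one space, then a lowercase letter.
    private static final Pattern SPACE_AFTER_LIGATURE =
            Pattern.compile("([\\uFB00-\\uFB06]) (?=\\p{Ll})");

    /** Removes the single space that follows a ligature inside a word (heuristic). */
    static String stripSpaceAfterLigature(String text) {
        return SPACE_AFTER_LIGATURE.matcher(text).replaceAll("$1");
    }

    /** Replaces compatibility characters such as U+FB01 with plain letters ("fi"). */
    static String expandLigatures(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String extracted = "Spezi\uFB01 zierung";
        String joined = stripSpaceAfterLigature(extracted);
        System.out.println(expandLigatures(joined));
    }
}
```

The two steps are kept separate so that the space removal can be used on its own when the ligature characters themselves should be preserved.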

> Problems with character extraction (fi ligature)
> ------------------------------------------------
>
>                 Key: PDFBOX-2548
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2548
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>         Environment: Windows7Professional JavaSE8 EclipseKepler
>            Reporter: Matthias Bösinger
>            Priority: Minor
>         Attachments: preflight.png, test.pdf, test2.pdf
>
>
> I have a PDF document whose font type is OpenType (Garamond OpenType). So the PDFBox text extraction can also extract special characters (for example small capital letters), which caused problems when the underlying font was a simple Type1 font.
> However, the text extraction now causes another type of problem. In my case, when the character sequences "fi" or "fl" occur in the text, PDFTextStripper#getText(PDDocument doc) extracts them as the single characters 'ﬁ' and 'ﬂ' and puts a space character on their right side.
> (Surprisingly, if I access the list of characters of a page via the charactersByArticle field of PDFTextStripper or via the PDFTextStripper#processText(TextPosition pos) method, the same characters show up as 'normal' single characters f i / f l.)
> My assumption is that the advantage of the underlying OpenType font turns into this particular disadvantage, because PDFTextStripper recognizes the character sequences f i / f l as the special characters ﬁ / ﬂ (which might have to do with the fact that the getText() method calculates things like whitespace characters from distances / positional placements).
> Background: The given document is a wordbook with very densely printed text.
> see this link for code and output:
> http://stackoverflow.com/questions/27333499/problems-with-extracting-opentypefont-text-using-pdfbox
> My question: is there anything I can do to avoid this problem?
> thanks in advance ...


