Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/06/17 23:19:07 UTC

[jira] [Comment Edited] (PDFBOX-970) TeX-created ligatures and umlauts are not recognised

    [ https://issues.apache.org/jira/browse/PDFBOX-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034375#comment-14034375 ] 

John Hewson edited comment on PDFBOX-970 at 6/17/14 9:18 PM:
-------------------------------------------------------------

-I'm not getting combined characters for the umlaut with 2.0 trunk-. Interestingly enough, Adobe Acrobat strips the umlaut and OSX Preview extracts it as "fu ̈r", so it's not clear that we really need to be trying to combine it.

Update: Passing {{-encoding "UTF-8"}} to ExtractText gets me the combined characters as expected.
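
For reference, the difference between the decomposed and the combined form can be sketched with the JDK's {{java.text.Normalizer}}. This is only an illustration and not a claim about what PDFBox does internally. The class name {{ComposeDemo}} and the sample string are made up for the example, and it assumes the base letter is immediately followed by U+0308 (COMBINING DIAERESIS). The Preview output quoted above appears to have a space in between, which NFC would not compose.

```java
import java.text.Normalizer;

public class ComposeDemo {
    // Compose "base letter + combining mark" sequences into precomposed
    // characters (Unicode Normalization Form C).
    static String compose(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Hypothetical extracted text: 'u' followed by U+0308 COMBINING DIAERESIS
        String decomposed = "fu\u0308r";
        String composed = compose(decomposed);

        // The decomposed form has 4 chars, the composed form 3
        System.out.println(decomposed.length() + " -> " + composed.length());
        // true: the result is 'f', U+00FC (u with diaeresis), 'r'
        System.out.println(composed.equals("f\u00fcr"));
    }
}
```

If the extracted text arrives in decomposed form, a step like this could be applied after extraction. Whether that is desirable is a separate question, given how Acrobat and Preview behave.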


was (Author: jahewson):
I'm not getting combined characters for the umlaut with 2.0 trunk. Interestingly enough, Adobe Acrobat strips the umlaut and OSX Preview extracts it as "fu ̈r", so it's not clear that we really need to be trying to combine it.

> TeX-created ligatures and umlauts are not recognised
> ----------------------------------------------------
>
>                 Key: PDFBOX-970
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-970
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.5.0
>         Environment: Mac OS X 10.6.6, Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
>            Reporter: Thomas Fischer
>              Labels: textExtraction
>         Attachments: A Python Library for Provenance Recording and Querying.txt, A Python Library for Provenance Recording and Querying.txt, Test.pdf, Test.pdf, Test2-1.6.txt, Test2.1.4.txt, Test2.pdf
>
>
> Ligatures in a TeX-created document are lost in v. 1.5, although v. 1.4 recognised them, e.g.
>   1.4         1.5
>   official    ocial
>   effort      e ort
>   fields      elds
>   first       rst
> In addition, German umlauts (ä, ö, ü) are represented as ( a,  o,  u), 


