Posted to dev@tika.apache.org by "Alex Andrushchak (JIRA)" <ji...@apache.org> on 2014/05/02 17:01:27 UTC

[jira] [Created] (TIKA-1289) Ligatures convert on text extraction

Alex Andrushchak created TIKA-1289:
--------------------------------------

             Summary: Ligatures convert on text extraction
                 Key: TIKA-1289
                 URL: https://issues.apache.org/jira/browse/TIKA-1289
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.5
         Environment: win 8, jre 1.5
            Reporter: Alex Andrushchak


From reviewing the Tika sources, Tika uses PDFBox to parse PDF files.
I found that PDFBox itself uses ICU4J to handle ligatures.
Unfortunately, when I added the ICU4J jar to my classpath, nothing changed: ligatures are still not converted in the extracted text. A sample PDF file is attached.
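
For illustration, here is a minimal sketch of the conversion I would expect on the extracted text. It does not use Tika or PDFBox at all and is not how either library handles ligatures internally. It only applies Unicode NFKC normalization from the JDK (java.text.Normalizer, which needs Java 6 or later) to a string that contains ligature code points. The class name LigatureSketch and the helper expandLigatures are made up for this example.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of Tika or PDFBox.
class LigatureSketch {

    // Assumption: NFKC normalization is an acceptable stand-in for
    // ligature conversion. It expands the Latin ligatures in the
    // U+FB00..U+FB06 range (ff, fi, fl, ffi, ffl, ...) into plain letters.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature, U+FB04 is the "ffl" ligature.
        String extracted = "\uFB01le o\uFB04ine";
        System.out.println(expandLigatures(extracted));
    }
}
```

Note that NFKC also rewrites other compatibility characters (for example full-width forms and superscript digits), so this may change more than just ligatures.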



--
This message was sent by Atlassian JIRA
(v6.2#6252)