Posted to dev@tika.apache.org by "Alex Andrushchak (JIRA)" <ji...@apache.org> on 2014/05/02 17:01:27 UTC
[jira] [Created] (TIKA-1289) Ligatures convert on text extraction
Alex Andrushchak created TIKA-1289:
--------------------------------------
Summary: Ligatures convert on text extraction
Key: TIKA-1289
URL: https://issues.apache.org/jira/browse/TIKA-1289
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5
Environment: Windows 8, JRE 1.5
Reporter: Alex Andrushchak
From reviewing the Tika sources, I see that Tika uses PDFBox to parse PDF files.
I found that PDFBox itself uses ICU4J to handle ligatures.
Unfortunately, when I added the ICU4J jar to my classpath, nothing changed: ligatures are still not converted. A sample PDF file is attached.
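
For reference, the conversion expected here (for example, the single character U+FB01 becoming the two letters "fi") can be sketched as a post-processing step on the extracted text. This is only a possible workaround under stated assumptions, not how Tika or PDFBox do it internally: the class and method names are hypothetical, the sample string stands in for the extracted text, and it relies on java.text.Normalizer, which is only available from Java 6 onward.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Tika or PDFBox.
public class LigatureNormalizer {

    // Applies Unicode compatibility normalization (NFKC), which decomposes
    // ligature code points such as U+FB01 into their component letters.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Stand-in for text returned by the parser: U+FB01 is the "fi"
        // ligature and U+FB03 is the "ffi" ligature.
        String extracted = "\uFB01nal o\uFB03ce";
        System.out.println(expandLigatures(extracted)); // prints "final office"
    }
}
```

Note that NFKC is broader than ligature handling: it also folds other compatibility characters (full-width forms, superscript digits and so on), so it may change more of the text than intended.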
--
This message was sent by Atlassian JIRA
(v6.2#6252)