Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/08/12 12:25:22 UTC

[jira] [Commented] (TIKA-2054) Problem with ligatures converting from PDF to HTML with Tika

    [ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418755#comment-15418755 ] 

Tim Allison commented on TIKA-2054:
-----------------------------------

I don't think we want to modify our SafeContentHandler to stop converting control characters.

This is difficult.  If I understand correctly, PDFBox complains that there is no Unicode mapping for the ligature glyphs in the embedded fonts:

{noformat}
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_i (31) in font XOILAG+MyriadPro-Bold
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_i (31) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_f (30) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:21 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_l (29) in font XOILAG+MyriadPro-Regular
Aug 12, 2016 8:03:22 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_f_i (28) in font XOILAG+MyriadPro-Regular
{noformat}

So "fi" is being mapped to "0x1f" (31), "ff" to "0x1e" (30), and, as you point out, you can recover these by a custom mapping in the output of PDFBox.  Tika via its SafeContentHandler converts most chars < 0x20 to '\ufffd'.

Adobe Reader seems to do the same thing that PDFBox does, but Microsoft Edge is able to extract, e.g., "confidentiality" correctly...not sure how that is happening?!


> Problem with ligatures converting from PDF to HTML with Tika
> ------------------------------------------------------------
>
>                 Key: TIKA-2054
>                 URL: https://issues.apache.org/jira/browse/TIKA-2054
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11, 1.13
>            Reporter: Angela O
>         Attachments: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
>
>
> When converting certain PDFs from PDF to HTML, I am having trouble with ligature characters being displayed as U+FFFD � REPLACEMENT CHARACTER.
> I have tried using Apache Tika 1.11 and 1.13, converting on the command line using the .jar, and get the same results.
> If I use pdfbox-app-2.0.1.jar and 'ExtractText' with the icu4j-57_1.jar in the path, and I convert to text rather than HTML, then I am at least able to preserve information about what each ligature was originally, even if they are still represented as unprintable characters.
> I.e. if I run the following from the command line:
> java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'
> Then the resulting test.txt, when viewed in Sublime2, has "fi" represented as US (unit separator), "ff" represented as RS (record separator), "fl" represented as GS (group separator), and "ffl" represented as FS (file separator), which I could then replace with the appropriate characters.
> I was under the impression that Tika uses icu4j. Is there a way to get the same behaviour I see with PDFBox when converting from PDF to HTML with Tika?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)