Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/01/31 16:57:51 UTC

[jira] (TIKA-2256) Japanese character substituted when reading PDF

    [ https://issues.apache.org/jira/browse/TIKA-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847122#comment-15847122 ] 

Tim Allison commented on TIKA-2256:
-----------------------------------

Thank you for opening this issue.

On Windows, copying and pasting from Adobe Acrobat into Microsoft Word yields U+2F47.  Opening the file in Microsoft Word and using its conversion function also yields U+2F47.  Acrobat's "Save as Text" function yields junk.

More relevant, though, is that PDFBox also yields U+2F47 (see [Troubleshooting Tika|https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems]).

Note, too, the character mapping in the file itself, which explicitly maps code 0cd6 to U+2F47:

{noformat}
3 beginbfrange
<07a2><07a2><8a9e>
<0cd6><0cd6><2f47>
<0e8c><0e8c><672c>
endbfrange
{noformat}

In short, the extracted text matches what the PDF itself specifies, so I don't think there's anything we can do.
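
For anyone who wants to check the mapping by hand, here is a minimal Java sketch that expands bfrange entries of the simple form <srcLo><srcHi><dstStart> into a code-to-Unicode map. The class name BfRangeDemo and the methods parseBfRange and nfkc are hypothetical names made up for this illustration; they are not part of Tika or PDFBox. The sketch only handles a single four-hex-digit destination and ignores the array form of bfrange. It also shows that U+2F47 (KANGXI RADICAL SUN) folds to U+65E5 under Unicode NFKC normalization. Whether a caller wants to apply that folding downstream is an assumption about their use case, not something this issue proposes.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
public class BfRangeDemo {

    // Matches the simple form <srcLo><srcHi><dstStart> with a single
    // four-hex-digit destination. The array form
    // <srcLo><srcHi>[<d1> <d2> ...] is deliberately not handled.
    private static final Pattern ENTRY = Pattern.compile(
        "<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]{4})>");

    // Expands each range into individual character-code -> code-point pairs.
    public static Map<Integer, Integer> parseBfRange(String text) {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            int lo = Integer.parseInt(m.group(1), 16);
            int hi = Integer.parseInt(m.group(2), 16);
            int dst = Integer.parseInt(m.group(3), 16);
            for (int code = lo; code <= hi; code++) {
                map.put(code, dst + (code - lo));
            }
        }
        return map;
    }

    // Returns the NFKC-normalized form of a single code point.
    public static String nfkc(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The bfrange section quoted from the attached PDF.
        String cmap = "3 beginbfrange\n"
            + "<07a2><07a2><8a9e>\n"
            + "<0cd6><0cd6><2f47>\n"
            + "<0e8c><0e8c><672c>\n"
            + "endbfrange\n";
        for (Map.Entry<Integer, Integer> e : parseBfRange(cmap).entrySet()) {
            int mapped = e.getValue();
            int folded = nfkc(mapped).codePointAt(0);
            System.out.printf("code 0x%04x -> U+%04X (NFKC: U+%04X)%n",
                e.getKey(), mapped, folded);
        }
    }
}
```

Running main on the three entries above lists the code point each character code maps to, alongside its NFKC form, which makes it easy to see that only the 0cd6 entry changes under normalization.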

> Japanese character substituted when reading PDF
> -----------------------------------------------
>
>                 Key: TIKA-2256
>                 URL: https://issues.apache.org/jira/browse/TIKA-2256
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>         Attachments: mixed-fonts.pdf
>
>
> The attached file contains “日本語” in its first line. It was created on Mac OS X 10.11.6 by selecting “Save As PDF” in the system print dialog started from Microsoft Word.
> Reading the text from the PDF, the first character is not read as U+65E5, but as U+2F47. Copy & paste from Preview.App results in the correct U+65E5 being copied. (The characters look the same in some fonts, but are different.)
> The MATLAB code used for reading looks as follows:
>   handler = org.apache.tika.sax.ToXMLContentHandler;
>   parser = org.apache.tika.parser.AutoDetectParser;
>   metadata = org.apache.tika.metadata.Metadata;
>   fh = java.io.FileInputStream(fullname);
>   parser.parse(fh, handler, metadata);
>   s = handler.toString;


