Posted to dev@pdfbox.apache.org by "Vojtech Knyttl (JIRA)" <ji...@apache.org> on 2016/02/18 21:43:18 UTC

[jira] [Commented] (PDFBOX-2740) Text extraction failed on Korean PDF

    [ https://issues.apache.org/jira/browse/PDFBOX-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153050#comment-15153050 ] 

Vojtech Knyttl commented on PDFBOX-2740:
----------------------------------------

I am having exactly the same issue and I can provide the document here: http://pub.goout.cz/malformed_parse.pdf

With 1.8.11, the text is extracted as gibberish made up of random characters.
With 2.0.0-RC3, the extracted text is empty apart from a newline at the end of each page.

> Text extraction failed on Korean PDF
> ------------------------------------
>
>                 Key: PDFBOX-2740
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2740
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7, 1.8.8, 1.8.9, 2.0.0
>            Reporter: Julien Ortega
>         Attachments: g_KO_201506-ReaderDC-cutAndPaste.txt, g_KO_201506-ReaderDC-saveAsText.txt, g_KO_201506.pdf, g_KO_201506.txt
>
>
> Trying to extract text from a Korean PDF gives me a lot of warnings:
> WARNING: No Unicode mapping for US (33) in font DVCAYA+WtKoBaeumMyungjoL063zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for NAK (33) in font JYLDGG+WtKoBaeumMyungjoL053zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for RS (38) in font WRYULE+WtKoBaeumMyungjoL013zb4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for DEL (33) in font FZEFOY+WtKoBaeumGothicL0422b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> avr. 01, 2015 12:05:32 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for SOH (33) in font OOLNBG+WtKoBaeumGothicL0122b4?Pw
> and the result is not readable. The PDF evidently contains the necessary conversion table, because every PDF reader (desktop or mobile) lets me copy and paste the text without problems.
>  
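
To see at a glance which fonts are affected, the "No Unicode mapping" warnings quoted above can be tallied per font. The following is only a minimal sketch using the Java standard library: the class name WarningTally and the method countMissingMappingsByFont are hypothetical helpers, not part of PDFBox, and the regular expression assumes the warning format shown in the log excerpt.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: groups the quoted warnings by font.
final class WarningTally {

    // Assumed format, taken from the log excerpt in this issue:
    // "WARNING: No Unicode mapping for <name> (<code>) in font <fontName>"
    private static final Pattern MISSING_MAPPING = Pattern.compile(
            "WARNING: No Unicode mapping for (\\S+) \\((\\d+)\\) in font (\\S+)");

    // Returns font name -> number of "No Unicode mapping" warnings,
    // in order of first appearance. Other lines are ignored.
    public static Map<String, Integer> countMissingMappingsByFont(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            // find() rather than matches(), so a leading "> " quote prefix is tolerated
            Matcher m = MISSING_MAPPING.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "WARNING: No Unicode mapping for US (33) in font DVCAYA+WtKoBaeumMyungjoL063zb4?Pw",
                "WARNING: Invalid ToUnicode CMap in font FZEFOY+WtKoBaeumGothicL0422b4?Pw",
                "WARNING: No Unicode mapping for DEL (33) in font FZEFOY+WtKoBaeumGothicL0422b4?Pw");
        countMissingMappingsByFont(sample)
                .forEach((font, n) -> System.out.println(font + ": " + n));
    }
}
```

Lines that do not match, such as the timestamp lines and the "Invalid ToUnicode CMap" warnings, are skipped, so the result only reflects fonts for which a character code could not be mapped.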



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
