Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org> on 2011/11/06 17:42:51 UTC

[jira] [Resolved] (PDFBOX-5) CJK decoding

     [ https://issues.apache.org/jira/browse/PDFBOX-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-5.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

PDFBox can't extract the whole text from chinese.pdf because the file doesn't provide character mappings for some of the fonts it uses. Even Acrobat Reader can't extract the whole text.
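
To illustrate why a missing character mapping blocks text extraction, here is a minimal Java sketch. It does not use the PDFBox API; the class name ToUnicodeSketch, the decode method, and the sample codes are all hypothetical. It assumes a font is modeled only as an optional table from character codes to Unicode strings, and it emits U+FFFD wherever no mapping exists.

```java
import java.util.HashMap;
import java.util.Map;

public class ToUnicodeSketch {

    // Hypothetical stand-in for a font's code-to-Unicode table (not a PDFBox class).
    // A null table models a font that ships no character mapping at all.
    public static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String mapped = (toUnicode == null) ? null : toUnicode.get(code);
            // Without a mapping the glyph can still be drawn, but its text is unknown,
            // so this sketch substitutes the replacement character U+FFFD.
            out.append(mapped != null ? mapped : "\uFFFD");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: two character codes mapped to U+4E2D and U+6587.
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(0x0001, "\u4E2D");
        mapping.put(0x0002, "\u6587");

        int[] codes = {0x0001, 0x0002};
        System.out.println(decode(codes, mapping)); // font with a mapping
        System.out.println(decode(codes, null));    // font without any mapping
    }
}
```

In this sketch the same codes decode to readable text with the table and to replacement characters without it, which mirrors the situation described for chinese.pdf: the glyphs can be rendered, but the text behind them cannot be recovered.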

Set to resolved.
                
> CJK decoding
> ------------
>
>                 Key: PDFBOX-5
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.7.0
>
>         Attachments: PDFBOX5-CJK.zip
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=765686
> Originally submitted by bguan on 2003-07-03 17:57.
> Another feature I need a lot is the correct interpretation 
> of CJK encoding.
> Yes, I know PDF can be a pain when it comes to 
> correctly interpreting CJK charsets, as many factors are 
> involved, including whether a font (or its subset) is 
> embedded or not.
> Attached is a simple Korean PDF that so far has not
> been correctly interpreted by any Java-based
> open-source library, though it is rendered
> correctly by XPDF on Linux and also on Windows.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=765686&file_id=80181
> CJK.zip, 142061 bytes
> CJK PDF, output and test program
> [comment on SourceForge]
> Originally sent by bguan.
> Logged In: YES 
> user_id=815589
> Hello Ben,
> Thanks for the response.  I just downloaded PDFBox 0.6.5 and 
> wrote a little sample program to test it against 3 CJK PDF files 
> I have, and the output is still no good.  I have attached my 
> sample program, the 3 PDFs and the output in the attached 
> zip file.
> Can you tell me what I am doing wrong?
> The PDF files were generated with Adobe Acrobat 5.0,
> using embedded fonts, I believe.
> Thank you.
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> There was no attachment with this.  I have done some CJK
> work in the 0.6.5 release.  Please attach the document and I
> can take a look at it. (Make sure you check the 'attach file'
> checkbox.)
> Ben

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira