Posted to dev@pdfbox.apache.org by "Atsuo Ishimoto (JIRA)" <ji...@apache.org> on 2010/03/10 07:39:27 UTC
[jira] Updated: (PDFBOX-654) Extracting CJK text
[ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Atsuo Ishimoto updated PDFBOX-654:
----------------------------------
Attachment: identity-h.patch
> Extracting CJK text
> -------------------
>
> Key: PDFBOX-654
> URL: https://issues.apache.org/jira/browse/PDFBOX-654
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Reporter: Atsuo Ishimoto
> Attachments: identity-h.patch
>
>
> This is an update for PDFBOX-420 filed by Takashi Komatsubara.
> In this patch, if "Identity-H" is used as the font's encoding and the font does not supply a TO_UNICODE table, the encoding name is generated from the CID information (Registry and Ordering). This idea is borrowed from pdfminer[1], another PDF library, written in Python. I don't see any test failures with this patch.
> I published this patch last year[2] and got some good feedback from Japanese users[3].
> [1] http://www.unixuser.org/~euske/python/pdfminer/index.html
> [2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja,
> https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
> [3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539
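
The fallback described in the issue could look roughly like the Java sketch below. It is not taken from identity-h.patch: the class name, the method fallbackEncodingName, its parameters, and the choice to join Registry and Ordering with a hyphen (for example "Adobe" and "Japan1") are all assumptions made for illustration. The naming scheme the patch actually uses is not stated in this message.

```java
import java.util.Optional;

public class CidEncodingNameSketch {

    /**
     * Hypothetical helper, not part of PDFBox: derives a fallback encoding
     * name from the CID system info when the font uses Identity-H and has
     * no ToUnicode table.
     */
    static Optional<String> fallbackEncodingName(String encoding,
                                                 boolean hasToUnicode,
                                                 String registry,
                                                 String ordering) {
        // The fallback only applies to Identity-H fonts without ToUnicode.
        if (!"Identity-H".equals(encoding) || hasToUnicode) {
            return Optional.empty();
        }
        // Without both Registry and Ordering there is nothing to derive from.
        if (registry == null || ordering == null) {
            return Optional.empty();
        }
        // Assumed naming scheme: "<Registry>-<Ordering>".
        return Optional.of(registry + "-" + ordering);
    }

    public static void main(String[] args) {
        System.out.println(
            fallbackEncodingName("Identity-H", false, "Adobe", "Japan1"));
        System.out.println(
            fallbackEncodingName("Identity-H", true, "Adobe", "Japan1"));
    }
}
```

Returning an empty Optional when a ToUnicode table is present keeps the font's own mapping authoritative, so the derived name is only consulted as a last resort.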
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.