Posted to dev@pdfbox.apache.org by "Atsuo Ishimoto (JIRA)" <ji...@apache.org> on 2010/03/10 07:37:27 UTC

[jira] Created: (PDFBOX-654) Extracting CJK text

Extracting CJK text
-------------------

                 Key: PDFBOX-654
                 URL: https://issues.apache.org/jira/browse/PDFBOX-654
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
            Reporter: Atsuo Ishimoto


This is an update for PDFBOX-420 filed by Takashi Komatsubara.

In this patch, if "Identity-H" is used as the encoding of a font and the font doesn't supply a TO_UNICODE table, then the encoding name is generated from the CID information (Registry and Ordering). This idea is borrowed from pdfminer[1], another PDF library written in Python. I don't see any test failures with this patch.

I published this patch last year[2] and got some good feedback from Japanese users[3].

[1] http://www.unixuser.org/~euske/python/pdfminer/index.html
[2] https://code.launchpad.net/~aishimoto/+junk/pdfbox-ja, 
    https://code.launchpad.net/~aishimoto/+junk/pdfbox-1.0.0-ja
[3] http://d.hatena.ne.jp/atsuoishimoto/20091211/1260533539
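
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the contents of identity-h.patch, which are not shown in this thread. The class and method names are hypothetical, and the "-UCS2" suffix is an assumption based on the naming of Adobe's CID-to-Unicode CMap resources (for example Adobe-Japan1-UCS2); the patch itself may build the name differently.

```java
import java.util.Optional;

/**
 * Hypothetical helper (not part of PDFBox) that mirrors the rule described
 * in the issue: Identity-H encoding + no TO_UNICODE table => derive the
 * encoding name from the CIDSystemInfo Registry and Ordering.
 */
public final class CidEncodingName {

    private CidEncodingName() {
    }

    /**
     * Returns the derived CMap name, or an empty Optional when the fallback
     * does not apply and the font's own data should be used instead.
     */
    public static Optional<String> fallbackCMapName(String encoding,
                                                    boolean hasToUnicode,
                                                    String registry,
                                                    String ordering) {
        // A TO_UNICODE table, when present, already maps codes to Unicode.
        if (hasToUnicode) {
            return Optional.empty();
        }
        // The fallback is only described for the Identity-H encoding.
        if (!"Identity-H".equals(encoding)) {
            return Optional.empty();
        }
        // Without Registry and Ordering there is nothing to derive a name from.
        if (registry == null || ordering == null) {
            return Optional.empty();
        }
        // ASSUMPTION: "-UCS2" follows Adobe's CMap resource naming.
        return Optional.of(registry + "-" + ordering + "-UCS2");
    }

    public static void main(String[] args) {
        // Japanese font without TO_UNICODE: prints "Adobe-Japan1-UCS2"
        System.out.println(
            fallbackCMapName("Identity-H", false, "Adobe", "Japan1").orElse("(none)"));
        // Font that supplies TO_UNICODE: prints "(none)"
        System.out.println(
            fallbackCMapName("Identity-H", true, "Adobe", "GB1").orElse("(none)"));
    }
}
```

Because the name depends only on Registry and Ordering, the same rule would cover Chinese (for example Adobe-GB1, Adobe-CNS1) and Korean (Adobe-Korea1) fonts, provided a matching CMap resource is available.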


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (PDFBOX-654) Extracting CJK text

Posted by Takashi Komatsubara <ta...@gmail.com>.

Sorry, the problem was entirely in my test environment.
With that fixed, the Chinese content was exported perfectly. Cool!

Please disregard my previous comment.

Wow! I'll try another CJK document!

Takashi.

[jira] Issue Comment Edited: (PDFBOX-654) Extracting CJK text

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844496#action_12844496 ] 

Takashi Komatsubara edited comment on PDFBOX-654 at 3/12/10 1:42 PM:
---------------------------------------------------------------------

Hi Andreas san and Atsuo san,

I have tested Atsuo san's patch and confirmed that it passes the Maven tests.
I have also successfully extracted Japanese text from many PDF files.
Currently, his patch gives the highest-quality text extraction from Japanese PDF files.

Unfortunately, when I tested his patch with Chinese PDF files,
the result was not good. Chinese text seems to be handled by a different type of implementation within the PDF file.

As one of the Japanese PDFBox developers, I would like you to include Japanese PDF files in the Maven tests.




[jira] Commented: (PDFBOX-654) Extracting CJK text

Posted by "Atsuo Ishimoto (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843915#action_12843915 ] 

Atsuo Ishimoto commented on PDFBOX-654:
---------------------------------------

Thank you!
I'll call for testing of trunk at the Solr workshop in Japan tonight.


[jira] Updated: (PDFBOX-654) Extracting CJK text

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takashi Komatsubara updated PDFBOX-654:
---------------------------------------

    Attachment: China.pdf

Here is the Chinese PDF file. I got it from a Chinese government site or somewhere similar.



[jira] Issue Comment Edited: (PDFBOX-654) Extracting CJK text

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844496#action_12844496 ] 

Takashi Komatsubara edited comment on PDFBOX-654 at 3/12/10 1:43 PM:
---------------------------------------------------------------------

Hi Andreas san and Atsuo san,

I have tested Atsuo san's patch and confirmed that it passes the Maven tests.
I have also successfully extracted Japanese text from many PDF files.
Currently, his patch gives the highest-quality text extraction from Japanese PDF files.

Unfortunately, when I tested his patch with Chinese PDF files, I got bad results.
Chinese text seems to be handled by a different type of implementation within the PDF file.

As one of the Japanese PDFBox developers, I would like you to include Japanese PDF files in the Maven tests.




[jira] Resolved: (PDFBOX-654) Extracting CJK text

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-654.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1.0

Wow, that's really a great improvement. I've applied the patch in revision 921494. All text extraction tests still pass. As a test for the patch, I've extracted the text from the document attached to PDFBOX-420. I'm not really able to read the result, but I've compared the "pictures" in the text file with those displayed in Acrobat, and it looks great.

Thanks to Atsuo for the contribution.


[jira] Commented: (PDFBOX-654) Extracting CJK text

Posted by "Atsuo Ishimoto (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844504#action_12844504 ] 

Atsuo Ishimoto commented on PDFBOX-654:
---------------------------------------

Hmm. Could you send me the Chinese PDF files you tried? I'll take a look at them.


[jira] Commented: (PDFBOX-654) Extracting CJK text

Posted by "Atsuo Ishimoto (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844854#action_12844854 ] 

Atsuo Ishimoto commented on PDFBOX-654:
---------------------------------------

Thank you for the file. I cannot read Chinese, but the characters look
like they are being extracted correctly to me. Could you be more specific
about the problem you found?


[jira] Updated: (PDFBOX-654) Extracting CJK text

Posted by "Atsuo Ishimoto (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Atsuo Ishimoto updated PDFBOX-654:
----------------------------------

    Attachment: identity-h.patch


[jira] Commented: (PDFBOX-654) Extracting CJK text

Posted by "Takashi Komatsubara (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844496#action_12844496 ] 

Takashi Komatsubara commented on PDFBOX-654:
--------------------------------------------

Hi Andreas san and Atsuo san,

I have tested Atsuo san's patch and confirmed that it passes the Maven tests.
I have also successfully extracted Japanese text from many PDF files.
Currently, his patch gives the highest-quality text extraction from Japanese PDF files.

Unfortunately, when I tested his patch with Chinese PDF files,
the result was not good. Chinese text seems to be handled by a different type of implementation within the PDF file.

As one of the Japanese PDFBox developers, I would like you to include Japanese PDF files in the Maven tests.


