Posted to dev@pdfbox.apache.org by "Liang Qu (JIRA)" <ji...@apache.org> on 2011/01/13 22:04:45 UTC
[jira] Created: (PDFBOX-941) extracting Japanese characters gives
garbage
extracting Japanese characters gives garbage
--------------------------------------------
Key: PDFBOX-941
URL: https://issues.apache.org/jira/browse/PDFBOX-941
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.4.0
Environment: java 1.6 on CentOS 64bit Linux and MacOSX 10.6
Reporter: Liang Qu
When extracting text from this PDF file, I got the following error, and the extracted text was gibberish.
44 [main] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'
PDFBox 1.2.1 worked fine with the same file; I wonder why 1.4.0 does not.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PDFBOX-941) extracting Japanese characters gives
garbage
Posted by "Liang Qu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Liang Qu updated PDFBOX-941:
----------------------------
Attachment: 1010gaiyou.pdf
Part of what I got with 1.4.0:
ظɾ݄
݄ ݄ ݄ ݄ ݄ ݄ ݄ ݄ ݄
धཁऀ ࣮ ࣮ ࣮ ࣮ ݟ௨͠ ࣮ ࣮ ࣮ ࣮
ड૯ֹ ˚ ˚ ˚
ຽध ˚ ˚ ˚ ˚
With 1.2.1:
(年度)
(10億円) 月次
四半期(月平均)
四半期(見通し)
17 18 19 20 21 22
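To see what kind of garbage the 1.4.0 output is, it can help to print the code points of the extracted text instead of looking at the rendered glyphs. The following is a minimal sketch using only the JDK; the class and method names are made up for illustration, and the two sample characters are written as escapes on the assumption that the repeated character in the 1.4.0 sample is U+0744.

```java
import java.util.StringJoiner;

class CodePointDump {
    // Formats every Unicode code point of the input as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringJoiner out = new StringJoiner(" ");
        text.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed to be the character repeated in the 1.4.0 sample.
        System.out.println(dump("\u0744"));
        // The kanji for "month", which appears in the 1.2.1 sample.
        System.out.println(dump("\u6708"));
    }
}
```

0x0744 is 1860 in decimal. If the Adobe-Japan1 CID for 月 is 1860 (an assumption that is not verified anywhere in this thread), that would suggest 1.4.0 emits the raw character IDs as if they were Unicode values, which fits a failed lookup in the 'Adobe-Japan1-UCS2' mapping.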
> extracting Japanese characters gives garbage
> --------------------------------------------
>
> Key: PDFBOX-941
> URL: https://issues.apache.org/jira/browse/PDFBOX-941
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.4.0
> Environment: java 1.6 on CentOS 64bit Linux and MacOSX 10.6
> Reporter: Liang Qu
> Attachments: 1010gaiyou.pdf
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> when extracting text from this pdf file, I got this exception, and the text extracted was gibberish.
> 44 [main] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'
> PDFBox 1.2.1 worked fine with the same file, I wonder why 1.4.0 could not.
[jira] Resolved: (PDFBOX-941) extracting Japanese characters gives
garbage
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-941.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.5.0
Assignee: Andreas Lehmkühler
I added the missing ToUnicode mappings from Adobe. They were lost in the reorganisation of those files (see PDFBOX-494).
I downloaded the mapping files from http://opensource.adobe.com/wiki/display/pdfmapping/Downloads. They are licensed under the same terms as the other CMap files.
Furthermore, I improved the encoding of Type0 fonts.
Both text extraction and rendering now work fine as of revision 1059595.
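As a rough illustration of why a missing ToUnicode mapping turns the output into gibberish, here is a small sketch of a code-to-Unicode lookup. It is not PDFBox's implementation: the class name, the single-section parser, the pass-through fallback and the sample entry `<0744> <6708>` are all assumptions made for the example, and a real CMap also has `beginbfrange` sections that this sketch ignores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ToUnicodeSketch {
    // Matches one "<source code> <Unicode value>" pair of 4-digit hex numbers.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]{4})>\\s+<([0-9A-Fa-f]{4})>");

    // Collects the pairs between the first "beginbfchar" and "endbfchar".
    static Map<Integer, Character> parseBfChar(String cmap) {
        Map<Integer, Character> table = new HashMap<>();
        int start = cmap.indexOf("beginbfchar");
        int end = cmap.indexOf("endbfchar");
        if (start < 0 || end < start) {
            return table; // no usable section: empty table
        }
        Matcher m = PAIR.matcher(cmap.substring(start, end));
        while (m.find()) {
            table.put(Integer.parseInt(m.group(1), 16),
                      (char) Integer.parseInt(m.group(2), 16));
        }
        return table;
    }

    // Maps each code through the table; unmapped codes are passed through
    // unchanged (an illustrative fallback, not necessarily what PDFBox does).
    static String decode(int[] codes, Map<Integer, Character> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Character mapped = table.get(code);
            sb.append(mapped != null ? mapped.charValue() : (char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical one-entry mapping, not copied from Adobe's file.
        String sample = "1 beginbfchar\n<0744> <6708>\nendbfchar\n";
        int[] codes = {0x0744};
        String withTable = decode(codes, parseBfChar(sample));
        String withoutTable = decode(codes, new HashMap<>());
        System.out.printf("with mapping:    U+%04X%n", (int) withTable.charAt(0));
        System.out.printf("without mapping: U+%04X%n", (int) withoutTable.charAt(0));
    }
}
```

With the table loaded, the code resolves to a real character; with an empty table, the same code falls through untouched, which is one plausible way to end up with text like the 1.4.0 sample in this issue.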
[jira] [Commented] (PDFBOX-941) extracting Japanese characters
gives garbage
Posted by "Kevin Clark (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118855#comment-13118855 ]
Kevin Clark commented on PDFBOX-941:
------------------------------------
I'm seeing this with the Tika 0.10 release, which uses PDFBox 1.6.0:
2011-10-01 16:15:43,516 (53344917) [Parser-thread-1] ERROR org.apache.pdfbox.pdmodel.font.PDFont - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PDFBOX-941) extracting Japanese characters
gives garbage
Posted by "Andreas Lehmkühler (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118998#comment-13118998 ]
Andreas Lehmkühler commented on PDFBOX-941:
-------------------------------------------
I'm quite sure that this is not related to PDFBox, as it works fine here (without using Tika). Perhaps a misconfigured environment (missing resource files)?
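One way to test the "missing resource files" theory is to ask the class loader whether the CMap is visible at all in the environment that logs the error. The sketch below uses only the JDK. The resource path is an assumption about where PDFBox 1.x bundles its predefined CMaps and should be checked against the jar actually in use; the class name is made up for the example.

```java
import java.net.URL;

class CMapResourceCheck {
    // ASSUMPTION: location of the predefined CMap inside the PDFBox 1.x jar.
    static final String ASSUMED_CMAP_PATH =
            "org/apache/pdfbox/resources/cmap/Adobe-Japan1-UCS2";

    // Returns the URL of the resource as seen by the current thread's
    // class loader, or null if it is not visible on the classpath.
    static URL locate(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourcePath);
    }

    public static void main(String[] args) {
        URL url = locate(ASSUMED_CMAP_PATH);
        if (url != null) {
            System.out.println("found: " + url);
        } else {
            System.out.println("not found on classpath: " + ASSUMED_CMAP_PATH);
        }
    }
}
```

Run it with the same classpath as the failing application (for example, the Tika setup mentioned above). A "not found" result would point to a packaging problem rather than a parsing bug, while a "found" result shows which jar the resource is loaded from.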