Posted to dev@pdfbox.apache.org by "Liang Qu (JIRA)" <ji...@apache.org> on 2011/01/13 22:08:45 UTC

[jira] Updated: (PDFBOX-941) extracting Japanese characters gives garbage

     [ https://issues.apache.org/jira/browse/PDFBOX-941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang Qu updated PDFBOX-941:
----------------------------

    Attachment: 1010gaiyou.pdf

Part of what I got with 1.4.0:

ظɾ݄ ೥ ೥ ೥ ೥ ೥ ೥ ೥ ೥
݄ ݄ ݄ ݄ ݄ ݄ ݄ ݄ ݄
धཁऀ ࣮੷ ࣮੷ ࣮੷ ࣮੷ ݟ௨͠ ࣮੷ ࣮੷ ࣮੷ ࣮੷
ड஫૯ֹ   ˚  ˚   ˚ 
ຽध   ˚  ˚   ˚ ˚

The same file with 1.2.1:
 (年度)
 (10億円) 月次
 四半期(月平均)
 四半期(見通し)
 17 18 19 20 21 22

> extracting Japanese characters gives garbage
> --------------------------------------------
>
>                 Key: PDFBOX-941
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-941
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>         Environment: java 1.6 on CentOS 64bit Linux and MacOSX 10.6
>            Reporter: Liang Qu
>         Attachments: 1010gaiyou.pdf
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When extracting text from this PDF file, the following error was logged and the extracted text was gibberish:
> 44 [main] ERROR org.apache.pdfbox.pdmodel.font.PDFont  - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'
> PDFBox 1.2.1 worked fine with the same file; I wonder why 1.4.0 does not.
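
One possible reading of the garbage, not confirmed anywhere in this report: if the 'Adobe-Japan1-UCS2' CMap cannot be parsed, the raw CIDs may be emitted as if they were Unicode code points. The Python sketch below checks that idea for the two characters 年 and 月, which appear in the 1.2.1 output. It assumes that Adobe-Japan1 assigns the JIS X 0208 level-1 kanji consecutive CIDs starting at CID 1125 for kuten 16-01. The helper names (kuten, level1_cid, cid_as_code_point) are hypothetical and are not part of PDFBox.

```python
# Assumption (not taken from the report): Adobe-Japan1 gives the JIS X 0208
# level-1 kanji consecutive CIDs, starting at CID 1125 for kuten 16-01.
FIRST_LEVEL1_CID = 1125
FIRST_LEVEL1_ROW = 16
LAST_LEVEL1_ROW = 47
CELLS_PER_ROW = 94


def kuten(ch):
    """Return the JIS X 0208 (row, cell) of a character via its EUC-JP bytes."""
    data = ch.encode("euc_jp")
    if len(data) != 2:
        raise ValueError("not a JIS X 0208 double-byte character: %r" % ch)
    # EUC-JP stores row and cell as (0xA0 + row, 0xA0 + cell).
    return data[0] - 0xA0, data[1] - 0xA0


def level1_cid(ch):
    """Hypothetical helper: CID of a level-1 kanji under the assumption above."""
    row, cell = kuten(ch)
    if not FIRST_LEVEL1_ROW <= row <= LAST_LEVEL1_ROW:
        raise ValueError("not a level-1 kanji: %r" % ch)
    return FIRST_LEVEL1_CID + (row - FIRST_LEVEL1_ROW) * CELLS_PER_ROW + (cell - 1)


def cid_as_code_point(ch):
    """Code point that results if the CID is used directly, with no CMap lookup."""
    return "U+%04X" % level1_cid(ch)


if __name__ == "__main__":
    for ch in "年月":
        print(ch, level1_cid(ch), cid_as_code_point(ch))
```

Under that assumption 年 maps to CID 3301 (U+0CE5, in the Kannada block) and 月 to CID 1860 (U+0744, in the Syriac block). If I read the 1.4.0 sample correctly, those are the two characters that repeat in its first lines, which would point at the failed CMap lookup rather than at the PDF itself.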

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.