Posted to dev@pdfbox.apache.org by "Oliver Sauder (JIRA)" <ji...@apache.org> on 2010/12/10 14:12:00 UTC

[jira] Commented: (PDFBOX-328) PDFTextStripper not handling some Japanese

    [ https://issues.apache.org/jira/browse/PDFBOX-328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970158#action_12970158 ] 

Oliver Sauder commented on PDFBOX-328:
--------------------------------------

I have the same issue with the attached PDFTransform_japanese.pdf file.

With PDFBox 1.3.1, the following error is logged: PDFont - Error: Could not parse predefined CMAP file for 'Adobe-Japan1-UCS2'.

When I copy the cmap files from the poppler-data package (http://packages.debian.org/lenny/poppler-data) into the PDFBox resource package org.apache.pdfbox.resources.cmap, the error message disappears, but the output is still gibberish (see PDFTransform_japanese_out.txt).
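
To confirm that a copied cmap file is actually visible to the JVM, a check along these lines can help. It is only a sketch: the class and method names are made up for illustration, and it assumes the files sit under the resource path that corresponds to the package named above. It uses only the standard ClassLoader API and says nothing about whether PDFBox can parse the file once it is found.

```java
import java.net.URL;

// Hypothetical helper, not part of PDFBox.
class CMapResourceCheck {
    // Assumed classpath folder for the package org.apache.pdfbox.resources.cmap.
    static final String CMAP_FOLDER = "org/apache/pdfbox/resources/cmap/";

    // Builds the classpath location for a cmap name such as "Adobe-Japan1-UCS2".
    static String resourcePath(String cmapName) {
        return CMAP_FOLDER + cmapName;
    }

    // Returns true if the class loader that loaded this class can see the resource.
    static boolean isOnClasspath(String cmapName) {
        ClassLoader loader = CMapResourceCheck.class.getClassLoader();
        URL url = loader.getResource(resourcePath(cmapName));
        return url != null;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "Adobe-Japan1-UCS2";
        String status = isOnClasspath(name) ? "found" : "NOT found";
        System.out.println(resourcePath(name) + ": " + status);
    }
}
```

Run it with the same classpath as the application. A "found" result only rules out a missing file; it does not explain the gibberish output.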

> PDFTextStripper not handling some Japanese
> ------------------------------------------
>
>                 Key: PDFBOX-328
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-328
>             Project: PDFBox
>          Issue Type: Bug
>            Priority: Minor
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552833&aid=1841058
> Originally submitted by sflaumen on 2007-11-29 07:33.
> Using this code sequence: 
>     PDDocument document = PDDocument.load(stream);
>     PDFTextStripper stripper = new PDFTextStripper();
>     String contents = stripper.getText(document);
> some Japanese documents are handled properly. This is shown by viewing the chars in the String "contents".
> However, other Japanese documents produce garbage non-Japanese characters as viewed in the String contents. 
> The ones that are not handled properly in PDFTextStripper display a prompt when opened in Acrobat Reader, which says that a Japanese language support pack needs to be installed to view the document properly. The ones that are handled properly display Japanese characters fine when viewed through Acrobat Reader. Installing the language support pack is not a solution, since it would only resolve the display in Acrobat Reader. This code needs to run on a Unix server, so even if the support pack helped on a PC (unlikely), it would have no effect on the task when run on Unix.
> This appears to be an encoding issue. However, unlike similar issues that have been reported, the above code completes successfully; only the results are wrong, as described above.
> Attached is an example of a PDF file that is not handled properly by PDFTextStripper and requires a Japanese language pack to view in Acrobat Reader.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552833&aid=1841058&file_id=256615
> JS51ZX3PWT1G.pdf (application/pdf), 84799 bytes
> Not handled properly by PDFTextStripper 
> [comment on SourceForge]
> Originally sent by sflaumen.
> After looking over the code in PDFBox, I would like to suggest that this problem is caused by not having the latest cmap files in the PDFBox cmap folder. This folder contains cmap files only through the Adobe-Japan1-4 character collection, but Adobe has since added the Adobe-Japan1-5 and Adobe-Japan1-6 collections. See Adobe Technical Note #5078.
> Also, I downloaded the Japanese font support pack for Acrobat Reader 8.0, which did resolve the display issue for viewing this PDF document. You can find the list of cmap files in the Resources folder for Acrobat after the download. However, copying these into the PDFBox cmap folder did not solve the problem. I think it is because the identity cmap files, which are needed to do the conversion, are missing. See the 00_ReadMe.pdf in the PDFBox cmaps folder. Please let me know if I'm on the right track. This technology is new to me. Thanks, Steve
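
To judge whether extracted text is Japanese or gibberish without inspecting the characters by eye, one rough approach is to measure how many non-whitespace code points fall into Unicode blocks commonly used for Japanese. The sketch below is an assumption-laden heuristic, not anything PDFBox provides: the class name, the method names and the choice of blocks are all illustrative, and the input is assumed to be the String returned by stripper.getText(document).

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical diagnostic helper, not part of PDFBox.
class JapaneseTextCheck {
    // True if the code point lies in one of a few blocks typical of Japanese text.
    // The block list is a heuristic choice and is not exhaustive.
    static boolean isJapanese(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.HIRAGANA
            || block == UnicodeBlock.KATAKANA
            || block == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    // Fraction of non-whitespace code points that look Japanese (0.0 for empty input).
    static double japaneseRatio(String text) {
        long total = 0;
        long japanese = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            total++;
            if (isJapanese(cp)) {
                japanese++;
            }
        }
        return total == 0 ? 0.0 : (double) japanese / total;
    }
}
```

A ratio near zero for a document known to be Japanese suggests the character codes were not mapped to Unicode correctly. A high ratio does not prove the text is right, only that it lands in the expected blocks.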
