Posted to dev@pdfbox.apache.org by "Keiji Suzuki (JIRA)" <ji...@apache.org> on 2010/08/28 08:58:53 UTC

[jira] Created: (PDFBOX-805) Extracted ASCII text in CJK document is malformed

Extracted ASCII text in CJK document is malformed
-------------------------------------------------

                 Key: PDFBOX-805
                 URL: https://issues.apache.org/jira/browse/PDFBOX-805
             Project: PDFBox
          Issue Type: Bug
          Components: FontBox
    Affects Versions: 1.2.1
            Reporter: Keiji Suzuki


When I run ExtractText on a CJK PDF document that also contains ASCII text, only the ASCII text is malformed. This does not occur in version 1.1.0.
The attached patch fixes it. I have also attached an example PDF.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-805) Extracted ASCII text in CJK document is malformed

Posted by "Keiji Suzuki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keiji Suzuki updated PDFBOX-805:
--------------------------------

    Attachment: cjk.pdf
                CMapParser.java.patch

The patch is for org/apache/fontbox/cmap/CMapParser.java in trunk. The sample PDF was generated with iText sample code (http://www.1t3xt.info/examples/browse/?page=example&id=142).

[jira] Commented: (PDFBOX-805) Extracted ASCII text in CJK document is malformed

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905004#action_12905004 ] 

Andreas Lehmkühler commented on PDFBOX-805:
-------------------------------------------

It is always a good idea to embed all used fonts in the PDF. Otherwise you can't be sure that all needed fonts are installed on the destination platform. E.g. here on my German WinXP, Acrobat Reader doesn't show anything. Could you please recreate the PDF with embedded fonts, if possible?

[jira] Commented: (PDFBOX-805) Extracted ASCII text in CJK document is malformed

Posted by "Keiji Suzuki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906235#action_12906235 ] 

Keiji Suzuki commented on PDFBOX-805:
-------------------------------------

I created another PDF that contains the same paragraph in two fonts. One font is not embedded, but it may display in Acrobat Reader once the Asian Font Pack is installed. The other font is embedded.

On extraction, the paragraph without the embedded font is malformed and the paragraph with the embedded font is correct. With version 1.1.0 and with the patched version 1.2.1 of FontBox, both paragraphs are correct.

[jira] Updated: (PDFBOX-805) Extracted ASCII text in CJK document is malformed

Posted by "Keiji Suzuki (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keiji Suzuki updated PDFBOX-805:
--------------------------------

    Attachment: cjk.pdf
                extracted.txt

[jira] Resolved: (PDFBOX-805) Extracted ASCII text in CJK document is malformed

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-805.
---------------------------------------

    Fix Version/s: 1.3.0
       Resolution: Fixed

Fixed in revision 992763.

Keiji's patch won't work in every situation. It's correct that the calculation of the number of mappings was wrong. But the remaining part would always lead to an identity mapping for cidranges. Still, that approach was a good pointer to where to look for a solution. I improved the CMapParser (see PDFBOX-11), including the "number of mappings" fix, and PDFont.encode().

Thanks for the report and the investigations.
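
For readers unfamiliar with cidranges, here is a minimal sketch of the arithmetic discussed in this comment. It is not the FontBox CMapParser code and not the attached patch. The class name CidRangeSketch, its methods, and the sample range are hypothetical and chosen only for illustration. It assumes the usual CMap convention that a cidrange entry "<start> <end> firstCid" covers the codes start through end inclusive and assigns them consecutive CIDs beginning at firstCid.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: not the FontBox CMapParser implementation. */
class CidRangeSketch {

    /** Number of codes covered by a range that is inclusive on both ends. */
    static int mappingCount(int start, int end) {
        if (end < start) {
            throw new IllegalArgumentException("end must not be smaller than start");
        }
        return end - start + 1;
    }

    /** Expands one "start end firstCid" entry into individual code -> CID pairs. */
    static Map<Integer, Integer> expand(int start, int end, int firstCid) {
        Map<Integer, Integer> codeToCid = new LinkedHashMap<>();
        int count = mappingCount(start, end);
        for (int i = 0; i < count; i++) {
            // Each code is offset from firstCid by its distance from start.
            codeToCid.put(start + i, firstCid + i);
        }
        return codeToCid;
    }

    public static void main(String[] args) {
        // Hypothetical range: ASCII codes 0x20..0x7E mapped to CIDs 231..325.
        Map<Integer, Integer> m = expand(0x20, 0x7E, 231);
        System.out.println(m.size());    // prints 95
        System.out.println(m.get(0x41)); // prints 264
    }
}
```

In this sketch an identity mapping is just the special case firstCid == start, for example expand(0x20, 0x7E, 0x20). A parser that produced that result for every cidrange would ignore the CID offset given in the entry.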