Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/11/06 07:10:34 UTC

[jira] [Closed] (PDFBOX-1304) Text extraction meets "Could not parse predefined CMAP" and returns just a small part of the content containing garbage chars.

     [ https://issues.apache.org/jira/browse/PDFBOX-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-1304.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Works with 2.0 trunk.

> Text extraction meets "Could not parse predefined CMAP" and returns just a small part of the content containing garbage chars.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1304
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1304
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>         Environment: Win7 32bits
>            Reporter: Huan LI
>             Fix For: 2.0.0
>
>         Attachments: fj.pdf, fj.txt
>
>
> I'm using pdfbox-1.6.0 to extract text from a Chinese PDF file (see the attachment "fj.pdf").
>  
> The extraction code looks like this:
> [code]
>     stripper = new PDFTextStripper(encoding);
>     txt = stripper.getText(_pdfDoc);
> [/code] 
> When getText() runs, the console prints:
> [console]
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUO1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKUF1-UCS2'
> May 06, 2012 4:13:51 PM org.apache.pdfbox.pdmodel.font.PDCIDFont determineEncoding
> SEVERE: Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'
> [/console]
> After getText() returns, txt contains only a small part of the PDF content (much of it is missing) along with garbage characters such as "犖犑狌犣犎犗犝犔犻犺犅" (see the attachment "fj.txt").
>  
> I've heard that "org.apache.pdfbox.cos.COSString.java" had some errors in pdfbox-0.7.3. Has COSString.java been corrected in 1.6.0?
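
The console output quoted in the report repeats the same error many times, but only four distinct CMap names are involved. The sketch below is one way to collapse such a log into the distinct names and their counts. It is not part of PDFBox: the class name CMapErrorSummary and the method countMissingCMaps are hypothetical, and it uses only the Java standard library. It relies solely on the message text "Could not parse predefined CMAP file for '...'" as it appears in the report.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class: summarizes which predefined
// CMap names a log complains about.
final class CMapErrorSummary {

    // The CMap name is the text between the single quotes in the message.
    private static final Pattern CMAP_ERROR =
            Pattern.compile("Could not parse predefined CMAP file for '([^']+)'");

    // Returns each CMap name found in the log with its number of
    // occurrences, in first-seen order.
    static Map<String, Integer> countMissingCMaps(String log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = CMAP_ERROR.matcher(log);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A short excerpt of the messages quoted in the report.
        String log = String.join("\n",
                "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'",
                "Error: Could not parse predefined CMAP file for 'Founder-PKUE1-UCS2'",
                "Error: Could not parse predefined CMAP file for 'Founder-PKU2-UCS2'");
        countMissingCMaps(log).forEach(
                (name, count) -> System.out.println(name + " x" + count));
    }
}
```

A summary like this makes it easier to see which CMap resources the affected version could not load, without reading through every repeated log line.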



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)