Posted to dev@pdfbox.apache.org by "Tony Bray (JIRA)" <ji...@apache.org> on 2017/05/08 17:02:04 UTC

[jira] [Created] (PDFBOX-3782) WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho

Tony Bray created PDFBOX-3782:
---------------------------------

             Summary: WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho
                 Key: PDFBOX-3782
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3782
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 2.0.4
         Environment: Java/Tika
            Reporter: Tony Bray
            Priority: Minor
         Attachments: Test doc - Japanese writing system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - Kanji Hiragana Katakana.txt

I have a PDF document from which I am extracting content with Tika/PDFBox.  In several places the extracted text loses its whitespace, which causes a tokenization problem for indexing/searching.

I have attached the original document and the text output.  If you search (Ctrl+F) the text output for "Another example", you will see that there is no space between "is" and the Japanese text.  The same problem appears at the end of the sentence, in whichmeans“eraser”:
Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
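
To show why the missing spaces matter for indexing, here is a minimal sketch that splits the extracted line above on whitespace. It assumes a plain whitespace tokenizer, which is a simplification of what a real search analyzer does. The class name WhitespaceTokens and its tokenize method are hypothetical and are not part of Tika or PDFBox.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Tika or PDFBox API.
public class WhitespaceTokens {

    // Split the text on runs of whitespace, as a simple whitespace tokenizer would.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The line exactly as it appears in the attached text output.
        String extracted =
            "Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”";

        // Print one token per line so the fused tokens are easy to spot.
        for (String token : tokenize(extracted)) {
            System.out.println(token);
        }
    }
}
```

With this input, "is" and the Japanese word come out as one token, and whichmeans“eraser” comes out as another, so a search for "is", "means", or "eraser" as separate terms would not match under this kind of tokenization.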

During extraction I also get the warning "WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho", but I have been unable to find any information about it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
