Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/05/10 15:27:04 UTC

[jira] [Updated] (PDFBOX-3782) Text extraction loses whitespace

     [ https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr updated PDFBOX-3782:
------------------------------------
    Affects Version/s: 2.0.6
                       2.0.5

> Text extraction loses whitespace
> --------------------------------
>
>                 Key: PDFBOX-3782
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3782
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.4, 2.0.5, 2.0.6
>         Environment: Java/Tika
>            Reporter: Tony Bray
>            Priority: Minor
>         Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - Kanji Hiragana Katakana.txt
>
>
> I have a PDF document from which I am extracting content with Tika/PDFBox.  In several places the extracted content loses its whitespace, which causes a tokenization problem for indexing/searching.
> I have attached the original document and the text output.  If you search (Ctrl+F) the text output for "Another example", you will see that there is no space between "is" and the Japanese text.  The same issue shows up in "whichmeans"eraser"" at the end of the sentence.
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho" during extraction but have been unable to find any information on it.
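
To illustrate the tokenization problem described above, here is a minimal sketch that uses only the Java standard library. It does not use Tika or PDFBox and is not the tokenizer of any particular indexer; the class name WhitespaceLossDemo and both helper methods are hypothetical names chosen for this example. It splits a line on whitespace and flags tokens that mix Latin letters with Han, Hiragana, or Katakana characters, which is one possible symptom of a lost space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only (not part of Tika or PDFBox).
class WhitespaceLossDemo {

    // Splits text on runs of whitespace, as a simple whitespace tokenizer would.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns true if the token contains both Latin-script characters and
    // Han, Hiragana, or Katakana characters.
    static boolean mixesLatinAndJapanese(String token) {
        boolean latin = false;
        boolean japanese = false;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.LATIN) {
                latin = true;
            } else if (script == Character.UnicodeScript.HAN
                    || script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA) {
                japanese = true;
            }
            i += Character.charCount(cp);
        }
        return latin && japanese;
    }

    public static void main(String[] args) {
        // Start of the sample line from the report; the Japanese word is
        // written with Unicode escapes so the source file stays ASCII.
        String extracted = "Another example is\u6D88\u3057\u30B4\u30E0 (R\u014D- maji: keshigomu)";
        for (String token : tokenize(extracted)) {
            if (mixesLatinAndJapanese(token)) {
                System.out.println("Possible missing space in token: " + token);
            }
        }
    }
}
```

With the sample line, "is" and the Japanese word end up in one token, so a search for either term alone would not match it in a whitespace-tokenized index. The script check is only a heuristic and would not detect a purely Latin case such as "whichmeans".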



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
