Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2017/05/10 15:32:04 UTC
[jira] [Comment Edited] (PDFBOX-3782) Text extraction loses whitespace
[ https://issues.apache.org/jira/browse/PDFBOX-3782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004860#comment-16004860 ]
Tilman Hausherr edited comment on PDFBOX-3782 at 5/10/17 3:31 PM:
------------------------------------------------------------------
Can you tell what part "resisted" the extraction with the modified parameter?
was (Author: tilman):
Can you tell what part "resisted" the extraction?
> Text extraction loses whitespace
> --------------------------------
>
> Key: PDFBOX-3782
> URL: https://issues.apache.org/jira/browse/PDFBOX-3782
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.4, 2.0.5, 2.0.6
> Environment: Java/Tika
> Reporter: Tony Bray
> Priority: Minor
> Attachments: PDFBOX-3782-reduced.pdf, Test doc - Japanese writing system - Kanji Hiragana Katakana.pdf, Test doc - Japanese writing system - Kanji Hiragana Katakana.txt
>
>
> I have a PDF document from which I am extracting content with Tika/PDFBox. In several places the extracted content loses whitespace, which causes a tokenization problem for indexing/searching.
> I have attached the original document and the text output. If you search (Ctrl+F) the text document for "Another example", you will see that there is no space between "is" and the Japanese text. The same issue shows up in "whichmeans“eraser”" at the end of the sentence.
> Another example is消しゴム (Rō- maji: keshigomu) whichmeans“eraser”
> I get the warning "WARNING: No Unicode mapping for CID+0 (0) in font RGOFPX+IPAexMincho" during extraction but have been unable to find any information on it.
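
To help answer which parts of the output still lack whitespace, here is a minimal sketch that scans extracted text for places where a Latin letter directly touches a Japanese character (Han, Hiragana or Katakana). It uses only the Java standard library and does not call PDFBox or Tika. The class name MissingSpaceFinder and the method findJoins are hypothetical names chosen for illustration. The heuristic is an assumption about what counts as a suspicious join: it does not detect Latin-only joins such as "whichmeans", and it may flag text where the two scripts are meant to be adjacent.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tika.
public class MissingSpaceFinder {

    // True for the scripts that make up ordinary Japanese text.
    static boolean isJapanese(int codePoint) {
        UnicodeScript script = UnicodeScript.of(codePoint);
        return script == UnicodeScript.HAN
                || script == UnicodeScript.HIRAGANA
                || script == UnicodeScript.KATAKANA;
    }

    // True for letters of the Latin script (punctuation and digits are excluded).
    static boolean isLatinLetter(int codePoint) {
        return Character.isLetter(codePoint)
                && UnicodeScript.of(codePoint) == UnicodeScript.LATIN;
    }

    // Returns the char index of every position where a Latin letter and a
    // Japanese character are adjacent with nothing in between (either order).
    static List<Integer> findJoins(String text) {
        List<Integer> joins = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int current = text.codePointAt(i);
            int nextIndex = i + Character.charCount(current);
            if (nextIndex < text.length()) {
                int next = text.codePointAt(nextIndex);
                boolean latinThenJapanese = isLatinLetter(current) && isJapanese(next);
                boolean japaneseThenLatin = isJapanese(current) && isLatinLetter(next);
                if (latinThenJapanese || japaneseThenLatin) {
                    joins.add(nextIndex);
                }
            }
            i = nextIndex;
        }
        return joins;
    }

    public static void main(String[] args) {
        // The sample line from the report; non-ASCII characters are written as
        // Unicode escapes so the source compiles regardless of file encoding.
        String line = "Another example is\u6D88\u3057\u30B4\u30E0"
                + " (R\u014D- maji: keshigomu) whichmeans\u201Ceraser\u201D";
        for (int index : findJoins(line)) {
            System.out.println("possible missing space before index " + index);
        }
    }
}
```

Running this kind of check over the extracted .txt file before and after changing an extraction parameter would give a list of positions to compare, rather than relying on a manual search.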