Posted to dev@tika.apache.org by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org> on 2011/10/19 13:12:10 UTC
[jira] [Updated] (TIKA-724) PDF text sometimes has extra space between letters
[ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated TIKA-724:
------------------------------------
Attachment: TIKA-724.patch
Patch.
> PDF text sometimes has extra space between letters
> --------------------------------------------------
>
> Key: TIKA-724
> URL: https://issues.apache.org/jira/browse/TIKA-724
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: TIKA-724.patch, extraSpaces.pdf
>
>
> I have a PDF with simple text "Here is some formatted text", but when
> I extract with Tika I get extra spaces inserted:
> {noformat}
> H e re i s so me fo rma tte d te x t
> {noformat}
> When I created the text in this PDF (I used the PDFpen tool on OS X),
> I set the style of the text to "loosen" (i.e., increase space slightly
> between the letters), so I think Tika (PDFBox) is trying to "respect"
> that whitespace, but it'd be nice to turn this off (if it won't mess
> up other places where we DO want the whitespace).
> When I copy/paste the text, it is copied correctly.
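
The hypothesis in the description (loosened letter spacing being read as word gaps) can be illustrated with a small, self-contained sketch. This is not PDFBox's or Tika's actual algorithm; the class GlyphSpacingSketch, its methods, and the numeric widths and thresholds are all hypothetical, chosen only to show how a gap threshold that is too small turns letter spacing into spaces. It also assumes every glyph has the same width, so it inserts a space at every gap, whereas real fonts have varying widths, which would give an irregular result like the one reported.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from PDFBox or Tika.
class GlyphSpacingSketch {

    /** One positioned glyph: its character, left x coordinate, and advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Lays text out left to right with a fixed glyph width (an assumption),
     * adding extraSpacing after every character to mimic a "loosened" style.
     * Spaces advance the pen but emit no glyph, as if the PDF only positioned letters.
     */
    static List<Glyph> layout(String text, double glyphWidth, double extraSpacing) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0;
        for (char c : text.toCharArray()) {
            if (c != ' ') {
                glyphs.add(new Glyph(c, x, glyphWidth));
            }
            x += glyphWidth + extraSpacing;
        }
        return glyphs;
    }

    /** Rebuilds text, inserting a space wherever the gap between glyphs exceeds gapThreshold. */
    static String extract(List<Glyph> glyphs, double gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapThreshold) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Glyph width 10, extra letter spacing 3: letter gaps are 3, the word gap is 16.
        List<Glyph> glyphs = layout("Here is", 10.0, 3.0);
        System.out.println(extract(glyphs, 2.0)); // prints "H e r e i s"
        System.out.println(extract(glyphs, 5.0)); // prints "Here is"
    }
}
```

In this toy model the fix is simply a threshold above the letter spacing but below the word gap. Whether a similar tolerance exists in the real extractor, and whether raising it would hurt documents that rely on small gaps, is exactly the open question in the description.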