Posted to dev@tika.apache.org by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org> on 2011/10/19 13:10:10 UTC

[jira] [Assigned] (TIKA-724) PDF text sometimes has extra space between letters

     [ https://issues.apache.org/jira/browse/TIKA-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-724:
---------------------------------------

    Assignee: Michael McCandless
    
> PDF text sometimes has extra space between letters
> --------------------------------------------------
>
>                 Key: TIKA-724
>                 URL: https://issues.apache.org/jira/browse/TIKA-724
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: extraSpaces.pdf
>
>
> I have a PDF with simple text "Here is some formatted text", but when
> I extract with Tika I get extra spaces inserted:
> {noformat}
> H e re  i s  so me  fo rma tte d  te x t
> {noformat}
> When I created the text in this PDF (I used the PDFpen tool on OS X),
> I set the style of the text to "loosen" (i.e., increase the space
> slightly between the letters), so I think Tika (PDFBox) is trying to
> "respect" that whitespace. It would be nice to be able to turn this
> off (if it won't mess up other places where we DO want the whitespace).
> When I copy/paste the text out of the PDF, it is copied correctly.
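
The suspected cause described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PDFBox's actual algorithm: the functions `layout` and `extract`, the glyph widths, and the `tolerance` parameter are all hypothetical. It assumes the extractor sees positioned glyphs only and emits a space whenever the horizontal gap between two glyphs exceeds a fraction of the glyph width. (The real output in the report is less regular than what this model produces.)

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, width), in arbitrary units.
Glyph = Tuple[str, float, float]


def layout(text: str, char_width: float = 1.0,
           letter_spacing: float = 0.0,
           space_width: float = 0.6) -> List[Glyph]:
    """Place glyphs left to right; word spaces become gaps, not glyphs."""
    glyphs: List[Glyph] = []
    x = 0.0
    for ch in text:
        if ch == " ":
            # A word space only advances the pen position.
            x += space_width + letter_spacing
            continue
        glyphs.append((ch, x, char_width))
        x += char_width + letter_spacing
    return glyphs


def extract(glyphs: List[Glyph], tolerance: float = 0.3) -> str:
    """Rebuild text, inserting a space when a gap exceeds tolerance * width."""
    out: List[str] = []
    prev_end = None
    for ch, x, width in glyphs:
        if prev_end is not None and (x - prev_end) > tolerance * width:
            out.append(" ")
        out.append(ch)
        prev_end = x + width
    return "".join(out)


if __name__ == "__main__":
    # Normal spacing: only the real word gap is wide enough.
    print(extract(layout("Here is")))                      # Here is
    # "Loosened" text: every letter gap now crosses the threshold.
    print(extract(layout("Here is", letter_spacing=0.4)))  # H e r e i s
    # A larger tolerance ignores the loosened gaps but keeps the word gap.
    print(extract(layout("Here is", letter_spacing=0.4), tolerance=0.5))  # Here is
```

In this model a fixed threshold cannot tell loosened letter spacing from a word break, which is why making the threshold adjustable (or disabling the heuristic) helps here but could drop wanted spaces in other documents.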
