Posted to dev@tika.apache.org by "Tyler Palsulich (JIRA)" <ji...@apache.org> on 2015/06/29 10:53:04 UTC

[jira] [Closed] (TIKA-1552) Pdf document parser

     [ https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-1552.
---------------------------------
    Resolution: Not A Problem

Marking this as not a problem, since Adobe Reader also adds white space.

> Pdf document parser
> -------------------
>
>                 Key: TIKA-1552
>                 URL: https://issues.apache.org/jira/browse/TIKA-1552
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Konstantin
>         Attachments: 2014_US_Federal_Budget.pdf, issue.jpg
>
>
> Hello,
> We found that when a PDF document has marked text inside a frame (table), Tika inserts tabs between words after parsing.
> Original text from attached file:
> Provides $17.7 billion in discretionary funding for the National Aeronautics and Space
> Parsed text (JIRA removed the tabs, so I will use -> symbols instead):
> •        Provides -> $17.7 -> billion->in->discretionary->funding->for->the->National->Aeronautics->and->Space
> Please take a look at the attached screenshot.
> The left side shows the parsed text in a text editor.
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)