Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/07/01 17:33:04 UTC

[jira] [Commented] (TIKA-1671) Wrapped lines in PDF files not processed correctly

    [ https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610474#comment-14610474 ] 

Tim Allison commented on TIKA-1671:
-----------------------------------

Thank you for raising this. I think TIKA-1641 covers the same type of issue. If you could run the PDF through the standalone PDFBox app's ExtractText and check whether you get the same result, that would be great. If the output is the same, then unfortunately recombining lines is beyond the scope of Tika. If PDFBox gives you what you want, then there may be something in Tika that we can fix.


> Wrapped lines in PDF files not processed correctly
> --------------------------------------------------
>
>                 Key: TIKA-1671
>                 URL: https://issues.apache.org/jira/browse/TIKA-1671
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: James Baker
>              Labels: pdf, wrapping
>         Attachments: Test Document.pdf
>
>
> Text that wraps over multiple lines in PDF documents is not extracted correctly by Tika. The expected behaviour would be for it to be extracted as a single line, but instead a line break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended form, as it is not known whether a line break in the extracted text is one that appeared in the document or one that was inserted by Tika.
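
To make the ambiguity described above concrete, here is a minimal sketch of the kind of downstream post-processing a caller might attempt on already-extracted text. LineUnwrapper and its unwrap method are hypothetical names, not part of Tika or PDFBox, and the rule it applies (join a line with the next one unless it ends in sentence punctuation or borders a blank line) is only an assumed heuristic.

```java
// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class LineUnwrapper {
    private LineUnwrapper() {}

    /**
     * Joins lines that look like soft wraps with a single space.
     * A line break is kept when either neighbouring line is blank
     * or when the current line ends in sentence punctuation.
     */
    static String unwrap(String extracted) {
        // \R matches any line break sequence (Java 8+); -1 keeps trailing empty lines.
        String[] lines = extracted.split("\\R", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            out.append(line);
            if (i == lines.length - 1) {
                break; // nothing follows the last line
            }
            String next = lines[i + 1].trim();
            boolean keepBreak = line.isEmpty() || next.isEmpty() || endsSentence(line);
            out.append(keepBreak ? '\n' : ' ');
        }
        return out.toString();
    }

    // Assumed heuristic: these characters mark an intentional line end.
    private static boolean endsSentence(String line) {
        char last = line.charAt(line.length() - 1);
        return last == '.' || last == '!' || last == '?' || last == ':';
    }
}
```

The heuristic rejoins "The quick brown" and "fox jumps." as intended, but it also merges a heading such as "Title" into the line that follows it, which is exactly the case where the extracted text alone cannot say whether the break was in the document.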


