Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/09/08 16:49:01 UTC

[jira] [Resolved] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)

     [ https://issues.apache.org/jira/browse/TIKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2459.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.17

Thank you for opening this and sharing a test file.  We hadn't seen \u0014 and \u0015 together in the same character run before.  This is now fixed.
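
For readers unfamiliar with these markers: in the Word binary (.doc) format, \u0013, \u0014 and \u0015 conventionally mark the start of a field, the separator between its field code and its displayed result, and the end of the field. The sketch below is not the change that went into Tika. It is only a minimal, self-contained illustration of one way to scan a single character run so that a \u0014 and a \u0015 arriving together (with the matching \u0013 in an earlier run) do not cause the text between or after them to be dropped. The class name FieldMarkerSketch and the method visibleText are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; this is NOT Tika's or POI's actual code.
final class FieldMarkerSketch {
    static final char FIELD_BEGIN = '\u0013';     // starts a field (field code follows)
    static final char FIELD_SEPARATOR = '\u0014'; // ends the field code, starts the result
    static final char FIELD_END = '\u0015';       // ends the field

    // Returns the text of one character run with field codes and markers removed.
    // Field results (the part between separator and end) are kept.
    static String visibleText(String run) {
        StringBuilder out = new StringBuilder();
        // One entry per field opened in this run: TRUE while still inside its field code.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int inCode = 0; // number of open fields that are still in their code part
        for (int i = 0; i < run.length(); i++) {
            char c = run.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                inCode++;
            } else if (c == FIELD_SEPARATOR) {
                // A separator whose field began in an earlier run finds an empty
                // stack here; it is simply skipped and the following text is kept.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    inCode--;
                }
            } else if (c == FIELD_END) {
                // Closing a field that never had a separator also leaves code mode.
                if (!openFields.isEmpty() && openFields.pop()) {
                    inCode--;
                }
            } else if (inCode == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Separator and end in the same run, with no begin marker in this run.
        System.out.println(visibleText("\u0014Paragraph one\u0015"));
        // A complete field inside one run: the code " PAGE " is dropped.
        System.out.println(visibleText("a\u0013 PAGE \u0014 3\u0015b"));
    }
}
```

The sketch treats each run independently, which is a simplifying assumption; a real extractor would likely carry field state across runs.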

> Missing text in .doc file (but can be extracted by POI)
> -------------------------------------------------------
>
>                 Key: TIKA-2459
>                 URL: https://issues.apache.org/jira/browse/TIKA-2459
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>         Environment: Windows and Linux
>            Reporter: Dustin Spicuzza
>             Fix For: 1.17
>
>         Attachments: foo2.doc
>
>
> I've got a document whose text can be extracted via org.apache.poi.hwpf.converter.WordToTextConverter, but is not fully extracted by Tika. The 'Paragraph one' paragraph is present in POI's extraction output but missing from Tika's output.
> Tika's output:
> {noformat}
> Something
> One:
> Else
> Two:
> Here
> Three:
> Four
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
> POI's output:
> {noformat}
> Something
> One:    Else
> Two:    Here
> Three:  Four
> Paragraph one
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)