Posted to dev@tika.apache.org by "Dustin Spicuzza (JIRA)" <ji...@apache.org> on 2017/09/05 22:52:00 UTC

[jira] [Updated] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)

     [ https://issues.apache.org/jira/browse/TIKA-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dustin Spicuzza updated TIKA-2459:
----------------------------------
    Attachment: foo2.doc

> Missing text in .doc file (but can be extracted by POI)
> -------------------------------------------------------
>
>                 Key: TIKA-2459
>                 URL: https://issues.apache.org/jira/browse/TIKA-2459
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>         Environment: Windows and Linux
>            Reporter: Dustin Spicuzza
>         Attachments: foo2.doc
>
>
> I have a .doc file whose text is fully extracted by org.apache.poi.hwpf.converter.WordToTextConverter but only partially by Tika. The 'Paragraph one' paragraph appears in POI's output but is missing from Tika's output.
> Tika's output:
> {noformat}
> Something
> One:
> Else
> Two:
> Here
> Three:
> Four
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
> POI's output:
> {noformat}
> Something
> One:    Else
> Two:    Here
> Three:  Four
> Paragraph one
> Paragraph two
> Paragraph three
> Paragraph four
> cc: Somebody
>      Somebody else
> Something here too
> {noformat}
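
The two outputs quoted above also differ in how the "One:/Two:/Three:" rows are laid out, so a line-by-line diff is noisy. One rough way to isolate the text that is actually missing is to compare whitespace-separated tokens instead. The sketch below uses only the Java standard library; the class name ExtractionDiff and the method missingTokens are hypothetical helpers for illustration, not part of Tika or POI, and the two strings are copied from the outputs in the report.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ExtractionDiff {

    // Returns the tokens of `reference` that have no remaining match in
    // `candidate`, treating both texts as multisets of whitespace-separated,
    // case-sensitive tokens. Line breaks and column spacing are ignored.
    static List<String> missingTokens(String reference, String candidate) {
        // Count how many times each token occurs in the candidate text.
        Map<String, Integer> available = new HashMap<>();
        for (String token : candidate.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                available.merge(token, 1, Integer::sum);
            }
        }
        // Walk the reference text and consume matching tokens one by one.
        List<String> missing = new ArrayList<>();
        for (String token : reference.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            int left = available.getOrDefault(token, 0);
            if (left > 0) {
                available.put(token, left - 1);
            } else {
                missing.add(token);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Outputs copied from the report above.
        String poi = "Something\nOne:    Else\nTwo:    Here\nThree:  Four\n"
                + "Paragraph one\nParagraph two\nParagraph three\nParagraph four\n"
                + "cc: Somebody\n     Somebody else\nSomething here too\n";
        String tika = "Something\nOne:\nElse\nTwo:\nHere\nThree:\nFour\n"
                + "Paragraph two\nParagraph three\nParagraph four\n"
                + "cc: Somebody\n     Somebody else\nSomething here too\n";
        System.out.println(missingTokens(poi, tika)); // prints [one, Paragraph]
    }
}
```

On the quoted outputs, the only unmatched tokens are "one" and one occurrence of "Paragraph", which is consistent with a single dropped paragraph rather than a general table-handling difference.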


