Posted to dev@tika.apache.org by "Dustin Spicuzza (JIRA)" <ji...@apache.org> on 2017/09/05 22:52:00 UTC

[jira] [Created] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)

Dustin Spicuzza created TIKA-2459:
-------------------------------------

             Summary: Missing text in .doc file (but can be extracted by POI)
                 Key: TIKA-2459
                 URL: https://issues.apache.org/jira/browse/TIKA-2459
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.16
         Environment: Windows and Linux
            Reporter: Dustin Spicuzza
         Attachments: foo2.doc

I have a document whose text can be extracted with org.apache.poi.hwpf.converter.WordToTextConverter but is not fully extracted by Tika. The 'Paragraph one' paragraph is present in POI's output and missing from Tika's output.

Tika's output:


{noformat}
Something
One:
Else
Two:
Here
Three:
Four

Paragraph two
Paragraph three
Paragraph four
cc: Somebody
     Somebody else
Something here too
{noformat}

POI's output:

{noformat}
Something
One:    Else
Two:    Here
Three:  Four

Paragraph one

Paragraph two

Paragraph three

Paragraph four


cc: Somebody
     Somebody else


Something here too
{noformat}
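
To make the difference between the two outputs easy to check, here is a minimal sketch that lists which lines of one extraction are absent from the other. It is not part of Tika or POI: the class name MissingLines, its helper methods, and the two string constants (pasted from the outputs in this report) are all hypothetical and exist only for illustration. Whitespace is collapsed before comparing, so that a label and value Tika emits on separate lines still match the single line POI emits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this report; not a Tika or POI class.
class MissingLines {
    // Tika's output as pasted in this report.
    static final String TIKA_OUTPUT = String.join("\n",
        "Something", "One:", "Else", "Two:", "Here", "Three:", "Four", "",
        "Paragraph two", "Paragraph three", "Paragraph four",
        "cc: Somebody", "     Somebody else", "Something here too");

    // POI's output as pasted in this report.
    static final String POI_OUTPUT = String.join("\n",
        "Something", "One:    Else", "Two:    Here", "Three:  Four", "",
        "Paragraph one", "", "Paragraph two", "", "Paragraph three", "",
        "Paragraph four", "", "", "cc: Somebody", "     Somebody else", "", "",
        "Something here too");

    // Trim and collapse every run of whitespace (including newlines) to one space.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Return the non-blank lines of 'reference' whose normalized text does not
    // occur, on word boundaries, anywhere in the normalized 'candidate' text.
    static List<String> missingLines(String reference, String candidate) {
        String haystack = " " + normalize(candidate) + " ";
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\n")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(" " + needle + " ")) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Lines present in POI's output but absent from Tika's.
        System.out.println(missingLines(POI_OUTPUT, TIKA_OUTPUT)); // prints [Paragraph one]
    }
}
```

With the outputs above, the only line reported is 'Paragraph one'. The differing line breaks around the "One:", "Two:" and "Three:" labels are ignored by this comparison because of the whitespace normalization.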

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)