Posted to dev@tika.apache.org by "Dustin Spicuzza (JIRA)" <ji...@apache.org> on 2017/09/05 22:52:00 UTC
[jira] [Created] (TIKA-2459) Missing text in .doc file (but can be extracted by POI)
Dustin Spicuzza created TIKA-2459:
-------------------------------------
Summary: Missing text in .doc file (but can be extracted by POI)
Key: TIKA-2459
URL: https://issues.apache.org/jira/browse/TIKA-2459
Project: Tika
Issue Type: Bug
Affects Versions: 1.16
Environment: Windows and Linux
Reporter: Dustin Spicuzza
Attachments: foo2.doc
I have a document whose text can be extracted with org.apache.poi.hwpf.converter.WordToTextConverter but is only partially extracted by Tika. The 'Paragraph one' paragraph appears in POI's output but is missing from Tika's output.
Tika's output:
{noformat}
Something
One:
Else
Two:
Here
Three:
Four
Paragraph two
Paragraph three
Paragraph four
cc: Somebody
Somebody else
Something here too
{noformat}
POI's output:
{noformat}
Something
One: Else
Two: Here
Three: Four
Paragraph one
Paragraph two
Paragraph three
Paragraph four
cc: Somebody
Somebody else
Something here too
{noformat}
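To make the difference between the two outputs easy to check, here is a minimal sketch (standard library only) that lists the lines of a reference extraction whose text never appears in a candidate extraction. The class name MissingLines and its methods are hypothetical helpers, not part of Tika or POI. Whitespace is collapsed before comparing, so a label and value that Tika puts on separate lines ("One:" / "Else") still match POI's single line ("One: Else"). It is a rough check: a missing line whose words happen to occur in the same order elsewhere in the candidate would not be reported.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing two text extractions of the same document.
public class MissingLines {

    // Collapse all runs of whitespace (including newlines) into single spaces.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    // Return the non-blank lines of `reference` whose normalized text does not
    // occur, on word boundaries, anywhere in the normalized `candidate`.
    static List<String> missingLines(String reference, String candidate) {
        String haystack = " " + normalize(candidate) + " ";
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(" " + needle + " ")) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The two outputs quoted in this report.
        String tika = String.join("\n",
            "Something", "One:", "Else", "Two:", "Here", "Three:", "Four",
            "Paragraph two", "Paragraph three", "Paragraph four",
            "cc: Somebody", "Somebody else", "Something here too");
        String poi = String.join("\n",
            "Something", "One: Else", "Two: Here", "Three: Four",
            "Paragraph one", "Paragraph two", "Paragraph three", "Paragraph four",
            "cc: Somebody", "Somebody else", "Something here too");

        // Prints each line present in POI's output but absent from Tika's.
        for (String line : missingLines(poi, tika)) {
            System.out.println(line);
        }
    }
}
```

Run against the two outputs above, this prints only "Paragraph one", which matches the paragraph reported as missing.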