Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/01/18 15:35:26 UTC

[jira] [Commented] (TIKA-1194) Missing text from MS Word (DOC) file

    [ https://issues.apache.org/jira/browse/TIKA-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828279#comment-15828279 ] 

Tim Allison commented on TIKA-1194:
-----------------------------------

[~tssk], I don't think the attached patch will help with the attached .doc file.  The triggering file is handled by the regular HWPFDocument, not the HWPFOldDocument.

The problem seems to be in the calculation of the number of cells in that particular row in the table.

I'm able to see the text if I iterate through all paragraphs (and ignore table info) or if I call {{.text()}} on the table.
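
To illustrate why a wrong cell count would drop text while paragraph iteration would not, here is a toy model. It is only a sketch: the Row class and both extract methods are hypothetical stand-ins, not POI or Tika classes, and it assumes one paragraph per cell for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableTextDemo {
    // Hypothetical stand-in for a table row: the paragraphs it really contains,
    // plus the number of cells the parser has calculated for it.
    static final class Row {
        final List<String> paragraphs;
        final int reportedCells;

        Row(List<String> paragraphs, int reportedCells) {
            this.paragraphs = paragraphs;
            this.reportedCells = reportedCells;
        }
    }

    // Cell-driven extraction: trusts the calculated cell count, so an
    // undercount silently drops the trailing paragraphs of that row.
    static List<String> extractByCells(List<Row> table) {
        List<String> out = new ArrayList<>();
        for (Row row : table) {
            int n = Math.min(row.reportedCells, row.paragraphs.size());
            for (int i = 0; i < n; i++) {
                out.add(row.paragraphs.get(i));
            }
        }
        return out;
    }

    // Paragraph-driven extraction: ignores the cell count entirely and
    // walks every paragraph in document order.
    static List<String> extractByParagraphs(List<Row> table) {
        List<String> out = new ArrayList<>();
        for (Row row : table) {
            out.addAll(row.paragraphs);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> table = Arrays.asList(
                new Row(Arrays.asList("A1", "B1"), 2),
                new Row(Arrays.asList("A2", "B2"), 1)); // cell count miscalculated
        System.out.println(extractByCells(table));      // [A1, B1, A2]
        System.out.println(extractByParagraphs(table)); // [A1, B1, A2, B2]
    }
}
```

In the toy model the second row reports one cell but holds two paragraphs, so the cell-driven walk loses "B2", which mirrors the single missing line in the report.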

> Missing text from MS Word (DOC) file
> ------------------------------------
>
>                 Key: TIKA-1194
>                 URL: https://issues.apache.org/jira/browse/TIKA-1194
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Tomas Safarik
>            Priority: Critical
>         Attachments: apache-tika-1.5.patch, OP-06-015.doc
>
>
> Hello,
> we noticed that the filtered text from some MS Word DOC files is missing one line (in a table cell) that is present in the original document.
> - If you add or remove one character anywhere before the problematic line/cell, the filtered text is correct. If you revert the text to the original, the filtering problem returns.
> - If the file is resaved as DOCX, filtering works fine.
> I will provide a sample document. Please let me know if more information is needed.
> Regards,
> Tomas


