Posted to dev@tika.apache.org by "Steve Gullion (JIRA)" <ji...@apache.org> on 2016/07/16 15:48:20 UTC

[jira] [Updated] (TIKA-2036) Deleted Text from Word File Shows Up in Extract

     [ https://issues.apache.org/jira/browse/TIKA-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Gullion updated TIKA-2036:
--------------------------------
    Description: 
A .docx file, with "track changes" on, includes deleted text. In this case, there are two overlapping deletions:

9.	[DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED, THEN DELETED:Arizona] New York law] (Intentionally omitted.)

The text should include only "9. (Intentionally omitted.)". However, the output is "9. This Agreement shall be governed by and construed in accordance with New York law." So it recognizes "Arizona" as deleted, but not the rest of the enclosing deletion.

Edit: this is worse than I originally thought. ALL deleted text is showing up in text exported from other Word docs. I saw this reported in 2011, and there was supposedly a patch, but apparently it doesn't work, or something else was changed. Is there an option somewhere that provides for the exclusion of deleted text generally?

  was:
A .docx file, with "track changes" on, includes deleted text. In this case, there are two overlapping deletions:

9.	[DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED, THEN DELETED:Arizona] New York law] (Intentionally omitted.)

The text should include only "9. (Intentionally omitted.)". However, the output is "9. This Agreement shall be governed by and construed in accordance with New York law." So it recognizes "Arizona" as deleted, but not the rest of the enclosing deletion.


> Deleted Text from Word File Shows Up in Extract
> -----------------------------------------------
>
>                 Key: TIKA-2036
>                 URL: https://issues.apache.org/jira/browse/TIKA-2036
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.13
>         Environment: Windows, under TikaOnDotNet
>            Reporter: Steve Gullion
>              Labels: word
>
> A .docx file, with "track changes" on, includes deleted text. In this case, there are two overlapping deletions:
> 9.	[DELETED:This Agreement shall be governed by and construed in accordance with [INSERTED, THEN DELETED:Arizona] New York law] (Intentionally omitted.)
> The text should include only "9. (Intentionally omitted.)". However, the output is "9. This Agreement shall be governed by and construed in accordance with New York law." So it recognizes "Arizona" as deleted, but not the rest of the enclosing deletion.
> Edit: this is worse than I originally thought. ALL deleted text is showing up in text exported from other Word docs. I saw this reported in 2011, and there was supposedly a patch, but apparently it doesn't work, or something else was changed. Is there an option somewhere that provides for the exclusion of deleted text generally?
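
For illustration only, the expected behavior described above can be sketched outside of Tika. In WordprocessingML, tracked deletions are stored as w:del elements whose runs carry w:delText instead of w:t, so collecting only w:t skips deleted text. The SAMPLE fragment and the visible_text helper below are hypothetical, modelled on the paragraph in this report; they are not taken from the reporter's file or from Tika's parser.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical paragraph modelled on the example in this report:
# one long deletion, with an insertion inside it that was itself deleted.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">9. </w:t></w:r>
  <w:del w:id="1" w:author="A"><w:r><w:delText xml:space="preserve">This Agreement shall be governed by and construed in accordance with </w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="A"><w:del w:id="3" w:author="B"><w:r><w:delText>Arizona</w:delText></w:r></w:del></w:ins>
  <w:del w:id="4" w:author="A"><w:r><w:delText xml:space="preserve"> New York law</w:delText></w:r></w:del>
  <w:r><w:t>(Intentionally omitted.)</w:t></w:r>
</w:p>"""


def visible_text(paragraph_xml: str) -> str:
    """Concatenate w:t text only; w:delText (tracked deletions) is ignored."""
    root = ET.fromstring(paragraph_xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))


if __name__ == "__main__":
    print(visible_text(SAMPLE))  # prints "9. (Intentionally omitted.)"
```

This only shows what "excluding deleted text" means at the XML level; whether and how Tika exposes such an option is the open question of this issue.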



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)