Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/08/27 14:33:56 UTC

Deleted text in Word document

When Tika extracts from a Microsoft Word document, deleted text is extracted, with no indication that it is deleted.  In fact, if a word was deleted and replaced by another word, both words just show up side-by-side.  Is there a way to get some sort of annotation that indicates the status of the text?  Or extract it in some sort of structured (e.g., XML) format?  Similarly for highlighted text or other mark-up.  Any way to get that?

For example
[image]

"Time of Essence" was changed to "Time of Importance"

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>



Re: Deleted text in Word document

Posted by Peter Kronenberg <pe...@torch.ai>.
Thanks. I'll at least try the flag and see if that improves things.


________________________________
From: Tim Allison <ta...@apache.org>
Sent: Saturday, August 28, 2021 9:26:01 AM
To: user@tika.apache.org <us...@tika.apache.org>
Subject: Re: Deleted text in Word document

> It _might_ be fairly trivial to add.

It isn’t, at least for docx. The challenge, IIRC, is that the deleted content flag can be on a run, a paragraph, a table, etc. You’d have to add deleted checks on every element.

So, definitely possible, but non-trivial.

In addition to deleted text, there’s also “movefrom” text which we should handle at the same time if we fix this for deleted text.

Finally, it looks like includeDeletedContent may not work correctly for docx. :( I’d need to check when I’m back at a keyboard.

If this is important for you, please open an issue with example documents.






RE: Deleted text in Word document

Posted by Peter Kronenberg <pe...@torch.ai>.
Just tested this and you’re right.  It doesn’t work for docx files.  It works fine for doc files.






Re: Deleted text in Word document

Posted by Tim Allison <ta...@apache.org>.
> It _might_ be fairly trivial to add.

It isn’t, at least for docx. The challenge, IIRC, is that the deleted content
flag can be on a run, a paragraph, a table, etc. You’d have to add deleted
checks on every element.

So, definitely possible, but non-trivial.

In addition to deleted text, there’s also “movefrom” text which we should
handle at the same time if we fix this for deleted text.

Finally, it looks like includeDeletedContent may not work correctly for
docx. :( I’d need to check when I’m back at a keyboard.

If this is important for you, please open an issue with example documents.
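
To illustrate why the checks are spread across elements, here is a minimal sketch in plain Java (JDK only, no Tika). It is not Tika's implementation. It assumes the WordprocessingML convention that tracked deletions are wrapped in w:del (with the text in w:delText), insertions in w:ins, and moved-away text in w:moveFrom, and that an empty w:del can also appear inside paragraph or row properties. The class name, the bracket markers, and the sample XML are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TrackedChangeSketch {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of a WordprocessingML fragment with tracked changes marked. */
    public static String annotate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        StringBuilder out = new StringBuilder();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            int del, ins, moveFrom; // nesting depth of each tracked-change wrapper
            String open;            // marker opened for the current text element, if any
            boolean inText;         // true while inside w:t or w:delText

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (!W.equals(uri)) return;
                switch (local) {
                    case "del": del++; break;
                    case "ins": ins++; break;
                    case "moveFrom": moveFrom++; break;
                    case "t":
                    case "delText":
                        inText = true;
                        // Markers are emitted per text element, so an empty w:del
                        // in paragraph or row properties produces no output.
                        open = del > 0 ? "del" : moveFrom > 0 ? "moveFrom" : ins > 0 ? "ins" : null;
                        if (open != null) out.append('[').append(open).append(']');
                        break;
                    default: break;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W.equals(uri)) return;
                switch (local) {
                    case "del": del--; break;
                    case "ins": ins--; break;
                    case "moveFrom": moveFrom--; break;
                    case "t":
                    case "delText":
                        if (open != null) out.append("[/").append(open).append(']');
                        inText = false;
                        open = null;
                        break;
                    default: break;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) out.append(ch, start, length);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"" + W + "\">"
                + "<w:r><w:t xml:space=\"preserve\">Time of </w:t></w:r>"
                + "<w:del w:id=\"1\" w:author=\"a\"><w:r><w:delText>Essence</w:delText></w:r></w:del>"
                + "<w:ins w:id=\"2\" w:author=\"a\"><w:r><w:t>Importance</w:t></w:r></w:ins>"
                + "</w:p>";
        // prints: Time of [del]Essence[/del][ins]Importance[/ins]
        System.out.println(annotate(xml));
    }
}
```

The sketch only handles run-level wrappers. A deleted paragraph mark or table row carries its flag in the element's properties instead, which is the part that would need separate handling in a real extractor.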

On Sat, Aug 28, 2021 at 9:13 AM Tim Allison <ta...@apache.org> wrote:

>
> You can turn off the extraction of deleted text via the
> OfficeParserConfig#setIncludeDeletedContent
>
> However, I agree that it would be an improvement to add div tags for
> deleted text.  I haven’t been in this part of the codebase in a while. It
> _might_ be fairly trivial to add.

Re: Deleted text in Word document

Posted by Tim Allison <ta...@apache.org>.
You can turn off the extraction of deleted text via the
OfficeParserConfig#setIncludeDeletedContent

However, I agree that it would be an improvement to add div tags for
deleted text.  I haven’t been in this part of the codebase in a while. It
_might_ be fairly trivial to add.
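
As a rough sketch of how that setting might be applied from Java: the config object is placed on the ParseContext that is passed to the parser. This assumes tika-core and the Microsoft parsers are on the classpath; the class name and the file path argument are hypothetical. Note that later messages in this thread report the flag taking effect for doc but not for docx files, so the output should be checked against your own documents.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;

public class DeletedContentExample {

    /** Builds a ParseContext that carries the deleted-content setting. */
    static ParseContext contextFor(boolean includeDeleted) {
        OfficeParserConfig config = new OfficeParserConfig();
        config.setIncludeDeletedContent(includeDeleted);
        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, config);
        return context;
    }

    /** Parses the file at the given path and returns Tika's XHTML output. */
    static String toXhtml(String path, boolean includeDeleted) throws Exception {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), contextFor(includeDeleted));
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to a Word document with tracked changes.
        System.out.println(toXhtml(args[0], false));
    }
}
```

Requesting the XHTML output rather than plain text also makes it easier to see exactly which markup, if any, surrounds the affected runs.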



RE: Deleted text in Word document

Posted by Peter Kronenberg <pe...@torch.ai>.
No, it doesn't appear to.  Here's what I get:



<p class="list_Paragraph">12.2 <u>Time of EssenceImportance</u>. Time is of the essenceimportance with respect  <REDACTED>.</p>



-----Original Message-----
From: Nick Burch <ap...@gagravarr.org> 
Sent: Friday, August 27, 2021 11:10 AM
To: user@tika.apache.org
Subject: Re: Deleted text in Word document

How are you calling Tika? Is the XHTML output sufficiently marked-up to let you spot it?

Nick

Re: Deleted text in Word document

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 27 Aug 2021, Peter Kronenberg wrote:
> When Tika extracts from a Microsoft Word document, deleted text is 
> extracted, with no indication that it is deleted.  In fact, if a word 
> was deleted and replaced by another word, both words just show up 
> side-by-side.  Is there a way to get some sort of annotation that 
> indicates the status of the text?  Or extract it in some sort of 
> structured (e.g., XML) format?

How are you calling Tika? Is the XHTML output sufficiently marked-up to 
let you spot it?

Nick