Posted to dev@tika.apache.org by "Sam Stephens (Jira)" <ji...@apache.org> on 2022/04/05 00:26:00 UTC

[jira] [Comment Edited] (TIKA-3711) Image file names included in parsed Word Document text

    [ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167 ] 

Sam Stephens edited comment on TIKA-3711 at 4/5/22 12:25 AM:
-------------------------------------------------------------

Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is image.png. I think Word is generating generic filenames rather than keeping the original one.

As for breaking other users, I'm raising this bug because this *is* a change in behavior, and it has broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text.

Looking at the actual generated XHTML with ToXMLContentHandler, I think the output is also going to be surprising for most users:

{{<p><img src="embedded:image.png" alt="" /></p>}}
{{<div class="package-entry"><h1>image.png</h1>}}
{{</div>}}

My image actually has alt text, but it isn't included. And I think including the image file name as a heading in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headings or visible text of any kind).
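
For anyone hitting the same thing, one possible stopgap is to drop that block on the caller's side with a SAX filter. This is only a rough sketch using the JDK's own SAX classes: PackageEntryFilter and visibleText are names I made up, and it assumes the unwanted block is always a div with class="package-entry", as in the output above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: hides everything inside a div with class "package-entry". */
public class PackageEntryFilter extends XMLFilterImpl {

    private int skipDepth = 0; // greater than 0 while inside a package-entry div

    public PackageEntryFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // localName is empty when the producer is not namespace-aware
        String name = localName.isEmpty() ? qName : localName;
        if (skipDepth > 0) {
            skipDepth++; // nested element inside the skipped div
        } else if ("div".equals(name) && "package-entry".equals(atts.getValue("class"))) {
            skipDepth = 1; // start skipping
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--; // leaving a skipped element
        } else {
            super.endElement(uri, localName, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    /** Demo helper: text content of an XML string with package entries removed. */
    static String visibleText(String xhtml) {
        StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new PackageEntryFilter(collector));
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<body><p>Hello</p>"
                + "<div class=\"package-entry\"><h1>image.png</h1></div></body>";
        System.out.println(visibleText(xhtml)); // prints "Hello"
    }
}
```

In principle the same filter could wrap whatever handler is passed to the parser (for example a BodyContentHandler), but I have not tested that against Tika 2.3.0.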

As an end user, what I'd like is the XHTML to be

{{<p><img src="embedded:image.png" alt="Test Alt Text" /></p>}}

I'd also like the text from BodyContentHandler to not include the image at all. That way the text is just the text, and if I'm interested in image alt text, I can operate on the XHTML.
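
To illustrate what I mean by operating on the XHTML, here is a minimal sketch that pulls the alt text out of img elements using only the JDK's DOM parser. ImageAltText and altTexts are made-up names, and it assumes the XHTML string is well-formed XML with no entities the parser can't resolve.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ImageAltText {

    /** Collects the non-empty alt attributes of every img element in an XHTML string. */
    static List<String> altTexts(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        List<String> result = new ArrayList<>();
        NodeList images = doc.getElementsByTagName("img");
        for (int i = 0; i < images.getLength(); i++) {
            // getAttribute returns an empty string when the attribute is missing
            String alt = ((Element) images.item(i)).getAttribute("alt");
            if (!alt.isEmpty()) {
                result.add(alt);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xhtml = "<body><p><img src=\"embedded:image.png\" alt=\"Test Alt Text\" /></p></body>";
        System.out.println(altTexts(xhtml)); // prints [Test Alt Text]
    }
}
```

With the XHTML Tika 2.3.0 currently produces for my document, this would return an empty list, since the alt attribute comes through empty.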

If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image.


was (Author: JIRAUSER287416):
Regarding filenames, I don't think they will ever be semantically meaningful. I just created a document with Word 365 (uploaded as word-doc-with-image-from-word-365.docx), added a picture with the filename test-image.png, and the extracted filename is still image1.png. I think Word is creating non-interesting filenames.

As far as breaking other users, I'm raising this bug because this *is* a change in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where images are not part of the text.

Using the ToXMLContentHandler to look at the actual generated HTML, I think that's also going to be surprising behavior for most users.

{{<p><img src="embedded:image.png" alt="" /></p>}}
{{<div class="package-entry"><h1>image.png</h1>}}
{{</div>}}

My image actually has alt text; that's not included. And I think including the image file name as a header in the markup is going to be surprising to almost every user. It certainly doesn't match the source document (which has no headers, or visible text of any kind).

As an end user, what I'd like is the XHTML to be

{{<p><img src="embedded:image.png" alt="Test Alt Text" /></p>}}

And the text from BodyContentHandler to not include the image at all. That way the text is the text, and if I have an interest in Image alt tags, I can operate on the XHTML.

If you wanted to include an option to provide text for the image, I don't think image filenames will ever be useful from Word; alt text is the right place semantically to be looking for a textual representation of an image.

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Minor
>         Attachments: word-doc-with-image-from-word-365.docx, word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  


