Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/04/01 13:32:00 UTC

[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

    [ https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515919#comment-17515919 ] 

Tim Allison commented on TIKA-3711:
-----------------------------------

I introduced that change because some parsers were including the embedded file name in the content and some were not.  That gave us different behavior for different file types, which was less than ideal.

I included this bullet in the CHANGES.txt file to flag the changed behavior:

bq.    * Improve consistency in reporting package-entry divs across all parsers for embedded files (TIKA-3644). This leads to some more text (embedded file names) in files with many embedded attachments.

We can change the behavior to "include the file name only in XHTML attributes", so that it will not show up in the extracted text.  But we should do that consistently for all file types.
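
To illustrate why the attribute-only option keeps the name out of the text, here is a minimal sketch using only the JDK's SAX parser, not Tika itself. The two XHTML snippets and the data-name attribute are assumptions made up for this example; only the package-entry class comes from the CHANGES.txt bullet, and the real markup Tika emits may differ. The handler collects character events only, which is roughly how a plain-text handler behaves.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class AttributeOnlyNameSketch {

    /** Collects character content only and ignores all attributes. */
    static String textOf(String xhtml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Attributes are available here, but a text-only
                // handler never copies them into its output.
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical markup: file name written as element content.
        String nameAsText =
                "<div class=\"package-entry\"><h1>image1.png</h1></div>";
        // Hypothetical markup: file name carried only in an attribute
        // (the attribute name data-name is invented for this sketch).
        String nameAsAttribute =
                "<div class=\"package-entry\" data-name=\"image1.png\"/>";

        System.out.println("[" + textOf(nameAsText) + "]");
        System.out.println("[" + textOf(nameAsAttribute) + "]");
    }
}
```

Run as written, the first line printed is [image1.png] and the second is [], which matches the 2.2.0-style result for a document that contains nothing but an image.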

Fellow devs, what do you think?

> Image file names included in parsed Word Document text
> ------------------------------------------------------
>
>                 Key: TIKA-3711
>                 URL: https://issues.apache.org/jira/browse/TIKA-3711
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  


