Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/11/28 14:24:00 UTC
[jira] [Comment Edited] (TIKA-2510) Embedded MP3 file in PPTX
document no longer identified
[ https://issues.apache.org/jira/browse/TIKA-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268793#comment-16268793 ]
Tim Allison edited comment on TIKA-2510 at 11/28/17 2:23 PM:
-------------------------------------------------------------
Figured out the difference between 1.14 and 1.16. We added the set {{seen}} so that we would process each media item only once. In 1.16:
1. the first relationship to the mp3 file was type {{http://schemas.microsoft.com/office/2007/relationships/media}}
2. we added the mp3 to {{seen}}
3. we then didn't process it because we didn't recognize that relationship
4. the second relationship to the mp3 file was type {{http://schemas.openxmlformats.org/officeDocument/2006/relationships/audio}}
5. In 1.14, we processed this, but in 1.16, because we had added that media item to {{seen}} (without processing it!), we skipped it.
The correct fix is to add embedded media items to {{seen}} only after they are processed.
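A minimal sketch of the ordering problem described above, not Tika's actual code. The class, the methods, the {{Rel}} type, and the part name {{/ppt/media/media1.mp3}} are all hypothetical; only the two relationship type URIs come from this comment. It assumes the parser recognizes the {{audio}} relationship type but not the {{media}} one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not Tika's real classes.
class SeenOrderingSketch {

    // The two relationship types quoted in the comment.
    static final String MEDIA =
            "http://schemas.microsoft.com/office/2007/relationships/media";
    static final String AUDIO =
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships/audio";

    // Stand-in for a package relationship: its type URI and the part it points to.
    static final class Rel {
        final String type;
        final String target;

        Rel(String type, String target) {
            this.type = type;
            this.target = target;
        }
    }

    // Assumption for this sketch: only the audio relationship type is recognized.
    static boolean isRecognized(String type) {
        return AUDIO.equals(type);
    }

    // 1.16-style ordering: the target goes into seen before we know
    // whether it will actually be processed.
    static List<String> processMarkingFirst(List<Rel> rels) {
        Set<String> seen = new HashSet<>();
        List<String> processed = new ArrayList<>();
        for (Rel rel : rels) {
            if (!seen.add(rel.target)) {
                continue; // already in seen, so skipped even if never processed
            }
            if (isRecognized(rel.type)) {
                processed.add(rel.target);
            }
        }
        return processed;
    }

    // Proposed ordering: the target goes into seen only after it is processed.
    static List<String> processMarkingAfter(List<Rel> rels) {
        Set<String> seen = new HashSet<>();
        List<String> processed = new ArrayList<>();
        for (Rel rel : rels) {
            if (seen.contains(rel.target)) {
                continue; // processed earlier through another relationship
            }
            if (isRecognized(rel.type)) {
                processed.add(rel.target);
                seen.add(rel.target);
            }
        }
        return processed;
    }

    // Two relationships to the same (hypothetical) mp3 part, in the order
    // described above: media first, then audio.
    static List<Rel> exampleRelationships() {
        String mp3 = "/ppt/media/media1.mp3";
        return Arrays.asList(new Rel(MEDIA, mp3), new Rel(AUDIO, mp3));
    }
}
```

With the example relationships, {{processMarkingFirst}} returns an empty list because the unrecognized {{media}} relationship claims the mp3 first, while {{processMarkingAfter}} returns the mp3 once.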
> Embedded MP3 file in PPTX document no longer identified
> -------------------------------------------------------
>
> Key: TIKA-2510
> URL: https://issues.apache.org/jira/browse/TIKA-2510
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15
> Reporter: Eamonn Saunders
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 1.17
>
> Attachments: Windows Audio File.pptx, tika-1.14-output.json, tika-1.15-output.json
>
>
> I'm attaching a sample PPTX file with an embedded MP3 file along with JSON files produced by Tika App (versions 1.14 and 1.15).
> Notice that the 1.14 output identifies the embedded MP3 file while the 1.15 version does not.