Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/04/26 16:52:00 UTC

[jira] [Comment Edited] (TIKA-3738) ForkParser missing metadata for some document formats

    [ https://issues.apache.org/jira/browse/TIKA-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528292#comment-17528292 ] 

Tim Allison edited comment on TIKA-3738 at 4/26/22 4:51 PM:
------------------------------------------------------------

This issue appears to go all the way back to the original ForkParser. To confirm, this isn't a new issue, right?

With ForkParser, the only metadata that comes through is what is written into the xhtml.

For example, if we use a regular parser, we get a lot more metadata in the metadata object than we do in the xhtml. This is expected: the content of the metadata object is dumped into the xhtml as soon as the first character is written to the body, so anything a parser adds to the metadata after that point shows up only in the metadata object.

xhtml head:
{noformat}
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.mp4.MP4Parser" />
<meta name="dc:title" content="Test Title" />
<meta name="Content-Type" content="audio/mp4" />
<title>Test Title</title>
</head>
{noformat}

Metadata:
{noformat}
Content-Type : audio/mp4
X-TIKA:EXCEPTION:warn : End of data reached.
X-TIKA:Parsed-By : org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By : org.apache.tika.parser.mp4.MP4Parser
X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By-Full-Set : org.apache.tika.parser.mp4.MP4Parser
dc:creator : Test Artist
dc:title : Test Title
dcterms:created : 2012-01-28T18:39:18Z
dcterms:modified : 2012-01-28T18:40:25Z
xmp:CreatorTool : iTunes 10.5.3.3
xmpDM:album : Test Album
xmpDM:albumArtist : Test Album Artist
xmpDM:artist : Test Artist
xmpDM:audioChannelType : Stereo
xmpDM:audioCompressor : M4A
xmpDM:audioSampleRate : 44100
xmpDM:compilation : 0
xmpDM:composer : Test Composer
xmpDM:discNumber : 6
xmpDM:duration : 0.07
xmpDM:genre : Test Genre
xmpDM:logComment : Test Comments
xmpDM:releaseDate : 2008
xmpDM:trackNumber : 1
{noformat}
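
To make the timing concrete, here is a minimal, self-contained sketch of that "dump the metadata on the first body write" behavior. It does not use Tika's classes: {{SnapshotHandler}}, {{LazyHeadSketch}} and {{parse}} are hypothetical names made up for illustration, and which keys are set before or after the first body write is an assumption chosen to match the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LazyHeadSketch {

    /** Hypothetical stand-in for a handler that writes the head lazily. Not a Tika class. */
    static final class SnapshotHandler {
        private final Map<String, String> metadata;
        private final StringBuilder out = new StringBuilder();
        private boolean headWritten = false;

        SnapshotHandler(Map<String, String> metadata) {
            this.metadata = metadata;
        }

        /** The first body write triggers the one-time dump of whatever metadata exists so far. */
        void characters(String text) {
            if (!headWritten) {
                writeHead();
            }
            out.append(text);
        }

        /** Writes one meta element per entry; attribute escaping is omitted for brevity. */
        private void writeHead() {
            out.append("<head>\n");
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                out.append("<meta name=\"").append(e.getKey())
                   .append("\" content=\"").append(e.getValue()).append("\" />\n");
            }
            out.append("</head>\n<body>");
            headWritten = true;
        }

        String finish() {
            if (!headWritten) {
                writeHead();
            }
            out.append("</body>");
            return out.toString();
        }
    }

    /** Simulates a parser that sets some metadata before the body and some after it. */
    static String parse(Map<String, String> metadata) {
        SnapshotHandler handler = new SnapshotHandler(metadata);

        // Assumed to be known before any body text is written.
        metadata.put("Content-Type", "audio/mp4");
        metadata.put("dc:title", "Test Title");

        handler.characters("some body text");

        // Assumed to be added after the head has already been written.
        metadata.put("dc:creator", "Test Artist");
        metadata.put("xmpDM:album", "Test Album");

        return handler.finish();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String xhtml = parse(metadata);
        System.out.println(xhtml);
        System.out.println("Metadata keys: " + metadata.keySet());
    }
}
```

In this sketch the head ends up with only Content-Type and dc:title, while the map holds all four keys. If only the xhtml travels back to the caller, the later keys are lost, which would be consistent with what the reporter is seeing.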

I haven't looked at this part of the codebase in a while, and I'm frankly trying to figure out how any metadata comes back.

Will update when I figure that out. :D


> ForkParser missing metadata for some document formats
> -----------------------------------------------------
>
>                 Key: TIKA-3738
>                 URL: https://issues.apache.org/jira/browse/TIKA-3738
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>         Environment: Java 11.0.14.
>            Reporter: Stephen H
>            Priority: Major
>         Attachments: ForkParserIntegrationTest.java.diff, testVideoMetadataMp4.mp4
>
>
> When using ForkParser, metadata from some parsers is not being returned in the Metadata object or in the head of the returned XML. These include OpenDocument Presentation (ODP), OpenDocument Spreadsheet (ODS), Microsoft Word 2006 XML, MP4 Audio (M4A) and MP4 Video (MP4).
> A patch for ForkParserIntegrationTest showing the issue for these file types is attached, along with an MP4 video file containing metadata, as there doesn't appear to be one in the current test set.


