Posted to user@tika.apache.org by Tucker B <ba...@gmail.com> on 2020/05/11 15:40:08 UTC

Missing XMP Metadata from PDF

I have a PDF whose XMP metadata contains two rdf:Description elements with
different namespaces: the first is DublinCore, the other is
XMPSchemaBasic. I can confirm that jempbox reads the XMP metadata and
identifies both namespaces correctly. However, it appears that the
PDFParser in Tika does not add the XMPSchemaBasic metadata to the extracted
metadata, specifically the CreateDate. I'm curious whether this is expected
behaviour. Ideally, the PDFParser would set TikaCoreProperties.CREATED
to the value in the XMP metadata when PDDocumentInformation has no created
date, or at least set a Property such as "xmp:CreateDate".
I've attached the XMP packet and a PDF with the XMP metadata. I'm using
Tika 1.24.1. Any help or guidance would be greatly appreciated.

Also, I noticed the XMP packet id is "W5M0MpCehiHzreSzNTczkc9d" which is
base64 encoded string "[42!573]". Curious if anyone knows the
significance of this.
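
For readers without the attachment, here is a minimal sketch of the kind of
packet described above: two rdf:Description blocks, one Dublin Core and one
XMP Basic. The sample packet and its values are hypothetical (it is not the
attached file), and the class and method names are made up for illustration.
It uses only the JDK's DOM parser, not jempbox or Tika, and it only handles
properties written as child elements, not as attributes on rdf:Description.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class XmpCreateDateSketch {

    // Namespace URIs for XMP Basic and Dublin Core.
    static final String XMP_NS = "http://ns.adobe.com/xap/1.0/";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    // Hypothetical sample packet, NOT the one attached to this thread.
    static final String SAMPLE =
        "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
      + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
      + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC_NS + "\">"
      + "<dc:format>application/pdf</dc:format>"
      + "</rdf:Description>"
      + "<rdf:Description rdf:about=\"\" xmlns:xmp=\"" + XMP_NS + "\">"
      + "<xmp:CreateDate>2020-05-11T11:40:08-04:00</xmp:CreateDate>"
      + "</rdf:Description>"
      + "</rdf:RDF>"
      + "</x:xmpmeta>";

    // Returns the text of the first element with the given namespace URI and
    // local name, or null if the packet has no such element.
    static String firstValue(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookup
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(namespaceUri, localName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("dc:format      = " + firstValue(SAMPLE, DC_NS, "format"));
        System.out.println("xmp:CreateDate = " + firstValue(SAMPLE, XMP_NS, "CreateDate"));
    }
}
```

Looking values up by namespace URI rather than by prefix means the result
does not depend on which rdf:Description block a property sits in.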

Re: Missing XMP Metadata from PDF

Posted by Tim Allison <ta...@apache.org>.

Yep, that's a problem.  Thank you!

https://issues.apache.org/jira/browse/TIKA-3101

On Mon, May 11, 2020 at 2:24 PM Tim Allison <ta...@apache.org> wrote:

> Thank you for letting us know about this and sharing a file.  My belief is
> that we should be trusting the XMP metadata over the PDFInfo for DC
> metadata keys like TikaCoreProperties.CREATED.  I'll take a look.

Re: Missing XMP Metadata from PDF

Posted by Tim Allison <ta...@apache.org>.

Thank you for letting us know about this and sharing a file.  My belief is
that we should be trusting the XMP metadata over the PDFInfo for DC
metadata keys like TikaCoreProperties.CREATED.  I'll take a look.
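
The precedence suggested above (prefer the XMP value, fall back to the
document information dictionary) could be sketched roughly as follows. This
is only an illustration of the idea, not Tika's code: the class, the method
name, and the use of plain strings for the dates are all assumptions.

```java
import java.util.Optional;

class CreatedDatePrecedence {

    // Hypothetical helper: prefer the XMP CreateDate when present, otherwise
    // fall back to the creation date from the PDF document information.
    static Optional<String> resolveCreated(Optional<String> xmpCreateDate,
                                           Optional<String> infoCreationDate) {
        return xmpCreateDate.isPresent() ? xmpCreateDate : infoCreationDate;
    }

    public static void main(String[] args) {
        // Both sources present: the XMP value is used.
        System.out.println(resolveCreated(
            Optional.of("2020-05-11T11:40:08-04:00"),
            Optional.of("2019-01-01T00:00:00Z")));
        // XMP value missing: the document information value is used.
        System.out.println(resolveCreated(
            Optional.empty(),
            Optional.of("2019-01-01T00:00:00Z")));
    }
}
```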
