Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/03/08 18:50:09 UTC

[DISCUSS] options for XMP parsing?

All,

  PDFBox 2.0 is soon to be released.  In the course of its development, the project has migrated from Jempbox (which we're now using) to XmpBox; and Jempbox is now on its last legs.  
  
  XmpBox was "written for PDF/A checking," not for robust processing of common variants of XMPs in the wild; I found that it fails on roughly 40% of XMPs I pulled out of PDFs from govdocs1/commoncrawl.
 
  In short, we can't migrate to XmpBox, and Jempbox is at the end of its life.
  
  Has anyone had any luck with an Apache-friendly XMP parser?  Are there better options than copying and pasting jempbox into Tika and maintaining it ourselves (yuk!)?

          Best,

                 Tim

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, March 08, 2016 12:13 PM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?

I think the problem is that XmpBox was written for PDF/A checking, so it fails with XMPs that are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf

And no, there are no plans for anything on XMP at this time...

Tilman
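
To illustrate the point about schemas outside the PDF/A set, here is a minimal sketch that lists the namespace URIs declared in an XMP packet, so they can be compared against whatever a strict parser accepts. It uses only the JDK's DOM parser. The class and method names (XmpNamespaceLister, declaredNamespaces) are hypothetical and are not part of PDFBox, XmpBox, or Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not an existing Tika or PDFBox class.
public class XmpNamespaceLister {

    /** Returns every namespace URI declared anywhere in the given XMP/RDF XML. */
    public static Set<String> declaredNamespaces(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // XMP packets never need a DOCTYPE, so refuse them outright.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Set<String> uris = new TreeSet<>();
            collect(doc.getDocumentElement(), uris);
            return uris;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP as XML", e);
        }
    }

    /** Walks the element tree and records the value of every xmlns declaration. */
    private static void collect(Element element, Set<String> uris) {
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            String name = attr.getNodeName();
            if (name.equals("xmlns") || name.startsWith("xmlns:")) {
                uris.add(attr.getNodeValue());
            }
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                collect((Element) children.item(i), uris);
            }
        }
    }
}
```

Run against a packet like the one in 000142.pdf, this would report http://ns.adobe.com/pdfx/1.3/ among the declared namespaces.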


On 07.03.2016 at 19:31, Allison, Timothy B. wrote:
> All,
>
>
>
>    When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox.  I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.
>
>
>
>    I’m including a table below of the counts of exception messages.  Are there any plans to make XMPBox more lenient or is this what we can expect going forward?
>
>
>
>    As always, I’m more than happy to help with files and tests.  Let me know what I can do.
>
>
>
>               Cheers,
>
>
>
>                        Tim
>
>
>
> No XmpParsingException on 42,022 files.
>
> Exceptions:
>
>  Count  Exception message
>  -----  -----------------
>  13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
>   3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>   3113  Missing pdfaSchema:property in type definition
>   2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>    927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
>    723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
>    710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
>    654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
>    522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>    492  Failed to parse
>    370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
>    262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
>    188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
>    144  Failed to instanciate property in xmp:CreateDate
>    125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
>     94  Expecting local name 'xmpmeta' and found 'xapmeta'
>     84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
>     74  Failed to instanciate property in xap:CreateDate
>     68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
>     49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
>     46  Cannot find a definition for the namespace http://www.sap.com
>     33  Failed to instanciate property in exif:ColorSpace
>     28  Failed to instanciate property in xmpMM:History
>     26  xmp should start with a processing instruction
>     24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
>     21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
>     14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
>     14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
>     12  Failed to instanciate property in xmp:MetadataDate
>     10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
>     10  Failed to instanciate property in xap:ModifyDate
>     10  Failed to instanciate property in xmp:ModifyDate
>      9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>      8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
>      8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>      7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
>      7  Cannot find a definition for the namespace ptc
>      6  Failed to instanciate property in xapMM:History
>      5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
>      5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
>      4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
>      4  Excepted xpacket 'end' attribute (must be present and placed in first)
>      3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
>      3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>      2  no message (NPE)
>      2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
>      2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>      2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
>      2  Failed to instanciate property in xapRights:Marked
>      2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
>      2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
>      2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
>      1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
>      1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
>      1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
>      1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
>      1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
>      1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
>      1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
>      1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
>      1  Failed to instanciate property in xmpRights:Marked
>      1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
>      1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org
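
For anyone who wants to reproduce a table like the one above against a different parser, here is a rough sketch of a tally harness. It assumes the XMP packets have already been extracted to byte arrays. The XmpParser interface is a hypothetical stand-in for whichever parser is under test (XMPBox, Jempbox, xmpcore, ...), and ExceptionTally is not an existing Tika class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical harness for illustration only.
public class ExceptionTally {

    /** Stand-in for whatever XMP parser is being evaluated. */
    public interface XmpParser {
        void parse(byte[] xmp) throws Exception;
    }

    /** Runs the parser over every packet and counts failures by exception message. */
    public static Map<String, Integer> tally(List<byte[]> packets, XmpParser parser) {
        Map<String, Integer> counts = new HashMap<>();
        for (byte[] packet : packets) {
            try {
                parser.parse(packet);
            } catch (Exception e) {
                // Exceptions without a message (e.g. NPEs) are grouped by class name.
                String message = e.getMessage() == null
                        ? "no message (" + e.getClass().getSimpleName() + ")"
                        : e.getMessage();
                counts.merge(message, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the tallied entries sorted by descending count. */
    public static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return entries;
    }
}
```

Grouping on the raw message is crude, since messages that embed file-specific values will not collapse into one row, but it is enough to see which failure classes dominate.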


RE: [DISCUSS] options for XMP parsing?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you!  Will take a look at the SO link, and I'll see if I can dig up any of these in our regression testing corpus.

-----Original Message-----
From: Ray Gauss [mailto:ray.gauss@alfresco.com] 
Sent: Monday, March 14, 2016 1:06 PM
To: dev@tika.apache.org
Subject: Re: [DISCUSS] options for XMP parsing?

Hi Tim,

Consolidated handling of XMP would be great; I'm glad you're taking a look at it, and I'll try to help out where I can.

> You've been happy with it at Alfresco? 

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray


[1] http://stackoverflow.com/a/22661992


> On Mar 10, 2016, at 9:38 AM, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> Hi Ray,
>  Got it.  Thank you.
> 
> That'd be great.  In follow-up discussion with the PDFBox devs, they mentioned that XMPBox's failure to handle non-PDF/A files is not a design feature/restriction, only a matter of patching and building out their current code base.  The downside is that there's quite a bit to do; the upside is that it is a living code base.
> 
> I'll experiment with Adobe's xmp-core.  If you have any pointers/examples, let me know...I'll be starting with: https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/. You've been happy with it at Alfresco? 
> 
> No matter which package we use, it would be nice to build out uniform extraction of XMP for all image and PDF files for the common elements -- with special handling by file type if necessary.  As you mentioned, it would also be great to add or modify our XMPScanner to extract all XMP packets from a file...I've started dabbling with this here: https://github.com/tballison/tika/tree/xmp_scanner .  I'd be interested to hear more about what happens with InDesign files. In our own test set, we have a PDF file with two packets containing conflicting authorship info IIRC! :)  It would be nice to expose both the canonical XMP info (with proper processing of "later-xmp-overrides-earlier") as well as all of the info that can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two different use cases.
> 
> Thank you, again.
> 
>             Cheers,
> 
>                   Tim
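
The two use cases described above (a canonical view where later XMP overrides earlier, and an exhaustive view of everything scraped from all packets) can be sketched as follows. This assumes each packet has already been parsed into a flat property-name-to-value map. The class XmpPacketMerger and the flat-map representation are assumptions made for illustration, not how Tika or xmpcore model XMP.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
public class XmpPacketMerger {

    /** Canonical view: for each property, the value from the last packet that sets it. */
    public static Map<String, String> canonical(List<Map<String, String>> packets) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            merged.putAll(packet); // later packets overwrite earlier ones
        }
        return merged;
    }

    /** Exhaustive view: for each property, every value seen, in packet order. */
    public static Map<String, List<String>> allValues(List<Map<String, String>> packets) {
        Map<String, List<String>> all = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            for (Map.Entry<String, String> e : packet.entrySet()) {
                all.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return all;
    }
}
```

For the conflicting-authorship file mentioned above, canonical() would report only the second packet's author, while allValues() would keep both.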
> 
> 
> 
> 
> -----Original Message-----
> From: Ray Gauss [mailto:ray.gauss@alfresco.com]
> Sent: Tuesday, March 08, 2016 2:34 PM
> To: dev@tika.apache.org
> Subject: Re: [DISCUSS] options for XMP parsing?
> 
> To clarify... the 'we' in my third sentence was referring to Alfresco, not Tika.
> 
> I'm not sure how much of that code would be useful but I may be able to contribute some of it.
> 
> Regards,
> 
> Ray
> 
> 
>> On Mar 8, 2016, at 2:07 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
>> 
>> Thank you.  Will take a look.
>> 
>> -----Original Message-----
>> From: Ray Gauss [mailto:ray.gauss@alfresco.com]
>> Sent: Tuesday, March 08, 2016 1:55 PM
>> To: dev@tika.apache.org
>> Subject: Re: [DISCUSS] options for XMP parsing?
>> 
>> Hi Tim,
>> 
>> We're already using Adobe's xmpcore in tika-xmp, which works fine for parsing XMP (though it has not seen updates in a while), but getting the XMP packets out of the files is trickier.
>> 
>> We have XMPPacketScanner which works for many cases, but not all.  InDesign files for example do some strange things.
>> 
>> In the past we've used different packet scanners depending on the file type (including the ExifTool command line) to get the XMP out, then used xmpcore to parse it into simple flattened properties.
>> 
>> Regards,
>> 
>> Ray
>> 
>> 


Re: [DISCUSS] options for XMP parsing?

Posted by Ray Gauss <ra...@alfresco.com>.
Hi Tim,

Consolidated handling of XMP would be great; I'm glad you're taking a look at it, and I'll try to help out where I can.

> You've been happy with it at Alfresco? 

It's been a while since I looked at it but I don't recall any difficulties.

> I'd be interested to hear more about what happens with InDesign files.

It stores things in 'pages' [1].

Regards,

Ray


[1] http://stackoverflow.com/a/22661992
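
On the idea raised in this thread of pulling every XMP packet out of a file rather than just the first, here is a minimal byte-level sketch. It looks for the standard xpacket header and trailer processing instructions and returns each complete packet. It only handles packets whose wrapper is in a single-byte-compatible encoding (UTF-8 or ASCII), so UTF-16 packets would be missed. SimpleXmpPacketScanner is a hypothetical name; it is not Tika's XMPPacketScanner and makes no claim to match its behaviour.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical scanner for illustration only; UTF-16 packets are not handled.
public class SimpleXmpPacketScanner {

    private static final String BEGIN = "<?xpacket begin=";
    private static final String END = "<?xpacket end=";

    /** Returns the raw bytes of every complete xpacket found, in file order. */
    public static List<byte[]> findPackets(byte[] data) {
        // ISO-8859-1 maps each byte to one char, so string offsets equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        List<byte[]> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(BEGIN, from);
            if (start < 0) {
                break; // no further packet header
            }
            int endMarker = text.indexOf(END, start);
            if (endMarker < 0) {
                break; // header without trailer: truncated packet
            }
            int close = text.indexOf("?>", endMarker);
            if (close < 0) {
                break; // trailer never closed
            }
            int stop = close + 2;
            packets.add(Arrays.copyOfRange(data, start, stop));
            from = stop;
        }
        return packets;
    }
}
```

Each returned byte array can then be handed to whichever XMP parser is in use.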




RE: [DISCUSS] options for XMP parsing?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Hi Ray,
  Got it.  Thank you.

That'd be great.  In follow-up discussion, the PDFBox devs mentioned that XMPBox's trouble with non-PDF/A files is not a design restriction...it is only a matter of patching and building out their current code base.  The downside is that there's quite a bit to do; the upside is that it is a living code base.

I'll experiment with Adobe's xmp-core.  If you have any pointers/examples, let me know...I'll be starting with: https://indisnip.wordpress.com/2010/08/17/extract-metadata-with-adobe-xmp-part-2/. You've been happy with it at Alfresco? 

No matter which package we use, it would be nice to build out uniform extraction of XMP for all image and PDF files for the common elements -- with special handling by file type if necessary.  As you mentioned, it would also be great to add or modify our XMPScanner to extract all XMP packets from a file...I've started dabbling with this here: https://github.com/tballison/tika/tree/xmp_scanner .  I'd be interested to hear more about what happens with InDesign files. In our own test set, we have a PDF file with two packets containing conflicting authorship info IIRC! :)  It would be nice to expose both the canonical XMP info (with proper processing of "later-xmp-overrides-earlier") as well as all of the info that can be scraped from the XMP (packet1: authorXYZ packet2: authorQRS)...two different use cases.
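
A minimal sketch of those two use cases, assuming each packet has already been parsed into a flat key/value map. The class and method names are hypothetical and are not existing Tika code; "dc:creator" is used only as an illustrative key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class XmpPacketMerge {

    /** Canonical view: a later packet overrides an earlier one for the same key. */
    static Map<String, String> canonical(List<Map<String, String>> packets) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            merged.putAll(packet); // later packets replace earlier values
        }
        return merged;
    }

    /** Exhaustive view: every distinct value seen for a key, in packet order. */
    static Map<String, List<String>> allValues(List<Map<String, String>> packets) {
        Map<String, List<String>> all = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            for (Map.Entry<String, String> e : packet.entrySet()) {
                List<String> values = all.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
                if (!values.contains(e.getValue())) {
                    values.add(e.getValue());
                }
            }
        }
        return all;
    }

    public static void main(String[] args) {
        // Two packets from one file with conflicting authorship info.
        List<Map<String, String>> packets = List.of(
                Map.of("dc:creator", "authorXYZ"),
                Map.of("dc:creator", "authorQRS"));
        System.out.println(canonical(packets));
        System.out.println(allValues(packets));
    }
}
```

Keeping the two views as separate methods means a caller who only wants the canonical metadata never sees the conflicting values, while a caller scraping everything still gets both authors.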

Thank you, again.

             Cheers,

                   Tim 






Re: [DISCUSS] options for XMP parsing?

Posted by Ray Gauss <ra...@alfresco.com>.
To clarify... the 'we' in my third sentence was referring to Alfresco, not Tika.

I'm not sure how much of that code would be useful but I may be able to contribute some of it.

Regards,

Ray




RE: [DISCUSS] options for XMP parsing?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you.  Will take a look.



Re: [DISCUSS] options for XMP parsing?

Posted by Ray Gauss <ra...@alfresco.com>.
Hi Tim,

We're already using Adobe's xmpcore in tika-xmp, which works fine for parsing XMP (though it has not seen updates in a while), but getting the XMP packets out of the files is trickier.

We have XMPPacketScanner, which works for many cases but not all.  InDesign files, for example, do some strange things.
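
As a rough illustration of what a packet scanner does, here is a minimal sketch that searches raw bytes for the XMP packet wrapper (the `<?xpacket begin=...?>` header and `<?xpacket end=...?>` trailer). The class and method names are hypothetical and this is not Tika's XMPPacketScanner. It assumes the packet is stored uncompressed in a single-byte-compatible encoding such as UTF-8, so it would miss UTF-16 packets and anything inside a compressed stream, which is one reason a naive scan "works for many cases, but not all."

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a byte-level XMP packet finder (not Tika's scanner). */
final class NaiveXmpPacketFinder {

    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PI_CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    /** Returns every byte range that looks like a complete XMP packet. */
    static List<byte[]> findPackets(byte[] data) {
        List<byte[]> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = indexOf(data, BEGIN, from);
            if (start < 0) break;
            int trailer = indexOf(data, END, start + BEGIN.length);
            if (trailer < 0) break;              // header without trailer: give up
            int close = indexOf(data, PI_CLOSE, trailer + END.length);
            if (close < 0) break;                // trailer is truncated
            int stop = close + PI_CLOSE.length;
            packets.add(Arrays.copyOfRange(data, start, stop));
            from = stop;                         // keep looking for further packets
        }
        return packets;
    }

    /** Plain linear search for needle in haystack, starting at index from. */
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```

Each returned byte array could then be handed to whatever XMP parser is in use.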

In the past we've used different packet scanners depending on the file type (including the ExifTool command line) to get the XMP out, then used xmpcore to parse it into simple flattened properties.
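
To make "simple flattened properties" concrete, here is a minimal sketch that flattens an XMP packet into a map of property name to values. It deliberately uses the JDK's DOM parser as a stand-in and is not xmpcore's API; the class and method names are hypothetical. It collects `rdf:li` items without checking whether the container is a Seq, Bag, or Alt, and it ignores unknown namespaces, so it would tolerate the kind of mismatches listed in the exception table. It does not handle properties written as attributes on `rdf:Description`, and nested structures are flattened crudely.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: lenient flattening of XMP with the JDK DOM parser. */
final class LenientXmpFlattener {

    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Maps "{namespaceURI}localName" to the property's value(s). */
    static Map<String, List<String>> flatten(byte[] packet) {
        Document doc;
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // Refuse DOCTYPEs so untrusted packets cannot pull in external entities.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(packet));
        } catch (Exception e) {
            throw new IllegalArgumentException("XMP packet is not well-formed XML", e);
        }
        Map<String, List<String>> props = new LinkedHashMap<>();
        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            for (Node n = descriptions.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element) {
                    Element prop = (Element) n;
                    String key = "{" + prop.getNamespaceURI() + "}" + prop.getLocalName();
                    props.computeIfAbsent(key, k -> new ArrayList<>()).addAll(values(prop));
                }
            }
        }
        return props;
    }

    /** Array items if any rdf:li is present (Seq, Bag, or Alt alike), else the text. */
    private static List<String> values(Element prop) {
        List<String> out = new ArrayList<>();
        NodeList items = prop.getElementsByTagNameNS(RDF_NS, "li");
        if (items.getLength() == 0) {
            String text = prop.getTextContent().trim();
            if (!text.isEmpty()) out.add(text);
        } else {
            for (int i = 0; i < items.getLength(); i++) {
                out.add(items.item(i).getTextContent().trim());
            }
        }
        return out;
    }
}
```

The trade-off is that leniency of this kind discards type information (ordered vs. unordered vs. alternative arrays) that a strict PDF/A validator needs but a metadata extractor usually does not.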

Regards,

Ray

