Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/03/07 19:31:01 UTC

roadmap for XMPBox?

All,

  When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox.  I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.

  I’m including a table below of the counts of exception messages.  Are there any plans to make XMPBox more lenient or is this what we can expect going forward?

  As always, I’m more than happy to help with files and tests.  Let me know what I can do.

             Cheers,

                      Tim

No XmpParsingException on 42,022 files.

Exceptions:

Count  Exception message
-----  -----------------
13403  Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/
 3710  Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
 3113  Missing pdfaSchema:property in type definition
 2867  Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
  927  Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]
  723  Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]
  710  Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private
  654  Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]
  522  Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
  492  Failed to parse
  370  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]
  262  Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/
  188  Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/
  144  Failed to instanciate property in xmp:CreateDate
  125  Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#
   94  Expecting local name 'xmpmeta' and found 'xapmeta'
   84  Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0
   74  Failed to instanciate property in xap:CreateDate
   68  Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]
   49  Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]
   46  Cannot find a definition for the namespace http://www.sap.com
   33  Failed to instanciate property in exif:ColorSpace
   28  Failed to instanciate property in xmpMM:History
   26  xmp should start with a processing instruction
   24  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/
   21  Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/
   14  Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/
   14  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]
   12  Failed to instanciate property in xmp:MetadataDate
   10  Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/
   10  Failed to instanciate property in xap:ModifyDate
   10  Failed to instanciate property in xmp:ModifyDate
    9  Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
    8  Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]
    8  Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#
    7  Cannot find a definition for the namespace http://www.day.com/dam/1.0
    7  Cannot find a definition for the namespace ptc
    6  Failed to instanciate property in xapMM:History
    5  Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]
    5  Schema is not set in this document : http://purl.org/dc/elements/1.1/
    4  Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/
    4  Excepted xpacket 'end' attribute (must be present and placed in first)
    3  Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]
    3  Schema is not set in this document : http://ns.adobe.com/xap/1.0/
    2  no message (NPE)
    2  Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
    2  Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/
    2  Failed to instanciate property in xapRights:Marked
    2  Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]
    2  Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]
    2  Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]
    1  Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/
    1  Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/
    1  Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/
    1  Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/
    1  Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/
    1  Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html
    1  Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/
    1  Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/
    1  Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/
    1  Failed to instanciate property in xmpRights:Marked
    1  Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]
    1  This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#
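
For anyone who wants to reproduce counts of this kind, here is a minimal sketch of how such a tally could be collected. It is not the code used for the numbers above. The class `ExceptionTally` and the interface `ParseStep` are hypothetical names, and the actual parser call (for XMPBox, presumably its DOM-based parser) is left as a pluggable step so the sketch has no dependency beyond the JDK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts exception messages thrown by a parse step. */
final class ExceptionTally {

    /** Stand-in for the real parser call; an assumption, not an XMPBox type. */
    interface ParseStep<T> {
        void parse(T input) throws Exception;
    }

    private final Map<String, Integer> counts = new HashMap<>();
    private int succeeded = 0;

    /** Runs the parse step on every input and records the outcome. */
    <T> void run(Iterable<T> inputs, ParseStep<T> step) {
        for (T input : inputs) {
            try {
                step.parse(input);
                succeeded++;
            } catch (Exception e) {
                // Exceptions without a message are grouped by their class name.
                String message = e.getMessage() == null
                        ? "no message (" + e.getClass().getSimpleName() + ")"
                        : e.getMessage();
                counts.merge(message, 1, Integer::sum);
            }
        }
    }

    int succeeded() {
        return succeeded;
    }

    int countFor(String message) {
        return counts.getOrDefault(message, 0);
    }

    /** Entries ordered by descending count, ties broken by message text. */
    List<Map.Entry<String, Integer>> sorted() {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }
}
```

In use, the inputs would be the extracted XMP packets and the parse step a lambda that hands each packet to the parser under test; `sorted()` then gives the rows of a table like the one above.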




RE: roadmap for XMPBox?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Beat.  Yes, as one of our devs pointed out, we're already using that in Tika, in our XMP module for writing XMP. We haven't looked into using it for extraction.

-----Original Message-----
From: Beat Weisskopf [mailto:weisskopf@glue.ch] 
Sent: Friday, March 11, 2016 3:40 AM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?

Hi all

As a third option: What about the BSD-licensed Adobe XMP Toolkit? At least veraPDF seems to use a fork of it: https://github.com/veraPDF/veraPDF-xmp

Cheers, beat




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: roadmap for XMPBox?

Posted by Beat Weisskopf <we...@glue.ch>.
Hi all

As a third option: What about the BSD-licensed Adobe XMP Toolkit? At
least veraPDF seems to use a fork of it: https://github.com/veraPDF/veraPDF-xmp

Cheers, beat




Re: roadmap for XMPBox?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 08.03.2016 at 19:30, Allison, Timothy B. wrote:
>> The comment I made is just my personal opinion. ... Maybe improve XMPBox as you suggested (I did have a look but it doesn't seem easy).
>
> Oh, ok, so it isn't necessarily set in stone.
IMHO, no, we are always open to proposals :-)

> What do other PDFBox devs think?  Is there interest in modifying XmpBox to be more lenient?  Not for 2.0.0, obviously... :)
We should try to improve XmpBox as long as it is reasonable. XmpBox should not 
be limited to PDF/A. But we need proper documentation for the missing namespaces.

BR
Andreas

>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, March 08, 2016 12:56 PM
> To: dev@pdfbox.apache.org
> Subject: Re: roadmap for XMPBox?
>
> On 08.03.2016 at 18:44, Allison, Timothy B. wrote:
>> Got it.  Thank you.  I wanted to confirm that nothing had changed since last summer (PDFBOX-2855).
>>
>> Are you taking bug reports for jempbox or is that entirely eol'd?
>
> Yes, I recently fixed a bug there.
>
>> Any recommendations for a somewhat lenient, Apache license-compatible XMP parser?
>
> Sorry, don't know.
>
>> Might it make sense to include in the README or in the package
>> javadocs something about the goals for XmpBox?  It is entirely
>> possible that I missed the warning. ;)
>
> The comment I made is just my personal opinion. It's your comment that made me realize that with XMPBox, we can't parse some files that are not PDF/A compatible but are correct XMP files. I don't have an idea what to do. Maybe improve XMPBox as you suggested (I did have a look but it doesn't seem easy). Maybe resurrect Jempbox, or use the 1.8 version.
>
> Tilman
>
>
>>
>> Thank you, again.
>>
>>           Best,
>>
>>                     Tim
>>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Tuesday, March 08, 2016 12:13 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: roadmap for XMPBox?
>>
>> I think the problem is that XmpBox was written for PDF/A checking, so it fails with XMPs that are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
>> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
>>
>> And no, there are no plans for anything on XMP at this time...
>>
>> Tilman
>>
>>


RE: roadmap for XMPBox?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> The comment I made is just my personal opinion. ... Maybe improve XMPBox as you suggested (I did have a look but it doesn't seem easy).

Oh, ok, so it isn't necessarily set in stone.

What do other PDFBox devs think?  Is there interest in modifying XmpBox to be more lenient?  Not for 2.0.0, obviously... :)

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Tuesday, March 08, 2016 12:56 PM
To: dev@pdfbox.apache.org
Subject: Re: roadmap for XMPBox?

On 08.03.2016 at 18:44, Allison, Timothy B. wrote:
> Got it.  Thank you.  I wanted to confirm that nothing had changed since last summer (PDFBOX-2855).
>
> Are you taking bug reports for jempbox or is that entirely eol'd?

Yes, I recently fixed a bug there.

> Any recommendations for a somewhat lenient, Apache license-compatible XMP parser?

Sorry, don't know.

> Might it make sense to include in the README or in the package 
> javadocs something about the goals for XmpBox?  It is entirely 
> possible that I missed the warning. ;)

The comment I made is just my personal opinion. It's your comment that made me realize that with XMPBox, we can't parse some files that are not PDF/A compatible but are correct XMP files. I don't have an idea what to do. Maybe improve XMPBox as you suggested (I did have a look but it doesn't seem easy). Maybe resurrect Jempbox, or use the 1.8 version.
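
To illustrate the point that such packets can be correct XML even when they use a namespace outside the PDF/A set, here is a minimal sketch that lists the property namespaces in an XMP packet using only the JDK's namespace-aware DOM parser. It is not how XMPBox or Jempbox work internally. The class name `XmpNamespaceLister` is hypothetical, and the sketch only looks at element-form properties directly under `rdf:Description` (attribute-form properties are ignored).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper: lists namespaces of properties under rdf:Description. */
final class XmpNamespaceLister {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    static Set<String> propertyNamespaces(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getNamespaceURI()
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

            Set<String> namespaces = new LinkedHashSet<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                NodeList children = descriptions.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    // No schema lookup: any namespace is accepted as-is.
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && child.getNamespaceURI() != null) {
                        namespaces.add(child.getNamespaceURI());
                    }
                }
            }
            return namespaces;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}
```

A packet declaring http://ns.adobe.com/pdfx/1.3/ passes through this without complaint, which is the leniency being asked for; the price is that nothing is validated or typed.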

Tilman


>
> Thank you, again.
>
>          Best,
>
>                    Tim
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, March 08, 2016 12:13 PM
> To: dev@pdfbox.apache.org
> Subject: Re: roadmap for XMPBox?
>
> I think the problem is that XmpBox was written for PDF/A checking, so it fails with XMPs that are not PDF/A. For example, file 000142.pdf has the schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
> http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_p
> roperties_in_pdfa-1_2008-03-20.pdf
>
> And no, there are no plans for anything on XMP at this time...
>
> Tilman
>
>
> Am 07.03.2016 um 19:31 schrieb Allison, Timothy B.:
>> All,
>>
>>
>>
>>     When we migrate to PDFBox 2.x  over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox.  I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.
>>
>>
>>
>>     I’m including a table below of the counts of exception messages.  Are there any plans to make XMPBox more lenient or is this what we can expect going forward?
>>
>>
>>
>>     As always, I’m more than happy to help with files and tests.  Let me know what I can do.
>>
>>
>>
>>                Cheers,
>>
>>
>>
>>                         Tim
>>
>>
>>
>> No XmpParsingException on 42,022 files.
>>
>>
>>
>>
>>
>>
>>
>> Exceptions:
>>
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/pdfx/1.3/
>>
>> 13403
>>
>> Type 'originalDocumentID' not defined in 
>> http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>
>> 3710
>>
>> Missing pdfaSchema:property in type definition
>>
>> 3113
>>
>> Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
>>
>> 2867
>>
>> Invalid array type, expecting Seq and found Bag [prefix=dc; 
>> name=creator]
>>
>> 927
>>
>> Invalid array type, expecting Alt and found Seq [prefix=dc; 
>> name=description]
>>
>> 723
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/xmp/InDesign/private
>>
>> 710
>>
>> Invalid array type, expecting Bag and found Seq [prefix=dc; 
>> name=subject]
>>
>> 654
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/
>>
>> 522
>>
>> Failed to parse
>>
>> 492
>>
>> Invalid array definition, expecting Seq and found 
>> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
>> name=date]
>>
>> 370
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/illustrator/1.0/
>>
>> 262
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/xfa/promoted-desc/
>>
>> 188
>>
>> Failed to instanciate property in xmp:CreateDate
>>
>> 144
>>
>> Schema is not set in this document :
>> http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>
>> 125
>>
>> Expecting local name 'xmpmeta' and found 'xapmeta'
>>
>> 94
>>
>> Cannot find a definition for the namespace
>> http://www.rwjf.org/rwjf/1.0
>>
>> 84
>>
>> Failed to instanciate property in xap:CreateDate
>>
>> 74
>>
>> Invalid array definition, expecting Bag and found 
>> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
>> name=language]
>>
>> 68
>>
>> Invalid array definition, expecting Alt and found 
>> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
>> name=title]
>>
>> 49
>>
>> Cannot find a definition for the namespace http://www.sap.com
>>
>> 46
>>
>> Failed to instanciate property in exif:ColorSpace
>>
>> 33
>>
>> Failed to instanciate property in xmpMM:History
>>
>> 28
>>
>> xmp should start with a processing instruction
>>
>> 26
>>
>> Cannot find a definition for the namespace 
>> http://prismstandard.org/namespaces/basic/2.0/
>>
>> 24
>>
>> Cannot find a definition for the namespace 
>> http://www.npes.org/pdfx/ns/id/
>>
>> 21
>>
>> Cannot find a definition for the namespace 
>> http://ns.InsiderSoftware.com/fontlist/1.0/
>>
>> 14
>>
>> Invalid array definition, expecting Seq and found 
>> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; 
>> name=creator]
>>
>> 14
>>
>> Failed to instanciate property in xmp:MetadataDate
>>
>> 12
>>
>> Cannot find a definition for the namespace 
>> http://ns.xinet.com/webnative/private/1.0/
>>
>> 10
>>
>> Failed to instanciate property in xap:ModifyDate
>>
>> 10
>>
>> Failed to instanciate property in xmp:ModifyDate
>>
>> 10
>>
>> Type 'params' not defined in
>> http://ns.adobe.com/xap/1.0/sType/ResourceEvent#
>>
>> 9
>>
>> Invalid array type, expecting Seq and found Bag [prefix=xmpMM; 
>> name=History]
>>
>> 8
>>
>> Type 'documentName' not defined in
>> http://ns.adobe.com/xap/1.0/sType/ResourceRef#
>>
>> 8
>>
>> Cannot find a definition for the namespace http://www.day.com/dam/1.0
>>
>> 7
>>
>> Cannot find a definition for the namespace ptc
>>
>> 7
>>
>> Failed to instanciate property in xapMM:History
>>
>> 6
>>
>> Invalid array definition, expecting Seq and found 
>> com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; 
>> name=YCbCrPositioning]
>>
>> 5
>>
>> Schema is not set in this document : http://purl.org/dc/elements/1.1/
>>
>> 5
>>
>> Cannot find a definition for the namespace 
>> http://www.extensis.com/meta/FontSense/
>>
>> 4
>>
>> Excepted xpacket 'end' attribute (must be present and placed in 
>> first)
>>
>> 4
>>
>> Invalid array type, expecting Seq and found Bag [prefix=photoshop; 
>> name=TextLayers]
>>
>> 3
>>
>> Schema is not set in this document : http://ns.adobe.com/xap/1.0/
>>
>> 3
>>
>> no message (NPE)
>>
>> 2
>>
>> Cannot find a definition for the namespace 
>> http://laserfiche.com/xmp/schema/1.0/
>>
>> 2
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/
>>
>> 2
>>
>> Cannot find a definition for the namespace 
>> http://ns.adobe.com/camera-raw-settings/1.0/
>>
>> 2
>>
>> Failed to instanciate property in xapRights:Marked
>>
>> 2
>>
>> Invalid array type, expecting Alt and found Bag [prefix=dc; 
>> name=title]
>>
>> 2
>>
>> Invalid array type, expecting Alt and found Seq [prefix=dc; 
>> name=title]
>>
>> 2
>>
>> Invalid array type, expecting Seq and found Alt [prefix=dc; 
>> name=creator]
>>
>> 2
>>
>> Cannot find a definition for the namespace 
>> http://ns.cambridgeassociates.com/status/1.0/
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://ns.computershare.com.au/ccs/1.0/
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://ns.esko-graphics.com/grinfo/1.0/
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://ns.tripletriangle.com/ns/tripletri/
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://prismstandard.org/namespaces/basic/2.1/
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://www.aiim.org/pdfa/ns/id.html
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://www.aiim.org/pdfe/ns/id/
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://www.enfocus.com/ns/CertifiedPDF/2.0/
>>
>> 1
>>
>> Cannot find a definition for the namespace 
>> http://www.northplains.com/xmpnps/cov/1.0/
>>
>> 1
>>
>> Failed to instanciate property in xmpRights:Marked
>>
>> 1
>>
>> Invalid array type, expecting Seq and found Bag [prefix=dc; 
>> name=date]
>>
>> 1
>>
>> This namespace is not a schema or a structured type :
>> http://ns.adobe.com/xap/1.0/sType/Job#
>>
>> 1
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For 
> additional commands, e-mail: dev-help@pdfbox.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: roadmap for XMPBox?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 08.03.2016 at 18:44, Allison, Timothy B. wrote:
> Got it.  Thank you.  I wanted to confirm that nothing had changed since last summer (PDFBOX-2855).
>
> Are you taking bug reports for jempbox or is that entirely eol'd?

Yes, I recently fixed a bug there.

> Any recommendations for a somewhat lenient, Apache license-compatible XMP parser?

Sorry, don't know.

> Might it make sense to include in the README or in the package javadocs something about the goals for XmpBox?  It is entirely possible that I missed the warning. ;)

The comment I made is just my personal opinion. It was your comment that
made me realize that XMPBox can't parse some files that are correct XMP
but not PDF/A compatible. I'm not sure what to do. Maybe improve XMPBox
as you suggested (I did have a look, but it doesn't seem easy). Maybe
resurrect Jempbox, or use the 1.8 version.

Tilman




RE: roadmap for XMPBox?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Got it.  Thank you.  I wanted to confirm that nothing had changed since last summer (PDFBOX-2855).  

Are you taking bug reports for jempbox or is that entirely eol'd?  

Any recommendations for a somewhat lenient, Apache license-compatible XMP parser?

Might it make sense to include in the README or in the package javadocs something about the goals for XmpBox?  It is entirely possible that I missed the warning. ;)

Thank you, again.

        Best,

                  Tim



Re: roadmap for XMPBox?

Posted by Tilman Hausherr <TH...@t-online.de>.
I think the problem is that XmpBox was written for PDF/A checking, so it 
fails with XMPs that are not PDF/A. For example, file 000142.pdf has the 
schema http://ns.adobe.com/pdfx/1.3/ which is not allowed for PDF/A:
http://www.pdfa.org/wp-content/uploads/2011/08/tn0008_predefined_xmp_properties_in_pdfa-1_2008-03-20.pdf
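
A minimal sketch of the kind of check described above: it lists the namespaces of the properties in an XMP packet and reports those that are not on a known list. It uses only the JDK's DOM parser and does not call XmpBox. The class name, the method name, and the three-entry KNOWN list are assumptions made for illustration; the list is a small subset, not the full set of PDF/A-1 predefined schemas.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of XmpBox or PDFBox.
final class XmpNamespaceCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Illustrative allow-list only (an assumption, NOT the full PDF/A-1 list).
    static final Set<String> KNOWN = Set.of(
            "http://purl.org/dc/elements/1.1/",
            "http://ns.adobe.com/xap/1.0/",
            "http://ns.adobe.com/pdf/1.3/");

    /**
     * Returns the namespaces of element-form properties (direct children of
     * any rdf:Description) that are not in KNOWN. Properties written as XML
     * attributes on rdf:Description are ignored in this sketch.
     */
    static Set<String> unknownNamespaces(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
            Set<String> unknown = new TreeSet<>();
            NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                NodeList children = descriptions.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    String ns = child.getNamespaceURI();
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && ns != null && !KNOWN.contains(ns)) {
                        unknown.add(ns);
                    }
                }
            }
            return unknown;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Packet is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description rdf:about=\"\""
                + " xmlns:pdfx=\"http://ns.adobe.com/pdfx/1.3/\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<pdfx:Company>Example</pdfx:Company>"
                + "<dc:format>application/pdf</dc:format>"
                + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        // The pdfx namespace is reported; dc is on the illustrative list.
        System.out.println(unknownNamespaces(xmp));
    }
}
```

A lenient parser could log such namespaces and carry on, whereas a validator would treat them as errors. Which of the two behaviours is wanted is the open question in this thread.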

And no, there are no plans for anything on XMP at this time...

Tilman




RE: roadmap for XMPBox?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
XLSX summary and 89MB of XMPs available here: 

http://162.242.228.174/xmp_work/ 
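
For anyone who wants to produce a similar summary from those files, here is a minimal sketch of tallying exception messages per packet. It is not the code used for the XLSX summary. The XmpParser interface is a hypothetical stand-in for whichever parser is being evaluated (for example, a thin wrapper around XmpBox's parser), and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for counting parser failures by exception message.
final class ExceptionTally {
    /** Stand-in for the parser under test; not an XmpBox type. */
    interface XmpParser {
        void parse(String xmp) throws Exception;
    }

    static final String OK = "No exception";

    /** Runs the parser on every packet and counts outcomes by message. */
    static Map<String, Integer> tally(List<String> packets, XmpParser parser) {
        Map<String, Integer> counts = new HashMap<>();
        for (String packet : packets) {
            String key = OK;
            try {
                parser.parse(packet);
            } catch (Exception e) {
                // Exceptions without a message are keyed by their class name.
                key = e.getMessage() != null
                        ? e.getMessage()
                        : "no message (" + e.getClass().getSimpleName() + ")";
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the rows sorted by count, largest first. */
    static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return rows;
    }

    public static void main(String[] args) {
        // Toy parser for demonstration only: rejects packets without an xpacket header.
        XmpParser toyParser = xmp -> {
            if (!xmp.startsWith("<?xpacket")) {
                throw new IllegalArgumentException("toy parser: missing xpacket header");
            }
        };
        List<String> packets = List.of("<?xpacket begin?>", "<x:xmpmeta/>", "<rdf:RDF/>");
        for (Map.Entry<String, Integer> row : byCountDescending(tally(packets, toyParser))) {
            System.out.println(row.getValue() + "\t" + row.getKey());
        }
    }
}
```

Keying on the raw message keeps the sketch short. Messages that embed a namespace or property name end up in separate rows, which can be useful when the goal is to see which schemas cause the most failures.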

-----Original Message-----
From: Allison, Timothy B. [mailto:tallison@mitre.org] 
Sent: Monday, March 07, 2016 1:31 PM
To: dev@pdfbox.apache.org
Subject: roadmap for XMPBox?

All,

When we migrate to PDFBox 2.x over on Tika, I'd much prefer to switch from our current reliance on jempbox to XMPBox. I recently extracted ~70k XMPs from PDFs with PDFBox 2.0.0-SNAPSHOT, and when I ran XMPBox's parser, there were exceptions on roughly 40% of the XMPs.

I’m including a table below of the counts of exception messages. Are there any plans to make XMPBox more lenient or is this what we can expect going forward?

As always, I’m more than happy to help with files and tests. Let me know what I can do.

Cheers,

Tim

No XmpParsingException on 42,022 files.

Exceptions:


Cannot find a definition for the namespace http://ns.adobe.com/pdfx/1.3/

13403

Type 'originalDocumentID' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#

3710

Missing pdfaSchema:property in type definition

3113

Expecting namespace 'adobe:ns:meta/' and found 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

2867

Invalid array type, expecting Seq and found Bag [prefix=dc; name=creator]

927

Invalid array type, expecting Alt and found Seq [prefix=dc; name=description]

723

Cannot find a definition for the namespace http://ns.adobe.com/xmp/InDesign/private

710

Invalid array type, expecting Bag and found Seq [prefix=dc; name=subject]

654

Cannot find a definition for the namespace http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/

522

Failed to parse

492

Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=date]

370

Cannot find a definition for the namespace http://ns.adobe.com/illustrator/1.0/

262

Cannot find a definition for the namespace http://ns.adobe.com/xfa/promoted-desc/

188

Failed to instanciate property in xmp:CreateDate

144

Schema is not set in this document : http://www.w3.org/1999/02/22-rdf-syntax-ns#

125

Expecting local name 'xmpmeta' and found 'xapmeta'

94

Cannot find a definition for the namespace http://www.rwjf.org/rwjf/1.0

84

Failed to instanciate property in xap:CreateDate

74

Invalid array definition, expecting Bag and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=language]

68

Invalid array definition, expecting Alt and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=title]

49

Cannot find a definition for the namespace http://www.sap.com

46

Failed to instanciate property in exif:ColorSpace

33

Failed to instanciate property in xmpMM:History

28

xmp should start with a processing instruction

26

Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.0/

24

Cannot find a definition for the namespace http://www.npes.org/pdfx/ns/id/

21

Cannot find a definition for the namespace http://ns.InsiderSoftware.com/fontlist/1.0/

14

Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=dc; name=creator]

14

Failed to instanciate property in xmp:MetadataDate

12

Cannot find a definition for the namespace http://ns.xinet.com/webnative/private/1.0/

10

Failed to instanciate property in xap:ModifyDate

10

Failed to instanciate property in xmp:ModifyDate

10

Type 'params' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceEvent#

9

Invalid array type, expecting Seq and found Bag [prefix=xmpMM; name=History]

8

Type 'documentName' not defined in http://ns.adobe.com/xap/1.0/sType/ResourceRef#

8

Cannot find a definition for the namespace http://www.day.com/dam/1.0

7

Cannot find a definition for the namespace ptc

7

Failed to instanciate property in xapMM:History

6

Invalid array definition, expecting Seq and found com.sun.org.apache.xerces.internal.dom.DeferredTextImpl [prefix=tiff; name=YCbCrPositioning]

5

Schema is not set in this document : http://purl.org/dc/elements/1.1/

5

Cannot find a definition for the namespace http://www.extensis.com/meta/FontSense/

4

Excepted xpacket 'end' attribute (must be present and placed in first)

4

Invalid array type, expecting Seq and found Bag [prefix=photoshop; name=TextLayers]

3

Schema is not set in this document : http://ns.adobe.com/xap/1.0/

3

no message (NPE)

2

Cannot find a definition for the namespace http://laserfiche.com/xmp/schema/1.0/

2

Cannot find a definition for the namespace http://ns.adobe.com/AdobeFormsCentralWorkflow/1.0/

2

Cannot find a definition for the namespace http://ns.adobe.com/camera-raw-settings/1.0/

2

Failed to instanciate property in xapRights:Marked

2

Invalid array type, expecting Alt and found Bag [prefix=dc; name=title]

2

Invalid array type, expecting Alt and found Seq [prefix=dc; name=title]

2

Invalid array type, expecting Seq and found Alt [prefix=dc; name=creator]

2

Cannot find a definition for the namespace http://ns.cambridgeassociates.com/status/1.0/

1

Cannot find a definition for the namespace http://ns.computershare.com.au/ccs/1.0/

1

Cannot find a definition for the namespace http://ns.esko-graphics.com/grinfo/1.0/

1

Cannot find a definition for the namespace http://ns.tripletriangle.com/ns/tripletri/

1

Cannot find a definition for the namespace http://prismstandard.org/namespaces/basic/2.1/

1

Cannot find a definition for the namespace http://www.aiim.org/pdfa/ns/id.html

1

Cannot find a definition for the namespace http://www.aiim.org/pdfe/ns/id/

1

Cannot find a definition for the namespace http://www.enfocus.com/ns/CertifiedPDF/2.0/

1

Cannot find a definition for the namespace http://www.northplains.com/xmpnps/cov/1.0/

1

Failed to instanciate property in xmpRights:Marked

1

Invalid array type, expecting Seq and found Bag [prefix=dc; name=date]

1

This namespace is not a schema or a structured type : http://ns.adobe.com/xap/1.0/sType/Job#

1