Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2013/08/02 14:52:26 UTC

PDF/A & xmp

As a member of the PDF Association (pdfa.org), I can share a discussion there about some edge cases in XML that we interpret differently from Acrobat and BFO when doing PDF/A validation:

<snip>
In this case we have a PDF with an XMP metadata stream containing two <rdf:RDF> entries, one with rdf:about set to a blank string, the other with it set to a UUID. The PDF/A specification (ISO-19005-1:2005(E) para 6.7.2) simply says that the stream must conform to the "XMP specification 2004 revision" which reads (p21):

The rdf:about attribute on the rdf:Description element is a required attribute that identifies the resource whose metadata this XMP describes. The value of this attribute must follow URI syntax and may be either:

●  an empty string (as in the example above), which means that the XMP is physically local to the resource being described. Applications must rely on knowledge of the file format to correctly associate the XMP with the resource.

●  a unique instance ID that is generated every time a file is saved. The next section gives guidelines for creating instance IDs.

The XMP packet must describe a single entity, and my reading of the above is a combination of empty-string and a unique UUID can meet this requirement - this is how both our software and Acrobat X and XI behave. However it's ambiguous, and this clause was revised in the 2012 revision (ISO 16684-1:2011(E) para 7.4) to this:

If the XMP data model has an AboutURI (6.1, “XMP packets”), that same URI shall be the value of an rdf:about attribute in each top-level rdf:Description element. Otherwise, the rdf:about attributes for all top-level rdf:Description elements shall be present with an empty value. The rdf:about attribute shall not be used in more deeply nested rdf:Description elements.
For compatibility with very early XMP usage, it is recommended that XMP readers tolerate a missing rdf:about attribute and treat it as present with an empty value. It is also recommended that XMP readers tolerate a mix of empty and non-empty rdf:about values, as long as all non-empty values are identical.

Which means that an empty string and a unique UUID are technically incorrect, but it's recommended they be tolerated for compatibility purposes.

I concede this is a very fine hair to split, but if you're writing software to validate or create PDF/A you have to make a decision one way or another. BFO and Acrobat X and XI think this is valid; PDFBox and the pdf-tools.com online validator lean the other way and classify this document as invalid. The end result is a document which might be PDF/A compliant, but no-one is really sure (and if anyone can give me a definitive answer please do - email me off-list if you want a copy of the document, I will need to get permission from my customer to forward it).
</snip>
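
To make the two readings concrete, here is a minimal sketch in plain Java (JDK XML parser only) that classifies the rdf:about values of the top-level rdf:Description elements. It is not how preflight or xmpbox implement the check. The class name, the Verdict enum and the packet() helper are assumptions made up for this illustration. A missing rdf:about is treated as empty, following the tolerance recommended in the 2012 text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of PDFBox.
class RdfAboutCheck {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // CONSISTENT: all values identical (all empty, or all the same URI).
    // TOLERATED_MIX: empty values mixed with exactly one non-empty value.
    // CONFLICTING: more than one distinct non-empty value.
    enum Verdict { CONSISTENT, TOLERATED_MIX, CONFLICTING }

    static Verdict check(String xmp) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted metadata cannot pull in entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("XMP is not well-formed XML", e);
        }

        Set<String> nonEmpty = new HashSet<>();
        boolean sawEmpty = false;
        NodeList rdfRoots = doc.getElementsByTagNameNS(RDF_NS, "RDF");
        for (int i = 0; i < rdfRoots.getLength(); i++) {
            // Only direct children of rdf:RDF count as top-level descriptions.
            for (Node n = rdfRoots.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE
                        || !RDF_NS.equals(n.getNamespaceURI())
                        || !"Description".equals(n.getLocalName())) {
                    continue;
                }
                // getAttributeNS returns "" when the attribute is absent,
                // so a missing rdf:about is handled like an empty one.
                String about = ((Element) n).getAttributeNS(RDF_NS, "about");
                if (about.isEmpty()) {
                    sawEmpty = true;
                } else {
                    nonEmpty.add(about);
                }
            }
        }
        if (nonEmpty.size() > 1) {
            return Verdict.CONFLICTING;
        }
        return (nonEmpty.size() == 1 && sawEmpty) ? Verdict.TOLERATED_MIX : Verdict.CONSISTENT;
    }

    // Builds a sample packet with one rdf:RDF block per rdf:about value.
    static String packet(String... abouts) {
        StringBuilder sb = new StringBuilder(
                "<x:xmpmeta xmlns:x='adobe:ns:meta/' xmlns:rdf='" + RDF_NS + "'>");
        for (String about : abouts) {
            sb.append("<rdf:RDF><rdf:Description rdf:about='").append(about).append("'/></rdf:RDF>");
        }
        return sb.append("</x:xmpmeta>").toString();
    }

    public static void main(String[] args) {
        // The case discussed above: one empty value and one UUID.
        System.out.println(check(packet("", "uuid:1234")));
    }
}
```

A strict validator would accept only CONSISTENT, while a tolerant one would also accept TOLERATED_MIX. Which of the two PDF/A-1 actually requires is exactly the open question.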

I can also share a sample file if anyone is interested in working on this.

BR


Maruan Sahyoun


Re: PDF/X

Posted by Hartmann Toël <To...@elanders.com>.
Thank you.

I'll see what I can do...

Best regards
Toël Hartmann

On 7 aug 2013, at 22:06, Leleu Eric <er...@gmail.com> wrote:

> Hi,
> 
> Currently the only way to check a format other than PDF/A is to replace
> a ValidationProcess in the PreflightConfiguration. To do this, you have to
> extend PreflightDocument to access the config attribute in order to
> call the replaceXXX method on the configuration object.
> 
> Each instance of ValidationProcess class is called in the validate method
> of the PreflightDocument.
> 
> 
> 
> BR,
> Eric
> 
> 
> 2013/8/6 Hartmann Toël <To...@elanders.com>
> 
>> Hi,
>> 
>> I am interested in helping to add support for such a validator, but will
>> need guidance on where to start.
>> The predefined constants in PreflightConstants do not seem to be used,
>> and the enum in Format looks to be referred to only in FilterHelper, but the
>> code seems to do the same thing anyway.
>> 
>> What would be the preferred strategy for adding such validation support?
>> 
>> 
>> On 6 aug 2013, at 12:06, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>> 
>>> Hi,
>>> 
>>> it's technically possible to add checks for PDF/X compliance, but at this point in time I don't think anyone is working on it. For an immediate solution you should use the PDF/A validation output.
>>> 
>>> Of course, if you are willing to add support for a validator …
>>> 
>>> BR
>>> Maruan Sahyoun
>>> 
>>> On 06.08.2013 at 10:26, Hartmann Toël <Toel.Hartmann@elanders.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I would like to check PDFs for PDF/X-1a:2003 compliance.
>>>> 
>>>> Would it be possible to add support for PDF/X-specific checks in PDFBox, or should I analyze the XML output from PDF/A?
>>>> 
>>>> 
>>>> 
>>>> Best regards
>>>> Toël Hartmann
>>> 
>> 
>> 


Re: PDF/X

Posted by Leleu Eric <er...@gmail.com>.
Hi,

Currently the only way to check a format other than PDF/A is to replace
a ValidationProcess in the PreflightConfiguration. To do this, you have to
extend PreflightDocument to access the config attribute in order to
call the replaceXXX method on the configuration object.

Each ValidationProcess instance is called in the validate method
of the PreflightDocument.
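
As a rough illustration of that pattern only, here is a self-contained sketch. Every type in it (Check, Config, BaseDocument, PdfxDocument) is a hypothetical stand-in and not a Preflight class, and the real method names and signatures will differ. It assumes a document reduced to a flat map of properties. The subclass swaps the named check it inherits, and validate() then runs whatever is registered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// All types below are hypothetical stand-ins, NOT the PDFBox Preflight classes.
class ReplaceProcessSketch {

    /** Stand-in for a validation process: returns error messages for a document. */
    interface Check {
        List<String> run(Map<String, String> properties);
    }

    /** Stand-in for the configuration: an ordered registry of named checks. */
    static class Config {
        private final Map<String, Check> checks = new LinkedHashMap<>();

        void add(String name, Check check) {
            checks.put(name, check);
        }

        void replace(String name, Check check) {
            if (!checks.containsKey(name)) {
                throw new IllegalArgumentException("unknown check: " + name);
            }
            checks.put(name, check); // re-putting a key keeps its original position
        }

        Iterable<Check> all() {
            return checks.values();
        }
    }

    /** Builds a check that requires a property to have an exact value. */
    static Check requires(String key, String expected) {
        return properties -> {
            List<String> errors = new ArrayList<>();
            if (!expected.equals(properties.get(key))) {
                errors.add(key + " must be " + expected);
            }
            return errors;
        };
    }

    /** Stand-in for the document: owns the config and runs every registered check. */
    static class BaseDocument {
        protected final Config config = new Config();
        private final Map<String, String> properties;

        BaseDocument(Map<String, String> properties) {
            this.properties = properties;
            config.add("identification", requires("pdfaid:part", "1"));
        }

        List<String> validate() {
            List<String> errors = new ArrayList<>();
            for (Check check : config.all()) {
                errors.addAll(check.run(properties));
            }
            return errors;
        }
    }

    /** Subclass that reaches the protected config and swaps in a PDF/X check. */
    static class PdfxDocument extends BaseDocument {
        PdfxDocument(Map<String, String> properties) {
            super(properties);
            config.replace("identification", requires("GTS_PDFXVersion", "PDF/X-1a:2003"));
        }
    }

    public static void main(String[] args) {
        Map<String, String> properties = Map.of("GTS_PDFXVersion", "PDF/X-1a:2003");
        System.out.println(new BaseDocument(properties).validate());
        System.out.println(new PdfxDocument(properties).validate());
    }
}
```

The point of the sketch is that the validation loop stays untouched and only the registry content changes, which is why subclassing the document to reach its configuration is enough.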



BR,
Eric


2013/8/6 Hartmann Toël <To...@elanders.com>

> Hi,
>
> I am interested in helping to add support for such a validator, but will
> need guidance on where to start.
> The predefined constants in PreflightConstants do not seem to be used,
> and the enum in Format looks to be referred to only in FilterHelper, but the
> code seems to do the same thing anyway.
>
> What would be the preferred strategy for adding such validation support?
>
>
> On 6 aug 2013, at 12:06, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>
> > Hi,
> >
> > it's technically possible to add checks for PDF/X compliance, but at this point in time I don't think anyone is working on it. For an immediate solution you should use the PDF/A validation output.
> >
> > Of course, if you are willing to add support for a validator …
> >
> > BR
> > Maruan Sahyoun
> >
> > On 06.08.2013 at 10:26, Hartmann Toël <Toel.Hartmann@elanders.com> wrote:
> >
> >> Hi,
> >>
> >> I would like to check PDFs for PDF/X-1a:2003 compliance.
> >>
> >> Would it be possible to add support for PDF/X-specific checks in PDFBox, or should I analyze the XML output from PDF/A?
> >>
> >>
> >>
> >> Best regards
> >> Toël Hartmann
> >
>
>

Re: PDF/X

Posted by Hartmann Toël <To...@elanders.com>.
Hi,

I am interested in helping to add support for such a validator, but will need guidance on where to start.
The predefined constants in PreflightConstants do not seem to be used, and the enum in Format looks to be referred to only in FilterHelper, but the code seems to do the same thing anyway.

What would be the preferred strategy for adding such validation support?


On 6 aug 2013, at 12:06, Maruan Sahyoun <sa...@fileaffairs.de> wrote:

> Hi,
> 
> it's technically possible to add checks for PDF/X compliance, but at this point in time I don't think anyone is working on it. For an immediate solution you should use the PDF/A validation output.
> 
> Of course, if you are willing to add support for a validator …
> 
> BR
> Maruan Sahyoun
> 
> On 06.08.2013 at 10:26, Hartmann Toël <To...@elanders.com> wrote:
> 
>> Hi,
>> 
>> I would like to check PDFs for PDF/X-1a:2003 compliance.
>> 
>> Would it be possible to add support for PDF/X-specific checks in PDFBox, or should I analyze the XML output from PDF/A?
>> 
>> 
>> 
>> Best regards
>> Toël Hartmann
> 


Re: PDF/X

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

it's technically possible to add checks for PDF/X compliance, but at this point in time I don't think anyone is working on it. For an immediate solution you should use the PDF/A validation output.

Of course, if you are willing to add support for a validator …

BR
Maruan Sahyoun

On 06.08.2013 at 10:26, Hartmann Toël <To...@elanders.com> wrote:

> Hi,
> 
> I would like to check PDFs for PDF/X-1a:2003 compliance.
> 
> Would it be possible to add support for PDF/X-specific checks in PDFBox, or should I analyze the XML output from PDF/A?
> 
> 
> 
> Best regards
> Toël Hartmann


PDF/X

Posted by Hartmann Toël <To...@elanders.com>.
Hi,

I would like to check PDFs for PDF/X-1a:2003 compliance.

Would it be possible to add support for PDF/X-specific checks in PDFBox, or should I analyze the XML output from PDF/A?



Best regards
Toël Hartmann

Re: PDF/A & xmp

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Guillaume,

let's wait until that topic is settled at pdfa.org. I'll open a ticket with a description as soon as there are comments on it.

BR

Maruan Sahyoun

On 02.08.2013 at 20:04, Guillaume Bailleul <gb...@gmail.com> wrote:

> Hi Maruan,
> 
> When we developed preflight and xmpbox we did not make any decision on that
> point. We could have considered it valid ... it is a matter of luck.
> 
> Changing that behavior will not be difficult.
> 
> IMO, we should wait until someone is really sure before changing anything.
> 
> Can you create an issue so that we don't forget this point?
> 
> KR,
> 
> Guillaume
> 
> 
> 
> 
> On Fri, Aug 2, 2013 at 2:52 PM, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> 
>> As a member of the PDF Association (pdfa.org), I can share a discussion
>> there about some edge cases in XML that we interpret differently from
>> Acrobat and BFO when doing PDF/A validation:
>> 
>> <snip>
>> In this case we have a PDF with an XMP metadata stream containing two
>> <rdf:RDF> entries, one with rdf:about set to a blank string, the other with
>> it set to a UUID. The PDF/A specification (ISO-19005-1:2005(E) para 6.7.2)
>> simply says that the stream must conform to the "XMP specification 2004
>> revision" which reads (p21):
>> 
>> The rdf:about attribute on the rdf:Description element is a required
>> attribute that identifies the resource whose metadata this XMP describes.
>> The value of this attribute must follow URI syntax and may be either:
>> 
>> ●  an empty string (as in the example above), which means that the XMP is
>> physically local to the resource being described. Applications must rely on
>> knowledge of the file format to correctly associate the XMP with the
>> resource.
>> 
>> ●  a unique instance ID that is generated every time a file is saved. The
>> next section gives guidelines for creating instance IDs.
>> 
>> The XMP packet must describe a single entity, and my reading of the above
>> is a combination of empty-string and a unique UUID can meet this
>> requirement - this is how both our software and Acrobat X and XI behave.
>> However it's ambiguous, and this clause was revised in the 2012 revision
>> (ISO 16684-1:2011(E) para 7.4) to this:
>> 
>> If the XMP data model has an AboutURI (6.1, “XMP packets”), that same URI
>> shall be the value of an rdf:about attribute in each top-level
>> rdf:Description element. Otherwise, the rdf:about attributes for all
>> top-level rdf:Description elements shall be present with an empty value. The
>> rdf:about attribute shall not be used in more deeply nested rdf:Description
>> elements.
>> For compatibility with very early XMP usage, it is recommended that XMP
>> readers tolerate a missing rdf:about attribute and treat it as present with
>> an empty value. It is also recommended that XMP readers tolerate a mix of
>> empty and non-empty rdf:about values, as long as all non-empty values are
>> identical.
>> 
>> Which means that an empty string and a unique UUID are technically
>> incorrect, but it's recommended they be tolerated for compatibility
>> purposes.
>> 
>> I concede this is a very fine hair to split, but if you're writing
>> software to validate or create PDF/A you have to make a decision one way or
>> another. BFO and Acrobat X and XI think this is valid; PDFBox and the
>> pdf-tools.com online validator lean the other way and classify this document
>> as invalid. The end result is a document which might be PDF/A compliant,
>> but no-one is really sure (and if anyone can give me a definitive answer
>> please do - email me off-list if you want a copy of the document, I will
>> need to get permission from my customer to forward it).
>> </snip>
>> 
>> I can also share a sample file if anyone is interested in working on this.
>> 
>> BR
>> 
>> 
>> Maruan Sahyoun
>> 
>> 


Re: PDF/A & xmp

Posted by Guillaume Bailleul <gb...@gmail.com>.
Hi Maruan,

When we developed preflight and xmpbox we did not make any decision on that
point. We could have considered it valid ... it is a matter of luck.

Changing that behavior will not be difficult.

IMO, we should wait until someone is really sure before changing anything.

Can you create an issue so that we don't forget this point?

KR,

Guillaume




On Fri, Aug 2, 2013 at 2:52 PM, Maruan Sahyoun <sa...@fileaffairs.de> wrote:

> As a member of the PDF Association (pdfa.org), I can share a discussion
> there about some edge cases in XML that we interpret differently from
> Acrobat and BFO when doing PDF/A validation:
>
> <snip>
> In this case we have a PDF with an XMP metadata stream containing two
> <rdf:RDF> entries, one with rdf:about set to a blank string, the other with
> it set to a UUID. The PDF/A specification (ISO-19005-1:2005(E) para 6.7.2)
> simply says that the stream must conform to the "XMP specification 2004
> revision" which reads (p21):
>
> The rdf:about attribute on the rdf:Description element is a required
> attribute that identifies the resource whose metadata this XMP describes.
> The value of this attribute must follow URI syntax and may be either:
>
> ●  an empty string (as in the example above), which means that the XMP is
> physically local to the resource being described. Applications must rely on
> knowledge of the file format to correctly associate the XMP with the
> resource.
>
> ●  a unique instance ID that is generated every time a file is saved. The
> next section gives guidelines for creating instance IDs.
>
> The XMP packet must describe a single entity, and my reading of the above
> is a combination of empty-string and a unique UUID can meet this
> requirement - this is how both our software and Acrobat X and XI behave.
> However it's ambiguous, and this clause was revised in the 2012 revision
> (ISO 16684-1:2011(E) para 7.4) to this:
>
> If the XMP data model has an AboutURI (6.1, “XMP packets”), that same URI
> shall be the value of an rdf:about attribute in each top-level
> rdf:Description element. Otherwise, the rdf:about attributes for all
> top-level rdf:Description elements shall be present with an empty value. The
> rdf:about attribute shall not be used in more deeply nested rdf:Description
> elements.
> For compatibility with very early XMP usage, it is recommended that XMP
> readers tolerate a missing rdf:about attribute and treat it as present with
> an empty value. It is also recommended that XMP readers tolerate a mix of
> empty and non-empty rdf:about values, as long as all non-empty values are
> identical.
>
> Which means that an empty string and a unique UUID are technically
> incorrect, but it's recommended they be tolerated for compatibility
> purposes.
>
> I concede this is a very fine hair to split, but if you're writing
> software to validate or create PDF/A you have to make a decision one way or
> another. BFO and Acrobat X and XI think this is valid; PDFBox and the
> pdf-tools.com online validator lean the other way and classify this document
> as invalid. The end result is a document which might be PDF/A compliant,
> but no-one is really sure (and if anyone can give me a definitive answer
> please do - email me off-list if you want a copy of the document, I will
> need to get permission from my customer to forward it).
> </snip>
>
> I can also share a sample file if anyone is interested in working on this.
>
> BR
>
>
> Maruan Sahyoun
>
>