Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2014/03/10 08:21:28 UTC

[DISCUSS] PDFBox and support for PDF versions, PDF standards

Hi,

As I’m currently looking at the parsing part of PDFBox, one question came to mind, namely more formal support for PDF versions and PDF standards such as PDF/A, PDF/UA …

As of today PDFBox has no formal support for specific PDF versions in the sense that a specific version can be enforced, validated ... The PDFBox PDF/A validation does a good job for PDF/A-1b, but it cannot easily be extended to other standards.

Do you think there is a need for more formal support of such standards and versions? This would influence some of the design decisions for the parser and affect the base objects.

BR
Maruan Sahyoun


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by John Hewson <jo...@jahewson.com>.
>>> To get that completed we need to revisit the PD model as not all features of PDF are reflected in the matching PD model. That could be done when implementing the profiles.
>> 
>> All the PD classes provide access to the underlying COS model, so there’s no need to expose low-level details in the PD model.
> 
> Yes, I know. Working on the PD model would make the ‘profile’ easier to build and understand, but thinking about it, since one can work at the COS level, that’s the level which needs to be checked. WDYT?

Ultimately it is the COS model which describes the raw content of the PDF file, so yes, most of the checks should probably operate at that level. It will probably be simpler this way too.
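
A minimal sketch of what such a COS-level check could look like, assuming the COS tree is modelled as plain Java maps and lists rather than PDFBox's own COS classes. The class `CosProfileCheck`, the method `findForbidden` and the forbidden keys used in `main` are hypothetical and only illustrate the idea; they are not an authoritative PDF/A rule set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: dictionaries are Maps, arrays are Lists, anything else is a leaf.
class CosProfileCheck {

    // Returns the path of every dictionary key that the profile forbids.
    static List<String> findForbidden(Object root, Set<String> forbiddenKeys) {
        List<String> hits = new ArrayList<>();
        // Identity-based visited set, so reference cycles (e.g. parent links) terminate.
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, "", forbiddenKeys, visited, hits);
        return hits;
    }

    private static void walk(Object node, String path, Set<String> forbidden,
                             Set<Object> visited, List<String> hits) {
        if (node instanceof Map) {
            if (!visited.add(node)) {
                return; // this dictionary was already checked
            }
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
                String childPath = path + "/" + entry.getKey();
                if (forbidden.contains(entry.getKey())) {
                    hits.add(childPath);
                }
                walk(entry.getValue(), childPath, forbidden, visited, hits);
            }
        } else if (node instanceof List) {
            if (!visited.add(node)) {
                return; // this array was already checked
            }
            List<?> items = (List<?>) node;
            for (int i = 0; i < items.size(); i++) {
                walk(items.get(i), path + "[" + i + "]", forbidden, visited, hits);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> action = new LinkedHashMap<>();
        action.put("S", "JavaScript");
        action.put("JS", "app.alert(1)");

        Map<String, Object> catalog = new LinkedHashMap<>();
        catalog.put("Type", "Catalog");
        catalog.put("OpenAction", action);

        Map<String, Object> trailer = new LinkedHashMap<>();
        trailer.put("Root", catalog);
        trailer.put("Encrypt", new LinkedHashMap<String, Object>());

        // Illustrative profile: these two keys are treated as not permitted.
        Set<String> forbidden = new HashSet<>(Arrays.asList("Encrypt", "JS"));
        System.out.println(findForbidden(trailer, forbidden));
    }
}
```

Because the check only looks at keys and values in the tree, it would not matter whether a PD class exists for the structure in question.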

-- John


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
> Great. One more thing...
> 
>> To get that completed we need to revisit the PD model as not all features of PDF are reflected in the matching PD model. That could be done when implementing the profiles.
> 
> All the PD classes provide access to the underlying COS model, so there’s no need to expose low-level details in the PD model.

Yes, I know. Working on the PD model would make the ‘profile’ easier to build and understand, but thinking about it, since one can work at the COS level, that’s the level which needs to be checked. WDYT?

Maruan


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by John Hewson <jo...@jahewson.com>.
Great. One more thing...

> To get that completed we need to revisit the PD model as not all features of PDF are reflected in the matching PD model. That could be done when implementing the profiles.

All the PD classes provide access to the underlying COS model, so there’s no need to expose low-level details in the PD model.

-- John

On 11 Mar 2014, at 00:24, Maruan Sahyoun <sa...@fileaffairs.de> wrote:


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
> 
>> OK - wasn’t precise enough - token types didn’t change but there are newer tokens introduced. 
> 
> Yes.
> 
>> As the syntax has changed do we need version and standards support in the parsing phase then?
> 
> I don’t think so, no. I don’t know what the use-case would be. You’d have to go back and read all seven versions of the PDF Reference and make sure that the parser implements the correct handling for each version; that’s an awful lot of work.

OK - so the parser should concentrate on getting the parsing done according to the spec (which is mostly the case with NonSequentialParser today). We should also have both a standards-compliant and a relaxed way of parsing for files where the base syntax is not correct: standards-compliant parsing (which we don’t have in core but in the PDF/A project) needs to catch such errors, whereas relaxed parsing would ignore them if they can be corrected.
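
To make the strict/relaxed distinction concrete, here is a small sketch, assuming a parser that only looks for the `%PDF-` header marker. `HeaderParser` and its `Mode` enum are hypothetical names, not existing PDFBox classes. In strict mode a header that does not start at offset 0 is reported as an error; in lenient mode the same defect is corrected silently.

```java
// Hypothetical sketch of one parsing step with a strict and a lenient mode.
class HeaderParser {

    enum Mode { STRICT, LENIENT }

    // Extracts the version string (e.g. "1.4") that follows the "%PDF-" marker.
    static String parseVersion(String data, Mode mode) {
        final String marker = "%PDF-";
        int idx = data.indexOf(marker);
        if (idx < 0) {
            // Not recoverable in either mode.
            throw new IllegalArgumentException("no PDF header found");
        }
        if (idx > 0 && mode == Mode.STRICT) {
            // Recoverable defect: only the strict mode reports it.
            throw new IllegalArgumentException("header found at offset " + idx + ", expected 0");
        }
        int start = idx + marker.length();
        int end = start;
        while (end < data.length()
                && (Character.isDigit(data.charAt(end)) || data.charAt(end) == '.')) {
            end++;
        }
        if (end == start) {
            throw new IllegalArgumentException("missing version number after header marker");
        }
        return data.substring(start, end);
    }

    public static void main(String[] args) {
        String clean = "%PDF-1.4\n1 0 obj\n";
        String dirty = "leading junk\n%PDF-1.6\n1 0 obj\n";
        System.out.println(parseVersion(clean, Mode.STRICT));
        System.out.println(parseVersion(dirty, Mode.LENIENT));
    }
}
```

The same switch could be threaded through the other recovery points of a parser, so that a validation project selects the strict mode and ordinary document loading selects the lenient one.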

> 
>> The other way would be to parse what’s in there and do validation etc. purely on the parsing result (COS model, PD model). We need to do that anyway.
> 
> Yes, I prefer this approach. You can always write a tool which inspects a PDDocument and determines whether or not it uses features available in a given PDF version. It seems better to do this as a separate feature than to try and build it into the parser or the PD model directly.

Fine for me - would be something like a ‘profile’ per standard which could be used for validation as well as writing.

To get that completed we need to revisit the PD model as not all features of PDF are reflected in the matching PD model. That could be done when implementing the profiles.

> 
>> What about writing?
> 
> Yes, we want versions for writing, because a user may want to generate e.g. a PDF 1.6 file. This is going to be even more important in the near future because the PDF 2.0 standard is supposed to be introduced in 2014.

There are some base features missing in writing a PDF today, but I think Andreas has something in the works. The ‘profile’ mentioned above could be used for writing too, e.g. to check whether PD model keys are permitted for a certain standard/version or not.

> 
> -- John


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by John Hewson <jo...@jahewson.com>.
> OK - wasn’t precise enough - token types didn’t change but there are newer tokens introduced. 

Yes.

> As the syntax has changed do we need version and standards support in the parsing phase then?

I don’t think so, no. I don’t know what the use-case would be. You’d have to go back and read all seven versions of the PDF Reference and make sure that the parser implements the correct handling for each version; that’s an awful lot of work.

> The other way would be to parse what’s in there and do validation etc. purely on the parsing result (COS model, PD model). We need to do that anyway.

Yes, I prefer this approach. You can always write a tool which inspects a PDDocument and determines whether or not it uses features available in a given PDF version. It seems better to do this as a separate feature than to try and build it into the parser or the PD model directly.
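
As a rough sketch of such a tool, assuming the set of features a document uses has already been collected somehow (that part is left out here): `VersionInspector` and its feature labels are hypothetical names. The version numbers in the table are the PDF 1.x releases that introduced transparency (1.4), object streams and cross-reference streams (1.5) and AES encryption (1.6).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps feature labels to the PDF 1.x minor version that introduced them.
class VersionInspector {

    // Value 5 means "introduced in PDF 1.5". The labels are illustrative only.
    static final Map<String, Integer> INTRODUCED_IN = new HashMap<>();
    static {
        INTRODUCED_IN.put("transparency", 4);
        INTRODUCED_IN.put("object-stream", 5);
        INTRODUCED_IN.put("xref-stream", 5);
        INTRODUCED_IN.put("aes-encryption", 6);
    }

    private static int minorFor(String feature) {
        Integer minor = INTRODUCED_IN.get(feature);
        if (minor == null) {
            throw new IllegalArgumentException("unknown feature: " + feature);
        }
        return minor;
    }

    // Lowest PDF 1.x minor version that covers every used feature (0 if none are used).
    static int requiredMinor(Collection<String> usedFeatures) {
        int required = 0;
        for (String feature : usedFeatures) {
            required = Math.max(required, minorFor(feature));
        }
        return required;
    }

    // Features that are not available in the given target version.
    static List<String> tooNewFor(Collection<String> usedFeatures, int targetMinor) {
        List<String> result = new ArrayList<>();
        for (String feature : usedFeatures) {
            if (minorFor(feature) > targetMinor) {
                result.add(feature);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> used = Arrays.asList("transparency", "xref-stream");
        System.out.println("needs at least PDF 1." + requiredMinor(used));
        System.out.println("not available in PDF 1.4: " + tooNewFor(used, 4));
    }
}
```

Keeping the table outside the parser and the PD model means it can be extended per standard without touching either of them.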

> What about writing?

Yes, we want versions for writing, because a user may want to generate e.g. a PDF 1.6 file. This is going to be even more important in the near future because the PDF 2.0 standard is supposed to be introduced in 2014.

-- John

Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
OK - wasn’t precise enough - token types didn’t change but there are newer tokens introduced. 

As the syntax has changed, do we need version and standards support in the parsing phase then? The other way would be to parse what’s in there and do validation etc. purely on the parsing result (COS model, PD model). We need to do that anyway.

What about writing?

BR
Maruan Sahyoun

On 10 Mar 2014, at 11:43, John Hewson <jo...@jahewson.com> wrote:


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by John Hewson <jo...@jahewson.com>.
>> If the syntax hasn’t changed then there can’t be anything in the parser which is version-specific.
> 
> I think we are talking about two different things here: the parsing process to get the tokens, and the parsing process to follow the PDF file layout and to form and follow the higher level structures such as Xref.

Yes, there are two phases, tokenizing and parsing; sometimes both are called parsing.

> Tokens didn’t change. File layout and higher level structures did, for example Linearization and Xref Streams. Depending on the PDF standard, some are permitted and some are not.

That’s not right. The tokens have changed: “xref” is a keyword and therefore a token. Also, as I said originally, the syntax has changed, because what you call “higher level structures” is actually the syntax.
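
To illustrate the first of the two phases, here is a deliberately simplified tokenizer sketch. It assumes whitespace-separated input and ignores strings, comments and delimiters that touch other tokens, all of which a real PDF lexer has to handle. `TinyTokenizer` and its token labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the tokenizing phase only; no structure is built here.
class TinyTokenizer {

    // A few PDF keywords; the list is intentionally incomplete.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "obj", "endobj", "stream", "endstream", "xref", "trailer", "startxref", "R"));

    // Labels a single whitespace-delimited token.
    static String classify(String token) {
        if (KEYWORDS.contains(token)) {
            return "KEYWORD";
        }
        if (token.startsWith("/")) {
            return "NAME";
        }
        if (token.matches("[+-]?\\d+(\\.\\d+)?")) {
            return "NUMBER";
        }
        if (token.equals("<<") || token.equals(">>")) {
            return "DICT_DELIM";
        }
        return "OTHER";
    }

    // Splits on whitespace and returns "LABEL:text" entries in input order.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        for (String raw : input.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            out.add(classify(raw) + ":" + raw);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("xref 0 1"));
        System.out.println(tokenize("12 0 obj << /Type /XRef >> stream"));
    }
}
```

The second phase, which turns such a token sequence into objects, cross-reference tables and the overall file layout, would sit on top of this and is not shown.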

-- John

On 10 Mar 2014, at 02:32, Maruan Sahyoun <sa...@fileaffairs.de> wrote:


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
I think we are talking about two different things here: the parsing process to get the tokens, and the parsing process to follow the PDF file layout and to form and follow the higher level structures such as Xref. Tokens didn’t change. File layout and higher level structures did, for example Linearization and Xref Streams. Depending on the PDF standard, some are permitted and some are not.

BR
Maruan

On 10 Mar 2014, at 10:06, John Hewson <jo...@jahewson.com> wrote:


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by John Hewson <jo...@jahewson.com>.
> The base syntax has not changed. But the elements described by the base have.


If the syntax hasn’t changed then there can’t be anything in the parser which is version-specific.

-- John

On 10 Mar 2014, at 01:43, Maruan Sahyoun <sa...@fileaffairs.de> wrote:


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi John,

It’s not just about PDF versions but about PDF versions and standards.

The base syntax has not changed. But the elements described by the base have.

BR
Maruan Sahyoun

On 10 Mar 2014, at 09:20, John Hewson <jo...@jahewson.com> wrote:


Re: [DISCUSS] PDFBox and support for PDF versions, PDF standards

Posted by John Hewson <jo...@jahewson.com>.
Hi Maruan

> As of today PDFBox has no formal support for specific PDF versions in a way that a specific version can be enforced, validated ...

Perhaps that is because there is not much demand for this? Nowadays everyone has instant access to the latest version of Adobe Reader, so checking that a PDF can be opened with a specific version of Adobe Reader is not that useful anymore. There might be some niche cases, but I can’t think what they would be. For cases where it’s important that a PDF file is valid, a format such as PDF/A or PDF/X must be used instead, as “vanilla” PDF is ambiguous.

> The PDFBox PDF/A validation does a good job for PDF/A-1b, but it cannot easily be extended to other standards.

Yes, PDF/A is carefully validated because it is for archival purposes, unlike regular PDF files.

> Do you think there is a need for more formal support of such standards and versions? This would influence some of the design decisions for the parser and affect the base objects.


I can’t think of a reason why someone would want to parse a specific PDF version, so my answer is no, I don’t think there is such a need. Has the syntax of PDF even changed that much over the different versions?

-- John