Posted to dev@pdfbox.apache.org by Guillaume Bailleul <gb...@gmail.com> on 2011/06/09 22:08:29 UTC

Proposition of donation of a PDF/A validator to the PDFBox project

Hi,

Last year, some colleagues and I developed a PDF/A validator. The result of
our work is now distributed under the Apache License 2.0. We built it because
we had not found any open source validator. It now appears to work well. In
fact, it has been working for a while, but it was hard to find time for it.
We are now ready to donate it to Apache. Today this validator is called
PaDaF, because the name contains only the letters of PDF and A. The source
repository is on GitHub:
https://github.com/gba-awl/padaf

Let me now explain what PDF/A is, why I think it could be a part of PDFBox,
and how we built it.

PDF/A is an ISO standard for long-term archiving of documents. It describes
how a PDF document must be built to ensure it can be reproduced in the
unforeseeable future. Our tool checks the conformance of a document to this
specification. You can learn more about PDF/A on the competence center web
site:
http://www.pdfa.org
This web site offers a set of more than 200 invalid PDF/A files. A complete
PDF/A validator should find at least the error in each of these documents.

PaDaF is mainly based on a stream parser (built with JavaCC) and PDFBox.
That's why I think it could be integrated into the PDFBox suite. The main
artefacts are two jars: preflight, the validator API, and xmpbox, an API for
XMP manipulation. A 'jar with dependencies' also exists and can be used from
the command line.

First we tried to use jempbox for the XMP metadata (an XML block in the PDF
that contains metadata). But last year, jempbox was too limited for our
needs. Because meeting them would have required modifying the jempbox
interface, we decided to write xmpbox from scratch. Today xmpbox is able to
read or generate the XMP block of a PDF. We use it to generate metadata in
PDF files. It can be used on its own, without PaDaF.

The validation of the file is done with a JavaCC grammar, and PDFBox is used
to load objects when further checks are needed.
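
To give an idea of the kind of file-level check a stream parser performs
before any object is loaded, here is a minimal sketch in plain Java. It is
not PaDaF code and does not use JavaCC or PDFBox; the class and method names
are hypothetical. It assumes three PDF/A-1 file structure rules: the header
"%PDF-1.n" at byte offset 0, a following comment line with at least four
bytes above 127, and nothing after the last "%%EOF" except one optional
end-of-line marker.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not part of PaDaF, preflight, or PDFBox.
public class PdfASyntaxSketch {

    // Length of the end-of-line marker (CR LF, CR, or LF) at index i, or 0 if none.
    static int eolLength(byte[] d, int i) {
        if (i >= d.length) return 0;
        if (d[i] == '\r') return (i + 1 < d.length && d[i + 1] == '\n') ? 2 : 1;
        return d[i] == '\n' ? 1 : 0;
    }

    // The file must start with "%PDF-1.n" at byte offset 0, n being a digit.
    static boolean hasPdfHeader(byte[] d) {
        byte[] prefix = "%PDF-1.".getBytes(StandardCharsets.US_ASCII);
        if (d.length <= prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (d[i] != prefix[i]) return false;
        }
        return d[prefix.length] >= '0' && d[prefix.length] <= '9';
    }

    // The header line must be followed by a comment: '%' then at least four bytes > 127.
    static boolean hasBinaryComment(byte[] d) {
        int i = 0;
        while (i < d.length && eolLength(d, i) == 0) i++; // find end of header line
        i += eolLength(d, i);                              // skip one EOL marker
        if (i >= d.length || d[i] != '%') return false;
        int high = 0;
        for (int j = i + 1; j < d.length && (d[j] & 0xFF) > 127; j++) high++;
        return high >= 4;
    }

    // Nothing may follow the last "%%EOF" except a single optional EOL marker.
    static boolean endsWithEof(byte[] d) {
        byte[] marker = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        int end = d.length;
        if (end >= 2 && d[end - 2] == '\r' && d[end - 1] == '\n') {
            end -= 2;
        } else if (end >= 1 && (d[end - 1] == '\r' || d[end - 1] == '\n')) {
            end -= 1;
        }
        if (end < marker.length) return false;
        for (int i = 0; i < marker.length; i++) {
            if (d[end - marker.length + i] != marker[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Tiny synthetic sample, just enough to exercise the three checks.
        byte[] sample = "%PDF-1.4\n%\u00e2\u00e3\u00cf\u00d3\n1 0 obj\nendobj\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("header: " + hasPdfHeader(sample)
                + ", binary comment: " + hasBinaryComment(sample)
                + ", eof: " + endsWithEof(sample));
    }
}
```

A real validator has to go much further (cross reference table, fonts,
metadata, color spaces and so on), which is where loading objects through
PDFBox comes in.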

In previous versions, we had to patch PDFBox to make all our tests pass.
Since all the patches we proposed have been included (mostly font-related),
we can now use the standard 1.5.0 version of PDFBox. It is also compatible
with the current head version of PDFBox as I write this message.

So today, we are ready to donate it and let it evolve with PDFBox. There is
work to do on the code to make it comply with Apache rules. Let us know if
this donation has a place in PDFBox and if there are some hands (and brains)
to help us.

I know that this mail was quite long, and its English quite clumsy (making
it longer!), but maybe I forgot some piece of information or you have some
questions, so don't hesitate to ask.

Regards,

Guillaume

Re: Proposition of donation of a PDF/A validator to the PDFBox project

Posted by Jukka Zitting <ju...@gmail.com>.
Hi Guillaume,

On Thu, Jun 9, 2011 at 10:08 PM, Guillaume Bailleul
<gb...@gmail.com> wrote:
> So today, we are ready to donate it and let it evolve with PDFBox. There is
> work to do on the code to make it comply with Apache rules. Let us know if
> this donation has a place in PDFBox and if there are some hands (and brains)
> to help us.

The consensus seems to be that we'd like to accept PaDaF as a part of
PDFBox and welcome you to join us in continuing its development here.

Since PaDaF has been developed outside Apache, we'll need to follow
the IP clearance process [1,2] to make sure all the legal bits are in
order. Here's a rough outline of how we could proceed with this:

a) You file a new feature request [3] for this and attach the latest
PaDaF sources there.

b) You submit a Software Grant [4] for the PaDaF sources.

c) You submit Individual Contributor License Agreements (ICLA) [5] for
anyone who has been working on the PaDaF sources and wants to continue
doing so within Apache.

d) If your work on PaDaF/PDFBox is a part of your job, your employer
should also submit a Corporate Contributor License Agreement (CCLA)
[6]. Note that the software grant from step b) can be combined with
the CCLA (see Schedule B).

e) We review the submission and vote on accepting it.

f) Assuming the vote passes, we'll fill in the IP clearance form [2]
and submit it to the Apache Incubator for review and for the record.

g) Once everything is clear, we import the PaDaF sources to PDFBox and
set up your committer accounts.

I'd expect the whole process to take about one or two weeks. If you
have any questions, feel free to ask. :-)

[1] http://incubator.apache.org/ip-clearance/index.html
[2] http://incubator.apache.org/ip-clearance/ip-clearance-template.html
[3] https://issues.apache.org/jira/browse/PDFBOX
[4] http://www.apache.org/licenses/software-grant.txt
[5] http://www.apache.org/licenses/icla.txt
[6] http://www.apache.org/licenses/cla-corporate.txt

BR,

Jukka Zitting