Posted to dev@corinthia.apache.org by Louis Suárez-Potts <lu...@gmail.com> on 2015/02/03 17:17:07 UTC

Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

https://blogs.apache.org/foundation/entry/apache_pdfbox_named_an_open

My guess is that everyone already knows about this :-)

Cheers,
louis


AW: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Edward Zimmermann <Ed...@cib.de>.
Sure,

ISARTOR (not far from where I'm sitting) is a set of PDFs that are not PDF/A-1b compliant. One uses them to make sure that one's PDF/A-1b tests correctly trap the errors. The files are about testing for non-conformance. They are a good first step!
The other collection is just a set of 10 PDF/UA documents to show what good PDF/UA can look like. The reference suite says nothing really about conformance but is a nice "reference" for getting the UA ball rolling....
Both are quite useful but don't really....
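
The idea behind a negative suite like ISARTOR can be sketched in a few lines: every file in the set is known to be non-compliant, so any file a validator accepts is a missed error. This is only an illustration; `find_missed_failures`, `toy_validator` and the byte strings are made-up stand-ins, not part of ISARTOR or of any real PDF/A-1b checker.

```python
from typing import Callable, Dict, List

# Hypothetical validator signature: raw PDF bytes in, True if judged compliant.
Validator = Callable[[bytes], bool]


def find_missed_failures(validator: Validator, negative_suite: Dict[str, bytes]) -> List[str]:
    """Return the names of known-bad files that the validator wrongly accepted."""
    return sorted(name for name, data in negative_suite.items() if validator(data))


def toy_validator(data: bytes) -> bool:
    # Stand-in check only: a real PDF/A-1b validator tests far more than this.
    return data.startswith(b"%PDF-") and b"%%EOF" in data


# Every entry is assumed to be non-compliant, as in a negative test suite.
suite = {
    "no-header.pdf": b"garbage",
    "no-eof.pdf": b"%PDF-1.4\n1 0 obj\n",
    "looks-fine-but-is-not.pdf": b"%PDF-1.4\n%%EOF\n",
}
missed = find_missed_failures(toy_validator, suite)
print(missed)
```

An empty result means the validator trapped every error in the suite; it says nothing about whether the validator also accepts good files.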


-----Original Message-----
From: Dennis E. Hamilton [mailto:dennis.hamilton@acm.org]
Sent: Wednesday, February 4, 2015 16:36
To: dev@corinthia.incubator.apache.org
Subject: RE: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Oh, sorry.  I should not have said "conformance."  What I was thinking of was

<http://www.pdfa.org/2011/08/isartor-test-suite/>
and
<http://www.pdfa.org/publication/pdfua-reference-suite/>.

I note that "conforming" is used with respect to the reference suite, but I don't think those are offered as any determination of compliance.

I do favor what I see of the approach to test suites there.

 - Dennis


-----Original Message-----
From: Edward Zimmermann [mailto:Edward.Zimmermann@cib.de] 
Sent: Wednesday, February 4, 2015 02:47
To: dev@corinthia.incubator.apache.org
Cc: dennis.hamilton@acm.org
Subject: AW: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Hi All,

I'm not familiar with any conformance testing, much less strong, within the PDFA. The Association is mainly about PDF advocacy and serves as a networking umbrella for most of the world's PDF experts. As a Class-A liaison to ISO on PDF we have, for example, had early access to PDF 2.0 via the Association and were able to contribute our reviews. The Association also maintains a PDF/A competence center and provides expertise to a number of organizations.

Conformance testing? Of What?

[ ... ]



RE: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by "Dennis E. Hamilton" <de...@acm.org>.
Oh, sorry.  I should not have said "conformance."  What I was thinking of was

<http://www.pdfa.org/2011/08/isartor-test-suite/>
and
<http://www.pdfa.org/publication/pdfua-reference-suite/>.

I note that "conforming" is used with respect to the reference suite, but I don't think those are offered as any determination of compliance.

I do favor what I see of the approach to test suites there.

 - Dennis


-----Original Message-----
From: Edward Zimmermann [mailto:Edward.Zimmermann@cib.de] 
Sent: Wednesday, February 4, 2015 02:47
To: dev@corinthia.incubator.apache.org
Cc: dennis.hamilton@acm.org
Subject: AW: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Hi All,

I'm not familiar with any conformance testing, much less strong, within the PDFA. The Association is mainly about PDF advocacy and serves as a networking umbrella for most of the world's PDF experts. As a Class-A liaison to ISO on PDF we have, for example, had early access to PDF 2.0 via the Association and were able to contribute our reviews. The Association also maintains a PDF/A competence center and provides expertise to a number of organizations.

Conformance testing? Of What?

[ ... ]



Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by jan i <ja...@apache.org>.
On Wednesday, February 4, 2015, Dave Fisher <da...@comcast.net> wrote:

> Yes, it is interesting to me. I know that PDF is a markup that is based on
> a set of PostScript functions and an object layout specification. It is not
> like PNG - that's a raster bitmap. It is a vector drawing spec. My interest
> is pulling out the content - both text and shapes into a useful set of
> objects. I am not so interested at this time in other features like forms,
> embedded files, and output.
>
> I can read the PDF into an object structure and output HTML5. I can also
> output the objects into roughly equivalent PPTX slides using Apache POI.
>
> Corinthia comes in two ways for me.
>
> (1) An HTML5 format that is targeting interchange with Office Document
> formats.
>
> (2) An intermediate format that may be exported in any format that makes
> sense.
>
> So I am looking for Corinthia to allow pluggable DocFormats.


Pluggable filters are something I tried to persuade Peter of earlier; maybe
it will be easier when the new core API is ready.

rgds
jan i


-- 
Sent from My iPad, sorry for any misspellings.

Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Dave Fisher <da...@comcast.net>.
Yes, it is interesting to me. I know that PDF is a markup that is based on a set of PostScript functions and an object layout specification. It is not like PNG - that's a raster bitmap. It is a vector drawing spec. My interest is pulling out the content - both text and shapes into a useful set of objects. I am not so interested at this time in other features like forms, embedded files, and output.

I can read the PDF into an object structure and output HTML5. I can also output the objects into roughly equivalent PPTX slides using Apache POI.

Corinthia comes in two ways for me.

(1) An HTML5 format that is targeting interchange with Office Document formats.

(2) An intermediate format that may be exported in any format that makes sense.

So I am looking for Corinthia to allow pluggable DocFormats.

Regards,
Dave
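
A minimal sketch of what "pluggable DocFormats" could look like: each format registers a reader into a shared intermediate form and/or a writer out of it, and conversion is just reader followed by writer. Everything here is hypothetical (`FormatRegistry`, the format names, HTML text as the intermediate form); it is not Corinthia's API.

```python
import html
from typing import Callable, Dict

# Assumed shapes: a reader turns file bytes into intermediate HTML text,
# a writer turns intermediate HTML text back into file bytes.
Reader = Callable[[bytes], str]
Writer = Callable[[str], bytes]


class FormatRegistry:
    """Maps a format name to optional reader and writer callables."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {}
        self._writers: Dict[str, Writer] = {}

    def register_reader(self, name: str, reader: Reader) -> None:
        self._readers[name] = reader

    def register_writer(self, name: str, writer: Writer) -> None:
        self._writers[name] = writer

    def convert(self, data: bytes, src: str, dst: str) -> bytes:
        if src not in self._readers:
            raise KeyError(f"no reader registered for {src!r}")
        if dst not in self._writers:
            raise KeyError(f"no writer registered for {dst!r}")
        intermediate = self._readers[src](data)
        return self._writers[dst](intermediate)


registry = FormatRegistry()
# Toy plug-ins: plain text in, HTML bytes out.
registry.register_reader("txt", lambda raw: "<p>" + html.escape(raw.decode("utf-8")) + "</p>")
registry.register_writer("html", lambda doc: doc.encode("utf-8"))
result = registry.convert(b"a < b", "txt", "html")
print(result)
```

A PDF reader would then be one more `register_reader` call living outside the core, which is the point of making the formats pluggable.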



Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Louis S <lu...@gmail.com>.

Louis

> On 4 Feb 2015, at 13:55, jan i <ja...@apache.org> wrote:
> 
>> On 4 February 2015 at 19:51, Louis S <lu...@gmail.com> wrote:
>> 
>> I posted on this to see if pdfbox could offer insight as it is taken up.
>> Dave pointed out that the functionality of pdfbox was interesting to his
>> company.
>> 
> 
> And I think your posting was interesting information (such information is
> needed to see what moves out there). But I do not think we currently should
> think about putting it into Corinthia.
> 
No objections.

> rgds
> jan i.

Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by jan i <ja...@apache.org>.
On 4 February 2015 at 19:51, Louis S <lu...@gmail.com> wrote:

> I posted on this to see if pdfbox could offer insight as it is taken up.
> Dave pointed out that the functionality of pdfbox was interesting to his
> company.
>

And I think your posting was interesting information (such information is
needed to see what moves out there). But I do not think we currently should
think about putting it into Corinthia.

rgds
jan i.



Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Louis S <lu...@gmail.com>.
I posted on this to see if pdfbox could offer insight as it is taken up. Dave pointed out that the functionality of pdfbox was interesting to his company.

Louis


Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by jan i <ja...@apache.org>.
On Wednesday, February 4, 2015, Peter Kelly <ke...@gmail.com> wrote:

> > On 4 Feb 2015, at 5:47 pm, Edward Zimmermann <edward.zimmermann@cib.de> wrote:
> >
> > Does this have anything to do with Corinthia? No. Corinthia is about
> content and especially word processing formats (OOXML, ODF etc.)..
> Corinthia is at its core about pragmatic fidelity. The point of the
> bidirectional transformation model is to be able to reduce fidelity
> demands. Unless the project wants to get sidetracked into HiFi rendering
> (of DOCX or ODT) it's completely outside of the scope….
>
> I think of PDF in the same way as I do PNG. It’s intended as an output
> format, not an input format. I know there are tools out there which are
> effectively half of an OCR system which can reconstruct a source document
> by inferring the logical structure from the layout (e.g. where a paragraph
> begins and ends), though this is quite a difficult problem and I’m not sure
> that it’d be within the scope of Corinthia (though if someone has ideas on
> this and wants to work on it, I’m all for it - it’s just a very difficult
> and very different task to writing filters for all the other formats we’ve
> discussed).

+1 I think we currently have other more important tasks in corinthia.


rgds
jan i


-- 
Sent from My iPad, sorry for any misspellings.

Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Peter Kelly <ke...@gmail.com>.
> On 4 Feb 2015, at 5:47 pm, Edward Zimmermann <ed...@cib.de> wrote:
> 
> Does this have anything to do with Corinthia? No. Corinthia is about content and especially word processing formats (OOXML, ODF, etc.). Corinthia is at its core about pragmatic fidelity. The point of the bidirectional transformation model is to be able to reduce fidelity demands. Unless the project wants to get sidetracked into HiFi rendering (of DOCX or ODT) it's completely outside of the scope….

I think of PDF in the same way as I do PNG. It’s intended as an output format, not an input format. I know there are tools out there which are effectively half of an OCR system which can reconstruct a source document by inferring the logical structure from the layout (e.g. where a paragraph begins and ends), though this is quite a difficult problem and I’m not sure that it’d be within the scope of Corinthia (though if someone has ideas on this and wants to work on it, I’m all for it - it’s just a very difficult and very different task to writing filters for all the other formats we’ve discussed).
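
To give a feel for the "inferring where a paragraph begins and ends" part, here is a deliberately naive sketch: given text lines with their vertical positions, start a new paragraph whenever the gap to the previous line is clearly larger than normal line spacing. The input shape, the `group_paragraphs` name and the 1.5 threshold are all assumptions for illustration; real documents (columns, indents, tables, headers) are exactly why this is a hard problem.

```python
from typing import List, Optional, Tuple

# (top_y, text): one positioned text line, with y growing down the page.
Line = Tuple[float, str]


def group_paragraphs(lines: List[Line], line_height: float, gap_factor: float = 1.5) -> List[str]:
    """Start a new paragraph whenever the vertical step between consecutive
    lines exceeds gap_factor * line_height."""
    paragraphs: List[List[str]] = []
    prev_y: Optional[float] = None
    for y, text in sorted(lines):
        if prev_y is None or (y - prev_y) > gap_factor * line_height:
            paragraphs.append([])
        paragraphs[-1].append(text)
        prev_y = y
    return [" ".join(p) for p in paragraphs]


page = [
    (100.0, "PDF stores glyph positions,"),
    (112.0, "not paragraphs."),
    (140.0, "A larger gap hints"),
    (152.0, "at a new paragraph."),
]
paragraphs = group_paragraphs(page, line_height=12.0)
print(paragraphs)
```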

On the other side is output to PDF - that is, typesetting. This is something I also think would be outside the scope of the project (at least based on my understanding of people’s interests to date). We basically rely on separate programs to do the typesetting of a document produced by the library, e.g. LaTeX, WebKit/other browser engines.

--
Dr. Peter M. Kelly
kellypmk@gmail.com
http://www.kellypmk.net/

PGP key: http://www.kellypmk.net/pgp-key
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)


AW: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Edward Zimmermann <Ed...@cib.de>.
Hi All,

I'm not familiar with any conformance testing, much less strong, within the PDFA. The Association is mainly about PDF advocacy and serves as a networking umbrella for most of the world's PDF experts. As a Class-A liaison to ISO on PDF we have, for example, had early access to PDF 2.0 via the Association and were able to contribute our reviews. The Association also maintains a PDF/A competence center and provides expertise to a number of organizations.

Conformance testing? Of What?

The PDF standard does NOT specify any methods for conformance testing of PDF files or renderers (readers, printers etc.). A conforming application is one that follows the standard! :-)

Rendering? One of the weaknesses of PDF has, in fact, been that a few features in the specification have not been well-defined (leading many to view Adobe's Reader as a kind of test reference despite a number of clear cases where Adobe does the wrong thing) and many others are not widely implemented. To my knowledge no vendor implements the complete standard, not even Adobe (and Adobe does not get everything it does implement right).

There are a number of conformance testing suites about (for example from Quality Logic) and nearly all use Adobe's rendering engine as reference. I know of no freely available, exhaustive PDF rendering conformance test suite. At CIB we have our own regression test sets for our own use cases and when in doubt we test our rendering against not just Adobe but a number of other engines.

Part of the goal of PDF 2.0 is to remove (or deprecate) features that are hardly implemented or are more-or-less a vendor standard, such as XFA, and to try to clear up the ambiguities and tighten things up. It's within 2.0 that we might eventually be able to speak about a grammar; unfortunately PDF 2.0 builds on PDF 1.7 and so needs to carry a lot of historical baggage.

The other kind of conformance is to a number of PDF standards (or profiles) such as PDF/A, which itself has a number of sub-flavors. Here it's about trying to gauge if the PDF is OK, not whether an application is doing the right thing with it. The standard tool is Adobe Preflight from Callas. It's expensive and constantly changing (what passes as OK in one version might fail in the next, or vice versa). PDFBox has its own preflight for PDF/A-1.
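
As a small illustration of the difference between what a file claims and what a preflight verifies: a PDF/A file declares its flavor in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below only reads that declaration from raw bytes, assuming the XMP packet is stored uncompressed; `declared_pdfa_flavor` is a made-up helper, and finding a declaration proves nothing about actual conformance.

```python
import re
from typing import Optional, Tuple

# Match both XMP serializations: attribute form  pdfaid:part="1"
# and element form  <pdfaid:part>1</pdfaid:part>.
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def declared_pdfa_flavor(data: bytes) -> Optional[Tuple[int, Optional[str]]]:
    """Return (part, conformance level) as declared in the XMP, or None
    when no pdfaid:part declaration is found. This is a claim, not a verdict."""
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").upper() if conf else None
    return int(part.group(1)), level


attr_form = b'<rdf:Description pdfaid:part="1" pdfaid:conformance="B"/>'
elem_form = b"<pdfaid:part>2</pdfaid:part><pdfaid:conformance>U</pdfaid:conformance>"
print(declared_pdfa_flavor(attr_form), declared_pdfa_flavor(elem_form))
```

Checking that the rest of the file actually lives up to the declared flavor is the job of a preflight tool.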

PDF/A testing is important since the point of PDF/A is the hope that such files can better survive over time.

http://duff-johnson.com/wp-content/uploads/2014/01/PDFValidationDreamOrYawn.pdf


Within the Preforma project (http://www.preforma-project.eu/) there is a sub-task to develop a PDF/A preflight.

http://www.dpconline.org/newsroom/latest-news/1399-dpc-members-invited-to-take-part-in-new-verapdf-project-webinar-for-review-of-functional-specification-tuesday-3rd-january-2015

Does this have anything to do with Corinthia? No. Corinthia is about content and especially word processing formats (OOXML, ODF, etc.). Corinthia is at its core about pragmatic fidelity. The point of the bidirectional transformation model is to be able to reduce fidelity demands. Unless the project wants to get sidetracked into HiFi rendering (of DOCX or ODT) it's completely outside of the scope...


-----Original Message-----
From: Dennis E. Hamilton [mailto:dennis.hamilton@acm.org]
Sent: Tuesday, February 3, 2015 18:13
To: dev@corinthia.incubator.apache.org
Subject: RE: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Adding to the remark by Louis.  The PDF Association also has some strong conformance testing and that, combined with a way to examine and test PDFs handled/produced by Corinthia using PDFBox, even if outside of Corinthia proper, is a valuable (side)-opportunity.

-----Original Message-----
From: Louis Suárez-Potts [mailto:luispo@gmail.com] 
Sent: Tuesday, February 3, 2015 08:29
To: dev@corinthia.incubator.apache.org
Subject: Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

[ ... ]. I was thinking more in terms of PDFBox letting those with JVMs manipulate PDFs ad hoc. As a lot of enterprise docs are PDFs, the utility of the service seems plain. (That OpenOffice can do this, too, to a limited degree, as can other open source applications, is known; that they are not used this way and instead the maximally expensive options are used just goes to show you that nature doesn’t just abhor a vacuum, it sucks.)

But to return to the point. If one aspect of Corinthia is to enable the manipulation of documents, then it bears watching how other similar, if by no means congruent or identical, services fare in the market. More broadly: to investigate the possibility of cooperation, if not collaboration, on matters of mutual interest.

louis


Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Louis Suárez-Potts <lu...@gmail.com>.
> On 03-02-2015, at 19:22, Dave Fisher <da...@comcast.net> wrote:
> 
> PDFBox is pretty sweet.
> 
> You might recall my Osmosis talk at Apachecon Denver on PDF to PPTX.

> 
> 
Is the talk online? 


> At work my team has also created PDF to HTML5 (SVG) conversion with recomposition of text and shapes. This is why I want a good API and some way to plug DocFormat.

I think it would be pretty powerful as a combo, and would open a world of archives (national, government, educational, and also corporate) to the consideration of applications like Apache OpenOffice, which could then work with the range of static documents without committing the institution to any serious deviation of course. I.e., it could have its cake and eat it too.

Right now, I think one must buy costly Adobe products to obtain full PDF editing; .doc(x) is also included? But probably not full OOXML (or MS’s implementation thereof).


> 
> This is definitely a part of my interest in Corinthia.

I think focusing on a specific functionality (PDF/PPTX/Corinthia) as a specific goal could be a means of attracting developers. It’s easier to represent, to market. And I think I know already of at least one group that would be interested in this, where "this" represents the editing functionality of Corinthia as well as its converting capability.

-louis


Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Dave Fisher <da...@comcast.net>.
PDFBox is pretty sweet.

You might recall my Osmosis talk at ApacheCon Denver on PDF to PPTX.

At work my team has also created PDF to HTML5 (SVG) conversion with recomposition of text and shapes. This is why I want a good API and some way to plug DocFormat.

This is definitely a part of my interest in Corinthia.

Regards,
Dave

On Feb 3, 2015, at 9:12 AM, Dennis E. Hamilton wrote:

> Adding to the remark by Louis.  The PDF Association also has some strong conformance testing, and that, combined with a way to examine and test PDFs handled/produced by Corinthia using PDFBox, even if outside of Corinthia proper, is a valuable (side) opportunity.
> 
> -----Original Message-----
> From: Louis Suárez-Potts [mailto:luispo@gmail.com] 
> Sent: Tuesday, February 3, 2015 08:29
> To: dev@corinthia.incubator.apache.org
> Subject: Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog
> 
> [ ... ]. I was thinking more along the lines that PDFBox lets those with JVMs manipulate PDFs ad hoc. As a lot of enterprise docs are PDFs, the utility of the service seems plain. (That OpenOffice can do this too, to a limited degree, as can other open source applications, is known; that they are not used this way, and that the maximally expensive options are used instead, just goes to show you that nature doesn’t just abhor a vacuum, it sucks.)
> 
> But to return to the point. If one aspect of Corinthia is to enable the manipulation of documents, then it bears watching how other similar, if by no means congruent or identical, services fare in the market. More broadly, it is worth investigating the possibility of cooperation, if not collaboration, where there is mutual interest.
> 
> louis
> 


RE: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by "Dennis E. Hamilton" <de...@acm.org>.
Adding to the remark by Louis.  The PDF Association also has some strong conformance testing, and that, combined with a way to examine and test PDFs handled/produced by Corinthia using PDFBox, even if outside of Corinthia proper, is a valuable (side) opportunity.

-----Original Message-----
From: Louis Suárez-Potts [mailto:luispo@gmail.com] 
Sent: Tuesday, February 3, 2015 08:29
To: dev@corinthia.incubator.apache.org
Subject: Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

[ ... ]. I was thinking more along the lines that PDFBox lets those with JVMs manipulate PDFs ad hoc. As a lot of enterprise docs are PDFs, the utility of the service seems plain. (That OpenOffice can do this too, to a limited degree, as can other open source applications, is known; that they are not used this way, and that the maximally expensive options are used instead, just goes to show you that nature doesn’t just abhor a vacuum, it sucks.)

But to return to the point. If one aspect of Corinthia is to enable the manipulation of documents, then it bears watching how other similar, if by no means congruent or identical, services fare in the market. More broadly, it is worth investigating the possibility of cooperation, if not collaboration, where there is mutual interest.

louis


Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by jan i <ja...@apache.org>.
On 3 February 2015 at 17:29, Louis Suárez-Potts <lu...@gmail.com> wrote:

>
> > On 03-02-2015, at 11:22, jan i <ja...@apache.org> wrote:
> >
> > On Tuesday, February 3, 2015, Louis Suárez-Potts <lu...@gmail.com>
> wrote:
> >
> >> https://blogs.apache.org/foundation/entry/apache_pdfbox_named_an_open
> >>
> >> My guess is that everyone already knows about this :-)
> >
> > Yup, but did I overlook the relevance for Corinthia?
> >
> > thanks for the info
> > rgds
> > jan i
> >
>
> Not directly to Corinthia. I was thinking more along the lines that PDFBox
> lets those with JVMs manipulate PDFs ad hoc. As a lot of enterprise docs are
> PDFs, the utility of the service seems plain. (That OpenOffice can do this
> too, to a limited degree, as can other open source applications, is known;
> that they are not used this way, and that the maximally expensive options
> are used instead, just goes to show you that nature doesn’t just abhor a
> vacuum, it sucks.)
>
> But to return to the point. If one aspect of Corinthia is to enable the
> manipulation of documents, then it bears watching how other similar, if by
> no means congruent or identical, services fare in the market. More broadly,
> it is worth investigating the possibility of cooperation, if not
> collaboration, where there is mutual interest.
>
You are completely right, watching is good, since it keeps us up to date
with what is hot out there.

I was simply afraid I had missed a point, which would not be the first time.

rgds
jan I.

>
> louis
> >>
> >> Cheers,
> >> louis
> >>
> >>
> >
> > --
> > Sent from My iPad, sorry for any misspellings.
>
>

Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by Louis Suárez-Potts <lu...@gmail.com>.
> On 03-02-2015, at 11:22, jan i <ja...@apache.org> wrote:
> 
> On Tuesday, February 3, 2015, Louis Suárez-Potts <lu...@gmail.com> wrote:
> 
>> https://blogs.apache.org/foundation/entry/apache_pdfbox_named_an_open
>> 
>> My guess is that everyone already knows about this :-)
> 
> Yup, but did I overlook the relevance for Corinthia?
> 
> thanks for the info
> rgds
> jan i
> 

Not directly to Corinthia. I was thinking more along the lines that PDFBox lets those with JVMs manipulate PDFs ad hoc. As a lot of enterprise docs are PDFs, the utility of the service seems plain. (That OpenOffice can do this too, to a limited degree, as can other open source applications, is known; that they are not used this way, and that the maximally expensive options are used instead, just goes to show you that nature doesn’t just abhor a vacuum, it sucks.)

But to return to the point. If one aspect of Corinthia is to enable the manipulation of documents, then it bears watching how other similar, if by no means congruent or identical, services fare in the market. More broadly, it is worth investigating the possibility of cooperation, if not collaboration, where there is mutual interest.

louis
>> 
>> Cheers,
>> louis
>> 
>> 
> 
> -- 
> Sent from My iPad, sorry for any misspellings.


Re: Apache™ PDFBox™ named an Open Source Partner Organization of the PDF Association : The Apache Software Foundation Blog

Posted by jan i <ja...@apache.org>.
On Tuesday, February 3, 2015, Louis Suárez-Potts <lu...@gmail.com> wrote:

> https://blogs.apache.org/foundation/entry/apache_pdfbox_named_an_open
>
> My guess is that everyone already knows about this :-)

Yup, but did I overlook the relevance for Corinthia?

thanks for the info
rgds
jan i

>
> Cheers,
> louis
>
>

-- 
Sent from My iPad, sorry for any misspellings.