Posted to user@tika.apache.org by Albretch Mueller <lb...@gmail.com> on 2022/10/08 13:22:39 UTC

Is Apache PDFBox based on the Arlington PDF Model? ...

 https://github.com/pdf-association/arlington-pdf-model/

 For whatever reason I (wrongly?) thought that to be the case:

 https://en.wikipedia.org/wiki/Apache_PDFBox

 https://en.wikipedia.org/wiki/COCOMO

 But I am not sure if it makes any functional sense anyway.

 I think it should be relatively easy, and easily maintainable, to code
around that model, which makes me wonder why a project hasn't been
started based on such baseline ideas.

 lbrtchx

RE: Is Apache PDFBox based on the Arlington PDF Model? ...

Posted by Peter Wyatt <pe...@pdfa.org>.

The Arlington PDF Model is all about the model data, NOT the code. The code artifacts are merely PoC hacks used for prototyping and assessing the capabilities and expressiveness of the model itself. The model is a standalone machine-readable definition of every PDF object defined in the ISO PDF 2.0 specification including many data integrity relationships. The model itself is also continuing to grow and expand in terms of scope and expressiveness. 

For an existing and mature project like Apache PDFBox, the value of the data model will most likely lie in checking implementation details (e.g. required-ness of keys, valid data ranges, etc.), test case identification and/or generation, or debugging. Undoubtedly Apache PDFBox will also have grown some level of "permissiveness" to account for real-world malformations found in PDFs - the Arlington PDF Model also provides a means by which such permissiveness can be defined and documented as extensions to the official ISO baseline (the nominal "ground truth" for PDF).
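
To make the "required-ness of keys" idea concrete, here is a minimal Java sketch. It is not taken from PDFBox or from the Arlington code: the two-column TSV layout ("Key", "Required"), the class name RequiredKeyCheck and the sample rows are assumptions made purely for illustration, and the real model files carry more columns than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: check a PDF-like dictionary against key rules read from a TSV table. */
class RequiredKeyCheck {

    /**
     * Parses a simplified, hypothetical two-column TSV ("Key<TAB>Required").
     * Returns key name -> whether the key must be present, in file order.
     */
    static Map<String, Boolean> parseRules(String tsv) {
        Map<String, Boolean> rules = new LinkedHashMap<>();
        String[] lines = tsv.split("\\R");
        for (int i = 1; i < lines.length; i++) { // line 0 is the header
            if (lines[i].isBlank()) {
                continue;
            }
            String[] cols = lines[i].split("\t", -1);
            boolean required = cols.length > 1 && "TRUE".equalsIgnoreCase(cols[1].trim());
            rules.put(cols[0].trim(), required);
        }
        return rules;
    }

    /** Lists every required key that is absent from the given dictionary keys. */
    static List<String> missingRequiredKeys(Map<String, Boolean> rules, Set<String> presentKeys) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Boolean> rule : rules.entrySet()) {
            if (rule.getValue() && !presentKeys.contains(rule.getKey())) {
                missing.add(rule.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Illustrative rows only, not copied from the model.
        String tsv = "Key\tRequired\nType\tTRUE\nPages\tTRUE\nLang\tFALSE\n";
        System.out.println(missingRequiredKeys(parseRules(tsv), Set.of("Type", "Lang"))); // prints [Pages]
    }
}
```

In a real integration the set of present keys would come from the parsed dictionary object rather than from a hard-coded set.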

A significant update is also about to occur to the Arlington PDF Model master branch as a result of use and adoption by others... see the "Extensions" branch.

> -----Original Message-----
> From: Jason Pyeron <jp...@pdinc.us>
> Sent: Sunday, 9 October 2022 1:57 AM
> To: users@pdfbox.apache.org
> Subject: RE: Is Apache PDFBox based on the Arlington PDF Model? ...
> 
> > -----Original Message-----
> > From: Albretch Mueller
> > Sent: Saturday, October 8, 2022 9:24 AM
> >
> >  https://github.com/pdf-association/arlington-pdf-model/
> 
> Interesting project from the PDF Association.
> 
> >
> >  For whatever reason I (wrongly?) thought that to be the case:
> >
> >  https://en.wikipedia.org/wiki/Apache_PDFBox
> >
> >  https://en.wikipedia.org/wiki/COCOMO
> >
> 
> What does COCOMO have to do with the topic of the question?
> 
> >  But I am not sure if it makes any functional sense anyway.
> >
> >  I think it should be relatively easy, and easily maintainable, to code
> > around that model, which makes me wonder why a project hasn't been
> > started based on such baseline ideas.
> 
> To start with, let's look at their initial commit to understand their point(s) of view and development vector:
> 
> commit a512182b24419a8b71895e262135f937ed22f1f9
> Author: Roman Toda <to...@digitaldocuments.org>
> Date:   Tue Feb 4 13:52:49 2020 +0100
> 
>     initial commit
> 
> There are several DLL files and mostly C code: very Windows-centric development, and not about reading/writing
> PDFs in Java.
> 
> Also, this started many years after PDFBox. So, to answer the question in the subject: no. Apache PDFBox cannot be
> based on the Arlington PDF Model, since PDFBox v1 was started in 2008 and v2 was first released in 2015, while
> Arlington's initial commit dates from 2020.
> 
> A quick search of every commit on or before r1904460 (12a38bf88) for 'Arlington' returns no results.
> 
> Next, let's look at their Java code:
> 
> $ find -name '*.java'
> ./gcxml/src/gcxml/Gcxml.java
> ./gcxml/src/gcxml/TSVHandler.java
> ./gcxml/src/gcxml/XMLCreator.java
> ./gcxml/src/gcxml/XMLQuery.java
> 
> Not very much; it just seems to be their GCXML program. From the readme.md:
> 
> GCXML - Java PoC utility
> Java-based proof of concept CLI utility that can:
> 
> convert an Arlington TSV file set into PDF version specific subsets (also as TSV)
> 
> In summary, not sure how or why any of this would be applicable to PDFBox.
> 
> -Jason
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org




RE: Is Apache PDFBox based on the Arlington PDF Model? ...

Posted by Jason Pyeron <jp...@pdinc.us>.

> -----Original Message-----
> From: Albretch Mueller 
> Sent: Saturday, October 8, 2022 9:24 AM
> 
>  https://github.com/pdf-association/arlington-pdf-model/

Interesting project from the PDF Association.

> 
>  For whatever reason I (wrongly?) thought that to be the case:
> 
>  https://en.wikipedia.org/wiki/Apache_PDFBox
> 
>  https://en.wikipedia.org/wiki/COCOMO
> 

What does COCOMO have to do with the topic of the question?

>  But I am not sure if it makes any functional sense anyway.
> 
>  I think it should be relatively easy, and easily maintainable, to code
> around that model, which makes me wonder why a project hasn't been
> started based on such baseline ideas.

To start with, let's look at their initial commit to understand their point(s) of view and development vector:

commit a512182b24419a8b71895e262135f937ed22f1f9
Author: Roman Toda <to...@digitaldocuments.org>
Date:   Tue Feb 4 13:52:49 2020 +0100

    initial commit

There are several DLL files and mostly C code: very Windows-centric development, and not about reading/writing PDFs in Java.

Also, this started many years after PDFBox. So, to answer the question in the subject: no. Apache PDFBox cannot be based on the Arlington PDF Model, since PDFBox v1 was started in 2008 and v2 was first released in 2015, while Arlington's initial commit dates from 2020.

A quick search of every commit on or before r1904460 (12a38bf88) for 'Arlington' returns no results.

Next, let's look at their Java code:

$ find -name '*.java'
./gcxml/src/gcxml/Gcxml.java
./gcxml/src/gcxml/TSVHandler.java
./gcxml/src/gcxml/XMLCreator.java
./gcxml/src/gcxml/XMLQuery.java

Not very much; it just seems to be their GCXML program. From the readme.md:

GCXML - Java PoC utility
Java-based proof of concept CLI utility that can:

convert an Arlington TSV file set into PDF version specific subsets (also as TSV)
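
For anyone wondering what producing a "PDF version specific subset" of a TSV file could look like, here is a rough Java sketch. It is not the GCXML code: the class name VersionSubset, the "SinceVersion" column and its position, and the sample rows are all assumptions for illustration. The idea is simply to keep the rows whose introduction version is not later than the target version.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: keep only the TSV rows introduced at or before a target PDF version. */
class VersionSubset {

    /**
     * Compares dotted versions such as "1.4" and "2.0" numerically, part by part.
     * Throws NumberFormatException if a part is not an integer.
     */
    static int compareVersions(String a, String b) {
        String[] pa = a.trim().split("\\.");
        String[] pb = b.trim().split("\\.");
        for (int i = 0; i < Math.max(pa.length, pb.length); i++) {
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /**
     * Filters TSV lines (header first). The version column index is a parameter
     * because the column layout used here is an assumption, not the model's.
     */
    static List<String> subset(List<String> tsvLines, int versionColumn, String targetVersion) {
        List<String> out = new ArrayList<>();
        if (tsvLines.isEmpty()) {
            return out;
        }
        out.add(tsvLines.get(0)); // always keep the header row
        for (String line : tsvLines.subList(1, tsvLines.size())) {
            String[] cols = line.split("\t", -1);
            if (cols.length > versionColumn
                    && compareVersions(cols[versionColumn], targetVersion) <= 0) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative rows only: key name and the version that introduced it.
        List<String> rows = List.of("Key\tSinceVersion", "Type\t1.0", "Lang\t1.4", "DPartRoot\t2.0");
        subset(rows, 1, "1.7").forEach(System.out::println);
    }
}
```

With the sample rows above and a target of "1.7", the header and the first two data rows are kept and the 2.0 row is dropped.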

In summary, not sure how or why any of this would be applicable to PDFBox.

-Jason



Re: Is Apache PDFBox based on the Arlington PDF Model? ...

Posted by Albretch Mueller <lb...@gmail.com>.

On 10/8/22, Tim Allison <ta...@apache.org> wrote:
>> But I am not sure if it makes any functional sense anyway.
> There's far more to parsing, and to the capabilities that PDFBox and
> other PDF tools offer, than just validating compliance with the spec.

 Well, I figured as much, but the kind of functionality the APDFM XML
offers, so easily exploitable as some sort of SAX listener interface
linked to an array of command objects through XPath-keyed hash tables,
is downright dreamy to me ;-) ... to the point that I feel like starting
work on a PoC right now, to be later added to or merged into the PDFBox
code base. I would like to at least have some PDFBox and/or Tika folks
participate in or watch over what I do.
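
 A rough Java sketch of that idea, using only the JDK's SAX API: a handler tracks the current element path and looks up a command object in a hash table keyed by that path. The class name PathDispatchHandler, the element and attribute names in the sample XML, and the use of plain slash-separated paths instead of full XPath expressions are all assumptions for illustration; this does not reflect the actual Arlington XML layout.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: SAX handler that dispatches to command objects keyed by element path. */
class PathDispatchHandler extends DefaultHandler {

    private final Deque<String> path = new ArrayDeque<>();
    private final Map<String, Consumer<Attributes>> commands;

    PathDispatchHandler(Map<String, Consumer<Attributes>> commands) {
        this.commands = commands;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        // Build a simple path such as "/model/object/key" (not full XPath).
        Consumer<Attributes> command = commands.get("/" + String.join("/", path));
        if (command != null) {
            command.accept(atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.removeLast();
    }

    /** Parses the XML string and runs the matching command for each element. */
    static void parse(String xml, Map<String, Consumer<Attributes>> commands) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new PathDispatchHandler(commands));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical XML, invented for this example.
        String xml = "<model><object name=\"Catalog\">"
                + "<key name=\"Pages\" required=\"true\"/>"
                + "<key name=\"Lang\" required=\"false\"/>"
                + "</object></model>";
        List<String> requiredKeys = new ArrayList<>();
        Map<String, Consumer<Attributes>> commands = new HashMap<>();
        commands.put("/model/object/key", atts -> {
            if ("true".equals(atts.getValue("required"))) {
                requiredKeys.add(atts.getValue("name"));
            }
        });
        parse(xml, commands);
        System.out.println(requiredKeys); // prints [Pages]
    }
}
```

 Keying on exact paths keeps the lookup a single hash-table access per element; supporting wildcards or predicates would need a real XPath engine instead.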

 I am more of a data analyst, corpus-research kind of guy, and I may
have to turn my attention elsewhere once in a while. I think this
would be important code that deserves permanent attention.

 If anyone runs into this thread, I would recommend Peter Wyatt's
paper (April 5th, 2021):

 "Work in progress: Demystifying PDF through a machine-readable definition"

 https://raw.githubusercontent.com/gangtan/LangSec-papers-and-slides/main/langsec21/papers/Wyatt_LangSec21.pdf

 lbrtchx

Re: Is Apache PDFBox based on the Arlington PDF Model? ...

Posted by Tim Allison <ta...@apache.org>.

> which makes me wonder why a project hasn't been
> started based on such baseline ideas.

Arlington was released only a bit ago and is not yet complete.  Lots
more to do.  It is a revolutionary offering in the PDF space, and I
cannot begin to express how grateful I am to have it.

> But I am not sure if it makes any functional sense anyway.
There's far more to parsing, and to the capabilities that PDFBox and
other PDF tools offer, than just validating compliance with the spec.
I do not mean to diminish Arlington when I say this!

I won't speak for PDFBox, but I think these are two related but
different technologies.  I've wrapped a parser that uses Arlington's
grammar checker on my personal GitHub site, if that's of any
interest.

On Sat, Oct 8, 2022 at 9:22 AM Albretch Mueller <lb...@gmail.com> wrote:
>
>  https://github.com/pdf-association/arlington-pdf-model/
>
>  For whatever reason I (wrongly?) thought that to be the case:
>
>  https://en.wikipedia.org/wiki/Apache_PDFBox
>
>  https://en.wikipedia.org/wiki/COCOMO
>
>  But I am not sure if it makes any functional sense anyway.
>
>  I think it should be relatively easy, and easily maintainable, to code
> around that model, which makes me wonder why a project hasn't been
> started based on such baseline ideas.
>
>  lbrtchx