
Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2013/03/26 14:21:46 UTC

[PDFBox 2.0] Ideas

Hi there,

here is a rough summary of some ideas I have for a potential pdfbox 2.0 release. Maybe we could capture these as part of a wiki or jira ticket so we can add and agree on some of these if we want to. As soon as we have agreement we could have individual tickets for them.

WDYT?

# rearchitect PDF parsing into a lexer, an incremental (non-caching) parser and a caching parser
o the lexer would be the low-level component delivering tokens to the parser. A sample implementation exists as part of PDFBOX-1000. The benefit would be clean low-level handling of tokens. Although I proposed the lexer, I'm not totally happy with the current implementation. That's something for another mail/ticket ...
o the incremental (non-caching) parser would allow page-by-page, forward-only processing to support text extraction, merging, splitting … The benefit would be lower memory consumption as well as potentially faster processing
o the caching parser would support applications such as PDFDebugger or PDFReader
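
To make the lexer/parser split above a bit more concrete, here is a minimal sketch of a forward-only lexer handing tokens to whatever parser sits on top. It is not the PDFBOX-1000 implementation: the names TokenKind, Token and SketchLexer are made up for illustration, and only a small subset of PDF syntax is covered (no strings, comments or streams).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token kinds for this sketch; these are not PDFBox types. */
enum TokenKind { DICT_OPEN, DICT_CLOSE, ARRAY_OPEN, ARRAY_CLOSE, NAME, NUMBER, KEYWORD }

/** A single token: its kind plus the raw text it was read from. */
final class Token {
    final TokenKind kind;
    final String text;

    Token(TokenKind kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    @Override
    public String toString() {
        return kind + "(" + text + ")";
    }
}

/** Forward-only lexer over a small subset of PDF syntax (assumption: no strings, comments or streams). */
final class SketchLexer {
    private final String input;
    private int pos = 0;

    SketchLexer(String input) {
        this.input = input;
    }

    /** Returns the next token, or null once the input is exhausted. */
    Token next() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) {
            return null;
        }
        if (input.startsWith("<<", pos)) { pos += 2; return new Token(TokenKind.DICT_OPEN, "<<"); }
        if (input.startsWith(">>", pos)) { pos += 2; return new Token(TokenKind.DICT_CLOSE, ">>"); }
        char first = input.charAt(pos);
        if (first == '[') { pos++; return new Token(TokenKind.ARRAY_OPEN, "["); }
        if (first == ']') { pos++; return new Token(TokenKind.ARRAY_CLOSE, "]"); }

        // Everything else runs up to the next whitespace or delimiter character.
        int start = pos++;
        while (pos < input.length() && !isDelimiter(input.charAt(pos))) {
            pos++;
        }
        String text = input.substring(start, pos);
        if (first == '/') {
            return new Token(TokenKind.NAME, text);
        }
        if (text.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
            return new Token(TokenKind.NUMBER, text);
        }
        return new Token(TokenKind.KEYWORD, text);
    }

    private static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || "/[]<>".indexOf(c) >= 0;
    }

    /** Convenience for small inputs; a streaming parser would call next() directly. */
    static List<Token> tokenize(String input) {
        SketchLexer lexer = new SketchLexer(input);
        List<Token> tokens = new ArrayList<>();
        for (Token t = lexer.next(); t != null; t = lexer.next()) {
            tokens.add(t);
        }
        return tokens;
    }
}
```

An incremental parser would pull tokens through next() and drop them once a page is done, while a caching parser could keep the objects it builds from them. Both would share the same lexer.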

# handling of pdf versions
the current implementation is a mix of PDF 1.4 and some ad hoc additions, without a clear distinction between what is and is not supported. We could add some support for explicitly handling versions in pdfbox, e.g. by marking certain methods and properties with their PDF version support level. This could in addition be a good basis for PDF/A and other compliance checks.
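
One way such marking could look is a runtime annotation that a compliance check reads via reflection. This is only a sketch under assumptions: the annotation SincePdf, the class SketchPage and the helper PdfVersionCheck are hypothetical names and not part of pdfbox. The version numbers attached to the two sample methods are illustrative.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

/** Hypothetical marker: the lowest PDF version a method relies on. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SincePdf {
    int major();
    int minor();
}

/** Hypothetical page-like class, used only to carry the annotations. */
final class SketchPage {
    @SincePdf(major = 1, minor = 0)
    public String mediaBox() {
        return "[0 0 612 792]";
    }

    @SincePdf(major = 1, minor = 4)
    public String transparencyGroup() {
        return "/Transparency";
    }
}

final class PdfVersionCheck {
    /** True if the named no-argument method is usable in a document of the given PDF version. */
    static boolean isSupported(Class<?> type, String methodName, int major, int minor) {
        Method method;
        try {
            method = type.getDeclaredMethod(methodName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
        SincePdf since = method.getAnnotation(SincePdf.class);
        if (since == null) {
            return true; // unannotated methods declare no version requirement
        }
        return since.major() < major
                || (since.major() == major && since.minor() <= minor);
    }
}
```

A PDF/A or similar compliance check could then walk the methods it is about to use and reject those whose declared version is above the target level.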

# handle large pdf files
apart from the parsing itself, pdfbox does not always handle large pdf files well, because some of the references are implemented as int instead of long
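
The int versus long problem can be shown in isolation. The sketch below does not point at any particular pdfbox field; OffsetSketch and its methods are hypothetical. It relies on the fact that a byte offset in a classic cross-reference table entry is written with ten digits, so it can exceed Integer.MAX_VALUE (2147483647).

```java
final class OffsetSketch {
    /** Narrowing a parsed offset to int silently wraps above Integer.MAX_VALUE. */
    static int offsetAsInt(String digits) {
        return (int) Long.parseLong(digits);
    }

    /** Keeping the offset as long preserves the full value. */
    static long offsetAsLong(String digits) {
        return Long.parseLong(digits);
    }

    /** A checked narrowing: throws ArithmeticException instead of wrapping. */
    static int offsetAsIntChecked(String digits) {
        return Math.toIntExact(Long.parseLong(digits));
    }
}
```

For an offset of 3000000000 the long variant keeps the value, the plain int cast yields a negative number, and the checked variant fails loudly, which is usually preferable to seeking to a wrong position.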

# split pdfbox into modules to support use cases such as text extraction and merge with the minimum number of classes needed. More app-like tools such as the PDFDebugger or PDFReader could be additional modules.

With kind regards

Maruan Sahyoun

Re: [PDFBox 2.0] Ideas

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

+1 for releasing together

Maruan Sahyoun

On 31.03.2013 at 19:54, Guillaume Bailleul <gb...@gmail.com> wrote:

> Hi all,
> 
> I agree with Timo: pdfbox is not (yet) a big project, so releasing per
> module will cost too much.
> We can have module definitions and numbering that permit separate
> releases in the future even if we do not do that for the moment.
> 
> Guillaume
> On 29 March 2013 at 15:35, "timo.boehme@ontochem.com" <ti...@ontochem.com>
> wrote:
> 
>> Hi,
>> 
>> I think that doing a release is quite a bit of work and having multiple
>> modules with separate releases each requires extra time. As long as there
>> are no module specific maintainers with responsibilities for releases we
>> should do releases with the complete module set. This also prevents
>> problems with incompatibilities between the modules.
>> 
>> BR
>> Timo
>> 
>>> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 29 March 2013 at 14:15:
>>> On 29.03.2013 at 13:18, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>>> <SNIP>
>>>>> One thing to consider is how we handle releases afterwards. Will we
>>>>> always release all modules as part of a release (like Apache Camel
>>>>> does) or do releases separately (as Apache Sling does)?
>>>> That's a good point, but it'll depend on the details. AFAIK Sling is
>>>> OSGI based so that all components should be independent, which makes
>>>> it easier to release them separately.
>>> 
>>> Correct, Sling is OSGI based. But Apache Camel also has a core component
>>> on which others are based. And they had a similar discussion. I don't
>>> think it's a technical question: if we go for modules, APIs should stay
>>> stable within minor releases, so e.g. PDFReader could count on PDFParser.
>>> But as a start, why don't we release all modules together and revisit
>>> that question later?
>>> 
>>>>> I'm happy to help with implementation/rearrangement as soon as the
>>>>> transition to the CMS is done
>>>> Cool!
>>>> 
>>>> BR
>>>> Andreas Lehmkühler
>>> 
>>> BR
>>> Maruan Sahyoun
>>

Re: [PDFBox 2.0] Ideas

Posted by Guillaume Bailleul <gb...@gmail.com>.

Hi all,

I agree with Timo: pdfbox is not (yet) a big project, so releasing per
module will cost too much.
We can have module definitions and numbering that permit separate
releases in the future even if we do not do that for the moment.

Guillaume
On 29 March 2013 at 15:35, "timo.boehme@ontochem.com" <ti...@ontochem.com>
wrote:

> Hi,
>
> I think that doing a release is quite a bit of work and having multiple
> modules with separate releases each requires extra time. As long as there
> are no module specific maintainers with responsibilities for releases we
> should do releases with the complete module set. This also prevents
> problems with incompatibilities between the modules.
>
> BR
> Timo
>
> > Maruan Sahyoun <sa...@fileaffairs.de> wrote on 29 March 2013 at 14:15:
> > On 29.03.2013 at 13:18, Andreas Lehmkuehler <an...@lehmi.de> wrote:
> > <SNIP>
> > > > One thing to consider is how we handle releases afterwards. Will we
> > > > always release all modules as part of a release (like Apache Camel
> > > > does) or do releases separately (as Apache Sling does)?
> > > That's a good point, but it'll depend on the details. AFAIK Sling is
> > > OSGI based so that all components should be independent, which makes
> > > it easier to release them separately.
> >
> > Correct, Sling is OSGI based. But Apache Camel also has a core component
> > on which others are based. And they had a similar discussion. I don't
> > think it's a technical question: if we go for modules, APIs should stay
> > stable within minor releases, so e.g. PDFReader could count on PDFParser.
> > But as a start, why don't we release all modules together and revisit
> > that question later?
> >
> > >> I'm happy to help with implementation/rearrangement as soon as the
> > >> transition to the CMS is done
> > > Cool!
> > >
> > > BR
> > > Andreas Lehmkühler
> >
> > BR
> > Maruan Sahyoun
>

Re: [PDFBox 2.0] Ideas

Posted by "timo.boehme@ontochem.com" <ti...@ontochem.com>.

Hi,

I think that doing a release is quite a bit of work and having multiple modules
with separate releases each requires extra time. As long as there are no module
specific maintainers with responsibilities for releases we should do releases
with the complete module set. This also prevents problems with incompatibilities
between the modules.

BR
Timo

> Maruan Sahyoun <sa...@fileaffairs.de> wrote on 29 March 2013 at 14:15:
> On 29.03.2013 at 13:18, Andreas Lehmkuehler <an...@lehmi.de> wrote:
> <SNIP>
> > > One thing to consider is how we handle releases afterwards. Will we always
> > > release all modules as part of a release (like Apache Camel does) or do
> > > releases separately (as Apache Sling does)?
> > That's a good point, but it'll depend on the details. AFAIK Sling is OSGI
> > based so that all components should be independent, which makes it easier
> > to release them separately.
>
> Correct, Sling is OSGI based. But Apache Camel also has a core component on
> which others are based. And they had a similar discussion. I don't think it's
> a technical question: if we go for modules, APIs should stay stable within
> minor releases, so e.g. PDFReader could count on PDFParser. But as a start,
> why don't we release all modules together and revisit that question later?
>
> >> I'm happy to help with implementation/rearrangement as soon as the
> >> transition to the CMS is done
> > Cool!
> >
> > BR
> > Andreas Lehmkühler
>
> BR
> Maruan Sahyoun

Re: [PDFBox 2.0] Ideas

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

Maruan Sahyoun

On 29.03.2013 at 13:18, Andreas Lehmkuehler <an...@lehmi.de> wrote:

<SNIP>

>> One thing to consider is how we handle releases afterwards. Will we always release
>> all modules as part of a release (like Apache Camel does) or do releases
>> separately (as Apache Sling does)?
> That's a good point, but it'll depend on the details. AFAIK Sling is OSGI based
> so that all components should be independent, which makes it easier to release
> them separately.
> 

Correct, Sling is OSGI based. But Apache Camel also has a core component on which others are based. And they had a similar discussion. I don't think it's a technical question: if we go for modules, APIs should stay stable within minor releases, so e.g. PDFReader could count on PDFParser. But as a start, why don't we release all modules together and revisit that question later?

>> I'm happy to help with implementation/rearrangement as soon as the transition to the CMS is done
> Cool!
> 
> 
> BR
> Andreas Lehmkühler
> 

BR
Maruan Sahyoun

Re: [PDFBox 2.0] Ideas

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 29.03.2013 at 12:53, Maruan Sahyoun wrote:

<SNIP>

>>> I think some wiki pages would be good. This discussion has already
>>> started, but as mails on the list or maybe Jira tickets that get lost
>>> in the flow.
>>>
>>> There is an Apache wiki [1], but I found nothing on PDFBox there, so
>>> this is a good occasion to start.
>> Once we have migrated our site to the Apache CMS we'll have some sort of wiki, so
>> IMHO we don't have to ask for another one.
>>
>>> I do not have many more ideas. In my opinion, having different
>>> modules for PDF parsers, PDF makers and PDF viewers is an important
>>> one.
>> This is one of my favourites, too. Let's see what'll come up. After all, we don't
>> only need people who are interested in some features but also people interested in implementing them ;-)
>
> We might be able to split into modules based on the current code and rearchitect
> the individual parts later. E.g. the command line tools could easily be separated,
> as well as PDFDebugger and PDFReader.
Yes, there are some easy ones and others will be more complicated

> One thing to consider is how we handle releases afterwards. Will we always release
> all modules as part of a release (like Apache Camel does) or do releases
> separately (as Apache Sling does)?
That's a good point, but it'll depend on the details. AFAIK Sling is OSGI based
so that all components should be independent, which makes it easier to release
them separately.

> I'm happy to help with implementation/rearrangement as soon as the transition to the CMS is done
Cool!


BR
Andreas Lehmkühler

Re: [PDFBox 2.0] Ideas

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi,

On 29.03.2013 at 12:27, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> Hi,
> 
> On 28.03.2013 at 21:04, Guillaume Bailleul wrote:
>> On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>>> Hi there,
>>> 
>>> here is a rough summary of some ideas I have for a potential pdfbox 2.0 release. Maybe we could capture these as part of a wiki or jira ticket so we can add and agree on some of these if we want to. As soon as we have agreement we could have individual tickets for them.
>>> 
>>> WDYT?
>>> 
>>> 
>>> # rearchitect PDF parsing into a lexer, an incremental (non-caching) parser and a caching parser
>>> o the lexer would be the low-level component delivering tokens to the parser. A sample implementation exists as part of PDFBOX-1000. The benefit would be clean low-level handling of tokens. Although I proposed the lexer, I'm not totally happy with the current implementation. That's something for another mail/ticket ...
>>> o the incremental (non-caching) parser would allow page-by-page, forward-only processing to support text extraction, merging, splitting … The benefit would be lower memory consumption as well as potentially faster processing
>>> o the caching parser would support applications such as PDFDebugger or PDFReader
>>> 
>>> # handling of pdf versions
>>> the current implementation is a mix of PDF 1.4 and some ad hoc additions, without a clear distinction between what is and is not supported. We could add some support for explicitly handling versions in pdfbox, e.g. by marking certain methods and properties with their PDF version support level. This could in addition be a good basis for PDF/A and other compliance checks.
>>> 
>>> # handle large pdf files
>>> apart from the parsing itself, pdfbox does not always handle large pdf files well, because some of the references are implemented as int instead of long
>>> 
>>> # split pdfbox into modules to support use cases such as text extraction and merge with the minimum number of classes needed. More app-like tools such as the PDFDebugger or PDFReader could be additional modules.
>>> 
>>> With kind regards
>>> 
>>> 
>>> Maruan Sahyoun
>>> 
>> 
>> Hi Maruan,
>> 
>> I think some wiki pages would be good. This discussion has already
>> started, but as mails on the list or maybe Jira tickets that get lost
>> in the flow.
>> 
>> There is an Apache wiki [1], but I found nothing on PDFBox there, so
>> this is a good occasion to start.
> Once we have migrated our site to the Apache CMS we'll have some sort of wiki, so
> IMHO we don't have to ask for another one.
> 
>> I do not have many more ideas. In my opinion, having different
>> modules for PDF parsers, PDF makers and PDF viewers is an important
>> one.
> This is one of my favourites, too. Let's see what'll come up. After all, we don't
> only need people who are interested in some features but also people interested in implementing them ;-)

We might be able to split into modules based on the current code and rearchitect the individual parts later. E.g. the command line tools could easily be separated, as well as PDFDebugger and PDFReader. One thing to consider is how we handle releases afterwards. Will we always release all modules as part of a release (like Apache Camel does) or do releases separately (as Apache Sling does)?

I'm happy to help with implementation/rearrangement as soon as the transition to the CMS is done

> 
> 
>> [1] http://wiki.apache.org/general/
>> 
>> Guillaume Bailleul
> 
> BR
> Andreas Lehmkühler
>

Re: [PDFBox 2.0] Ideas

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

On 28.03.2013 at 21:04, Guillaume Bailleul wrote:
> On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>> Hi there,
>>
>> here is a rough summary of some ideas I have for a potential pdfbox 2.0 release. Maybe we could capture these as part of a wiki or jira ticket so we can add and agree on some of these if we want to. As soon as we have agreement we could have individual tickets for them.
>>
>> WDYT?
>>
>>
>> # rearchitect PDF parsing into a lexer, an incremental (non-caching) parser and a caching parser
>> o the lexer would be the low-level component delivering tokens to the parser. A sample implementation exists as part of PDFBOX-1000. The benefit would be clean low-level handling of tokens. Although I proposed the lexer, I'm not totally happy with the current implementation. That's something for another mail/ticket ...
>> o the incremental (non-caching) parser would allow page-by-page, forward-only processing to support text extraction, merging, splitting … The benefit would be lower memory consumption as well as potentially faster processing
>> o the caching parser would support applications such as PDFDebugger or PDFReader
>>
>> # handling of pdf versions
>> the current implementation is a mix of PDF 1.4 and some ad hoc additions, without a clear distinction between what is and is not supported. We could add some support for explicitly handling versions in pdfbox, e.g. by marking certain methods and properties with their PDF version support level. This could in addition be a good basis for PDF/A and other compliance checks.
>>
>> # handle large pdf files
>> apart from the parsing itself, pdfbox does not always handle large pdf files well, because some of the references are implemented as int instead of long
>>
>> # split pdfbox into modules to support use cases such as text extraction and merge with the minimum number of classes needed. More app-like tools such as the PDFDebugger or PDFReader could be additional modules.
>>
>> With kind regards
>>
>>
>> Maruan Sahyoun
>>
>
> Hi Maruan,
>
> I think some wiki pages would be good. This discussion has already
> started, but as mails on the list or maybe Jira tickets that get lost
> in the flow.
>
> There is an Apache wiki [1], but I found nothing on PDFBox there, so
> this is a good occasion to start.
Once we have migrated our site to the Apache CMS we'll have some sort of wiki, so
IMHO we don't have to ask for another one.

> I do not have many more ideas. In my opinion, having different
> modules for PDF parsers, PDF makers and PDF viewers is an important
> one.
This is one of my favourites, too. Let's see what'll come up. After all, we don't
only need people who are interested in some features but also people interested in implementing them ;-)


> [1] http://wiki.apache.org/general/
>
> Guillaume Bailleul

BR
Andreas Lehmkühler

Re: [PDFBox 2.0] Ideas

Posted by Guillaume Bailleul <gb...@gmail.com>.

On Tue, Mar 26, 2013 at 2:21 PM, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> Hi there,
>
> here is a rough summary of some ideas I have for a potential pdfbox 2.0 release. Maybe we could capture these as part of a wiki or jira ticket so we can add and agree on some of these if we want to. As soon as we have agreement we could have individual tickets for them.
>
> WDYT?
>
>
> # rearchitect PDF parsing into a lexer, an incremental (non-caching) parser and a caching parser
> o the lexer would be the low-level component delivering tokens to the parser. A sample implementation exists as part of PDFBOX-1000. The benefit would be clean low-level handling of tokens. Although I proposed the lexer, I'm not totally happy with the current implementation. That's something for another mail/ticket ...
> o the incremental (non-caching) parser would allow page-by-page, forward-only processing to support text extraction, merging, splitting … The benefit would be lower memory consumption as well as potentially faster processing
> o the caching parser would support applications such as PDFDebugger or PDFReader
>
> # handling of pdf versions
> the current implementation is a mix of PDF 1.4 and some ad hoc additions, without a clear distinction between what is and is not supported. We could add some support for explicitly handling versions in pdfbox, e.g. by marking certain methods and properties with their PDF version support level. This could in addition be a good basis for PDF/A and other compliance checks.
>
> # handle large pdf files
> apart from the parsing itself, pdfbox does not always handle large pdf files well, because some of the references are implemented as int instead of long
>
> # split pdfbox into modules to support use cases such as text extraction and merge with the minimum number of classes needed. More app-like tools such as the PDFDebugger or PDFReader could be additional modules.
>
> With kind regards
>
>
> Maruan Sahyoun
>

Hi Maruan,

I think some wiki pages would be good. This discussion has already
started, but as mails on the list or maybe Jira tickets that get lost
in the flow.

There is an Apache wiki [1], but I found nothing on PDFBox there, so
this is a good occasion to start.

I do not have many more ideas. In my opinion, having different
modules for PDF parsers, PDF makers and PDF viewers is an important
one.



[1] http://wiki.apache.org/general/

Guillaume Bailleul