Posted to dev@pdfbox.apache.org by Maruan Sahyoun <sa...@fileaffairs.de> on 2014/05/29 09:39:48 UTC

Enhancements to PDFBox

Hi,

for a current project I need to work on enhancing PDFBox for

# splitting files (e.g. remove no longer needed resources)
# merging files (e.g. avoid duplicating resources)
# page handling (adding/removing individual pages with resource handling)
# enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)

Is someone else working on something similar?

BR

Maruan

Re: Enhancements to PDFBox

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
On 29.05.2014 at 14:31, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> On 29.05.2014 14:20, Maruan Sahyoun wrote:
>> Hi,
>> 
>> On 29.05.2014 at 13:57, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>> 
>>> On 29.05.2014 09:39, Maruan Sahyoun wrote:
>>>> Hi,
>>>> 
>>>> for a current project I need to work on enhancing PDFBox for
>>>> 
>>>> # splitting files (e.g. remove no longer needed resources)
>>> I had a quick look some time ago, hoping that it would be easy to just remove unneeded stuff, but it isn't (maybe I didn't get it yet). In most cases resources are deleted in combination with the page they belong to. The bigger issue is annotations referring to pages. Those pages, including their resources, aren't removed when the pages are removed because of the reference in the annotation dictionary.
>>>> # merging files (e.g. avoid duplicating resources)
>>> That only makes sense if the PDFs to be merged use similar resources.
>>> 
>>>> # page handling (adding/removing individual pages with resource handling)
>>> This should be a by-product of #1 and #2
>>> 
>>>> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
>>> This seems to be an important feature not only for you. So it would be nice if someone could improve that.
>>> 
>> 
>> I already have filling an XFA form ready with some limitations (PDXFA’s COS has to be an array, dataset entry must be present … ). Could put it in if someone is interested in the current stage but planned to remove some limitations first. I’m not totally sure if that should be part of PDXFA or a Filler tool as this will introduce some dependency on XML handling.
>> Preferences?
> Hmm, maybe it would be a good idea to put that stuff in a separate module, so that it could be added/discarded on demand.

OK - will do.

> 
>>>> Is someone else working on something similar?
>>> My current todo list is already quite long and maybe #1 and #2 are on it, but I'm afraid at a lower position. But I'm happy to help if someone wants to implement some of those features.
>> 
>> I will be working on #1 and #2 (at least to a degree which is needed for the project). If we could get some ideas together and you could help me - based on your past experience and knowledge of the code base - to get this started this would be great.
> Yes, of course.
> 
>>>> BR
>>>> 
>>>> Maruan
>>> 
> 
> BR
> Andreas Lehmkühler


Re: Enhancements to PDFBox

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 29.05.2014 14:20, Maruan Sahyoun wrote:
> Hi,
>
> On 29.05.2014 at 13:57, Andreas Lehmkuehler <an...@lehmi.de> wrote:
>
>> On 29.05.2014 09:39, Maruan Sahyoun wrote:
>>> Hi,
>>>
>>> for a current project I need to work on enhancing PDFBox for
>>>
>>> # splitting files (e.g. remove no longer needed resources)
>> I had a quick look some time ago, hoping that it would be easy to just remove unneeded stuff, but it isn't (maybe I didn't get it yet). In most cases resources are deleted in combination with the page they belong to. The bigger issue is annotations referring to pages. Those pages, including their resources, aren't removed when the pages are removed because of the reference in the annotation dictionary.
>>> # merging files (e.g. avoid duplicating resources)
>> That only makes sense if the PDFs to be merged use similar resources.
>>
>>> # page handling (adding/removing individual pages with resource handling)
>> This should be a by-product of #1 and #2
>>
>>> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
>> This seems to be an important feature not only for you. So it would be nice if someone could improve that.
>>
>
> I already have filling an XFA form ready with some limitations (PDXFA’s COS has to be an array, dataset entry must be present … ). Could put it in if someone is interested in the current stage but planned to remove some limitations first. I’m not totally sure if that should be part of PDXFA or a Filler tool as this will introduce some dependency on XML handling.
> Preferences?
Hmm, maybe it would be a good idea to put that stuff in a separate module, so
that it could be added/discarded on demand.

>>> Is someone else working on something similar?
>> My current todo list is already quite long and maybe #1 and #2 are on it, but I'm afraid at a lower position. But I'm happy to help if someone wants to implement some of those features.
>
> I will be working on #1 and #2 (at least to a degree which is needed for the project). If we could get some ideas together and you could help me - based on your past experience and knowledge of the code base - to get this started this would be great.
Yes, of course.

>>> BR
>>>
>>> Maruan
>>

BR
Andreas Lehmkühler


Re: Enhancements to PDFBox

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

On 29.05.2014 at 13:57, Andreas Lehmkuehler <an...@lehmi.de> wrote:

> On 29.05.2014 09:39, Maruan Sahyoun wrote:
>> Hi,
>> 
>> for a current project I need to work on enhancing PDFBox for
>> 
>> # splitting files (e.g. remove no longer needed resources)
> I had a quick look some time ago, hoping that it would be easy to just remove unneeded stuff, but it isn't (maybe I didn't get it yet). In most cases resources are deleted in combination with the page they belong to. The bigger issue is annotations referring to pages. Those pages, including their resources, aren't removed when the pages are removed because of the reference in the annotation dictionary.
>> # merging files (e.g. avoid duplicating resources)
> That only makes sense if the PDFs to be merged use similar resources.
> 
>> # page handling (adding/removing individual pages with resource handling)
> This should be a by-product of #1 and #2
> 
>> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
> This seems to be an important feature not only for you. So it would be nice if someone could improve that.
> 

I already have filling an XFA form ready with some limitations (PDXFA’s COS has to be an array, dataset entry must be present … ). Could put it in if someone is interested in the current stage but planned to remove some limitations first. I’m not totally sure if that should be part of PDXFA or a Filler tool as this will introduce some dependency on XML handling. 
Preferences?

>> Is someone else working on something similar?
> My current todo list is already quite long and maybe #1 and #2 are on it, but I'm afraid at a lower position. But I'm happy to help if someone wants to implement some of those features.

I will be working on #1 and #2 (at least to a degree which is needed for the project). If we could get some ideas together and you could help me - based on your past experience and knowledge of the code base - to get this started this would be great. 

> 
> 
>> BR
>> 
>> Maruan
> 
> BR
> Andreas Lehmkühler
> 


Re: Enhancements to PDFBox

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 29.05.2014 09:39, Maruan Sahyoun wrote:
> Hi,
>
> for a current project I need to work on enhancing PDFBox for
>
> # splitting files (e.g. remove no longer needed resources)
I had a quick look some time ago, hoping that it would be easy to just remove
unneeded stuff, but it isn't (maybe I didn't get it yet). In most cases resources
are deleted in combination with the page they belong to. The bigger issue is
annotations referring to pages. Those pages, including their resources, aren't
removed when the pages are removed because of the reference in the annotation
dictionary.
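
The effect described above can be illustrated with a minimal sketch in plain Java. It does not use PDFBox: the document is modelled as a map from hypothetical object names ("catalog", "page2", "link", ...) to the names they reference, and the class name ReachabilityCheck is made up for this example. It only shows why a page dropped from the page tree can stay reachable through an annotation.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy reachability walk over an object graph; not a PDFBox API. */
final class ReachabilityCheck {

    /** Returns every object name that can be reached from root by following references. */
    static Set<String> reachable(Map<String, List<String>> refs, String root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(root);
        while (!todo.isEmpty()) {
            String current = todo.pop();
            if (seen.add(current)) { // visit each object only once
                for (String target : refs.getOrDefault(current, Collections.<String>emptyList())) {
                    todo.push(target);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("catalog", Arrays.asList("pages"));
        refs.put("pages", Arrays.asList("page1"));          // page2 was already removed from the tree
        refs.put("page1", Arrays.asList("font1", "link"));
        refs.put("link", Arrays.asList("page2"));           // link annotation on page1 still targets page2
        refs.put("page2", Arrays.asList("image2"));

        Set<String> live = reachable(refs, "catalog");
        // page2 and its image are still referenced, so a writer would keep them
        System.out.println("page2 still reachable: " + live.contains("page2"));
        System.out.println("image2 still reachable: " + live.contains("image2"));
    }
}
```

In this model the page only becomes collectable once the annotation's reference is removed or redirected, which is the extra step a splitting tool would need.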

> # merging files (e.g. avoid duplicating resources)
That only makes sense if the PDFs to be merged use similar resources.

> # page handling (adding/removing individual pages with resource handling)
This should be a by-product of #1 and #2

> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
This seems to be an important feature not only for you. So it would be nice if 
someone could improve that.

> Is someone else working on something similar?
My current todo list is already quite long and maybe #1 and #2 are on it, but I'm
afraid at a lower position. But I'm happy to help if someone wants to implement
some of those features.


> BR
>
> Maruan

BR
Andreas Lehmkühler


Re: Enhancements to PDFBox

Posted by John Hewson <jo...@jahewson.com>.
> It will involve a lot of COS processing. I haven’t decided yet if it will sit on top of COS or PD. Typically we do encourage people to use PD so I tend to start from there and dig down internally as needed. WDYT?

Starting with PD and using COS where needed sounds reasonable. Ultimately you don’t need a high-level API to do the manipulations which you’re interested in, so COS should suffice, but PD might be quicker to get started with.

-- John

On 29 May 2014, at 23:25, Maruan Sahyoun <sa...@fileaffairs.de> wrote:

> 
> On 29.05.2014 at 18:51, John Hewson <jo...@jahewson.com> wrote:
> 
>>> # splitting files (e.g. remove no longer needed resources)
>> 
>> Each page has its own Resources dictionary, so it shouldn't be too difficult. One thing to watch out for is the "page tree", which allows pages to inherit resources from their parent nodes; this is handled as PDPageNode but it's kind of messy.
> 
> thanks for the hint. Splitting and merging are somewhat similar, as splitting is typically done by creating a new document and importing the needed pages into it. With the current code this might lead to duplicate resources.
> 
>> 
>>> # merging files (e.g. avoid duplicating resources)
>> 
>> Sounds like the files are pretty similar, is this actually an overlay? Or are you wanting to insert entire pages?
> 
> it’s merging individual files together inserting entire pages. Although the files are created individually they share some common elements like company logos or fonts. 
> 
>> 
>> I imagine you probably want to implement both these features at the COS level rather than the PD level, as it's pretty low-level processing.
>> 
> 
> It will involve a lot of COS processing. I haven’t decided yet if it will sit on top of COS or PD. Typically we do encourage people to use PD so I tend to start from there and dig down internally as needed. WDYT?
> 
> 
>> -- John
>> 
>>> On 29 May 2014, at 00:39, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>>> 
>>> Hi,
>>> 
>>> for a current project I need to work on enhancing PDFBox for
>>> 
>>> # splitting files (e.g. remove no longer needed resources)
>>> # merging files (e.g. avoid duplicating resources)
>>> # page handling (adding/removing individual pages with resource handling)
>>> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
>>> 
>>> Is someone else working on something similar?
>>> 
>>> BR
>>> 
>>> Maruan
> 


Re: Enhancements to PDFBox

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
On 29.05.2014 at 18:51, John Hewson <jo...@jahewson.com> wrote:

>> # splitting files (e.g. remove no longer needed resources)
> 
> Each page has its own Resources dictionary, so it shouldn't be too difficult. One thing to watch out for is the "page tree", which allows pages to inherit resources from their parent nodes; this is handled as PDPageNode but it's kind of messy.

thanks for the hint. Splitting and merging are somewhat similar, as splitting is typically done by creating a new document and importing the needed pages into it. With the current code this might lead to duplicate resources.

> 
>> # merging files (e.g. avoid duplicating resources)
> 
> Sounds like the files are pretty similar, is this actually an overlay? Or are you wanting to insert entire pages?

it’s merging individual files together inserting entire pages. Although the files are created individually they share some common elements like company logos or fonts. 

> 
> I imagine you probably want to implement both these features at the COS level rather than the PD level, as it's pretty low-level processing.
> 

It will involve a lot of COS processing. I haven’t decided yet if it will sit on top of COS or PD. Typically we do encourage people to use PD so I tend to start from there and dig down internally as needed. WDYT?


> -- John
> 
>> On 29 May 2014, at 00:39, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>> 
>> Hi,
>> 
>> for a current project I need to work on enhancing PDFBox for
>> 
>> # splitting files (e.g. remove no longer needed resources)
>> # merging files (e.g. avoid duplicating resources)
>> # page handling (adding/removing individual pages with resource handling)
>> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
>> 
>> Is someone else working on something similar?
>> 
>> BR
>> 
>> Maruan


Re: Enhancements to PDFBox

Posted by John Hewson <jo...@jahewson.com>.
> # splitting files (e.g. remove no longer needed resources)

Each page has its own Resources dictionary, so it shouldn't be too difficult. One thing to watch out for is the "page tree", which allows pages to inherit resources from their parent nodes; this is handled as PDPageNode but it's kind of messy.
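
A rough sketch of the inheritance issue, in plain Java and without PDFBox. PageTreeNode here is a hypothetical toy class, not PDFBox's PDPageNode, and resources are reduced to a simple name-to-value map. The assumption it encodes is the PDF rule that a page without its own Resources entry uses the nearest ancestor's entry, so a page must get an explicit copy before it is moved to another document.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a page tree node; hypothetical, not PDFBox's PDPageNode. */
final class PageTreeNode {
    final PageTreeNode parent;            // null for the root of the page tree
    final Map<String, String> resources;  // null means "no own Resources entry"

    PageTreeNode(PageTreeNode parent, Map<String, String> resources) {
        this.parent = parent;
        this.resources = resources;
    }

    /** Walks up the tree and returns the first Resources entry found on the way. */
    static Map<String, String> effectiveResources(PageTreeNode node) {
        for (PageTreeNode n = node; n != null; n = n.parent) {
            if (n.resources != null) {
                return n.resources;
            }
        }
        return new HashMap<>(); // nothing defined anywhere on the path to the root
    }

    /** Returns a parentless copy of the page with its inherited resources made explicit. */
    static PageTreeNode detach(PageTreeNode page) {
        return new PageTreeNode(null, new HashMap<>(effectiveResources(page)));
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        shared.put("F1", "Helvetica");
        PageTreeNode root = new PageTreeNode(null, shared);
        PageTreeNode page = new PageTreeNode(root, null); // relies on the root's resources

        PageTreeNode standalone = detach(page);
        // the detached copy now carries the font entry itself
        System.out.println(standalone.resources);
    }
}
```

Copying the effective resources before splitting keeps the page self-contained, at the cost of carrying entries the page may never use, which is where the "remove no longer needed resources" step would come in.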

> # merging files (e.g. avoid duplicating resources)

Sounds like the files are pretty similar, is this actually an overlay? Or are you wanting to insert entire pages?

I imagine you probably want to implement both these features at the COS level rather than the PD level, as it's pretty low-level processing.

-- John

> On 29 May 2014, at 00:39, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> 
> Hi,
> 
> for a current project I need to work on enhancing PDFBox for
> 
> # splitting files (e.g. remove no longer needed resources)
> # merging files (e.g. avoid duplicating resources)
> # page handling (adding/removing individual pages with resource handling)
> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
> 
> Is someone else working on something similar?
> 
> BR
> 
> Maruan

Re: Enhancements to PDFBox

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Simon,

thanks for the pointer - very useful.

BR
Maruan

On 29.05.2014 at 12:06, Simon Steiner <si...@gmail.com> wrote:

> Hi,
> 
> I worked on merging fonts in pdfs in fop using pdfbox
> https://issues.apache.org/jira/browse/FOP-2302
> 
> Thanks
> 
> -----Original Message-----
> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
> Sent: 29 May 2014 08:40
> To: dev@pdfbox.apache.org
> Subject: Enhancements to PDFBox
> 
> Hi,
> 
> for a current project I need to work on enhancing PDFBox for
> 
> # splitting files (e.g. remove no longer needed resources)
> # merging files (e.g. avoid duplicating resources)
> # page handling (adding/removing individual pages with resource handling)
> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
> 
> Is someone else working on something similar?
> 
> BR
> 
> Maruan
> 


RE: Enhancements to PDFBox

Posted by Simon Steiner <si...@gmail.com>.
Hi,

I worked on merging fonts in pdfs in fop using pdfbox
https://issues.apache.org/jira/browse/FOP-2302

Thanks

-----Original Message-----
From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
Sent: 29 May 2014 08:40
To: dev@pdfbox.apache.org
Subject: Enhancements to PDFBox

Hi,

for a current project I need to work on enhancing PDFBox for

# splitting files (e.g. remove no longer needed resources)
# merging files (e.g. avoid duplicating resources)
# page handling (adding/removing individual pages with resource handling)
# enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)

Is someone else working on something similar?

BR

Maruan


Re: Enhancements to PDFBox

Posted by Tilman Hausherr <TH...@t-online.de>.
No I'm not. Sounds like a lot of work.

- avoid duplicating resources: It would probably mean comparing
dictionaries and contents, and rearranging stuff, i.e. pointing to the
correct object
- remove no longer needed resources: sounds like an "orphan check", i.e. 
decode all streams to see which resources are used
- the third one ("with resource handling") seems derived from the first two
- no idea about the 4th one
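
The first point (comparing contents and pointing to the correct object) could look roughly like the following sketch in plain Java. It does not touch PDFBox: objects are modelled as a map from a made-up object id to raw bytes, and ResourceDeduplicator is a hypothetical name. Using a SHA-256 digest as the comparison key is an assumption chosen for the example, not something proposed in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy duplicate detection by content digest; not a PDFBox API. */
final class ResourceDeduplicator {

    /** Hex-encoded SHA-256 digest of the given bytes. */
    static String digest(byte[] content) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 must be available on every Java platform
        }
    }

    /**
     * Maps each object id to the id of the first object (in iteration order)
     * that has the same content, so references can be redirected to it.
     */
    static Map<Integer, Integer> canonicalIds(Map<Integer, byte[]> objects) {
        Map<String, Integer> firstSeen = new HashMap<>();
        Map<Integer, Integer> remap = new LinkedHashMap<>();
        for (Map.Entry<Integer, byte[]> e : objects.entrySet()) {
            Integer earlier = firstSeen.putIfAbsent(digest(e.getValue()), e.getKey());
            remap.put(e.getKey(), earlier == null ? e.getKey() : earlier);
        }
        return remap;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so "first object" is well defined
        Map<Integer, byte[]> objects = new LinkedHashMap<>();
        objects.put(1, "company logo".getBytes(StandardCharsets.UTF_8));
        objects.put(2, "font program".getBytes(StandardCharsets.UTF_8));
        objects.put(3, "company logo".getBytes(StandardCharsets.UTF_8)); // same bytes as object 1

        // object 3 maps to object 1; objects 1 and 2 map to themselves
        System.out.println(canonicalIds(objects));
    }
}
```

A real implementation would probably confirm a digest match with a byte-for-byte comparison and would have to compare the dictionaries as well as the stream contents, as noted above.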

Tilman

On 29.05.2014 09:39, Maruan Sahyoun wrote:
> Hi,
>
> for a current project I need to work on enhancing PDFBox for
>
> # splitting files (e.g. remove no longer needed resources)
> # merging files (e.g. avoid duplicating resources)
> # page handling (adding/removing individual pages with resource handling)
> # enhancements to forms handling (pre fill XFA forms - partially done, enhancing AP generation)
>
> Is someone else working on something similar?
>
> BR
>
> Maruan