Posted to users@pdfbox.apache.org by Gabriel Pessoa <ga...@bry.com.br> on 2017/04/20 13:01:03 UTC

Memory Consumption in PDDocument.load

Hello.

Recently at our company we started to worry about how much memory was
being used during our PDF signing process. We are currently using
1.8.13, mostly because the loading time got longer in 2.0.x (I asked
about it some six months ago and Tilman explained the reason why).

I think this question on Stack Overflow cleared up some doubts I had
about how PDFBox works:
http://stackoverflow.com/questions/22340674/performance-itext-vs-pdfbox

The main point being: PDFBox parses ALL the objects in the PDF and
keeps them loaded. So, complex objects will use a lot of memory. Am I
correct?

If that is the case, I understand that it is necessary for PDF
manipulation, but is it necessary for PDF signing? Looking at the
structure of a signed PDF, it looks like only the Root entry (to update
the AcroForm entry) and the signed page's entry (to update the Annots
entry) are really needed for signing.

So would I be too wrong in suggesting a new load method that would be
used only for signing, and that would load only those necessary entries
and not things like images, fonts, tables, etc.?

If not that, could something akin to "lazy loading" be done, with the
PDF objects only being parsed and loaded when they are accessed? The
load would only map all the references in that case.
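
A minimal sketch of that lazy-loading idea, in plain Java and without
any PDFBox API. LazyObjectTable, its parser callback, and the offsets
are hypothetical names and values chosen for illustration; this is not
how PDFBox is implemented. The table only records where each object
lives and parses an object the first time it is requested:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical sketch of on-demand object loading; not a PDFBox class. */
class LazyObjectTable {
    // Object number -> byte offset, as a cross-reference table would provide.
    private final Map<Integer, Long> offsets;
    // Objects that have actually been parsed so far.
    private final Map<Integer, String> parsed = new HashMap<>();
    // Stand-in for a real parser: turns a byte offset into an object.
    private final LongFunction<String> parser;
    private int parseCount = 0;

    LazyObjectTable(Map<Integer, Long> offsets, LongFunction<String> parser) {
        this.offsets = new HashMap<>(offsets);
        this.parser = parser;
    }

    /** Returns the object, parsing it only on first access. */
    String get(int objectNumber) {
        String cached = parsed.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            throw new IllegalArgumentException("Unknown object " + objectNumber);
        }
        String value = parser.apply(offset);
        parseCount++;
        parsed.put(objectNumber, value);
        return value;
    }

    int parseCount() {
        return parseCount;
    }

    int knownObjects() {
        return offsets.size();
    }
}

class LazyObjectTableDemo {
    public static void main(String[] args) {
        Map<Integer, Long> xref = new HashMap<>();
        xref.put(1, 15L);   // e.g. the catalog
        xref.put(2, 120L);  // e.g. a page
        xref.put(3, 4096L); // e.g. a large image that signing never touches
        LazyObjectTable table = new LazyObjectTable(xref, offset -> "object@" + offset);
        table.get(1);
        table.get(1); // served from the cache, no second parse
        System.out.println(table.parseCount() + " of " + table.knownObjects() + " objects parsed");
    }
}
```

With this shape, memory use follows the objects that are actually
touched rather than the size of the whole file, at the cost of keeping
the source open for random access.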

If either of those two options is possible but you don't have anyone
currently available to work on it, I could try to develop that solution.
I would only need to know whether it would be better to use the 2.0.6
branch or the 3.0.0 trunk.

Thank you very much for your time.

-- 
Best regards,

Gabriel Pessoa
Analyst
BRy Tecnologia
Rua Lauro Linhares, 2123 Torre B - 3º andar
88036-002 - Florianópolis - SC - Brasil
+55 (48) 3234 6696


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Memory Consumption in PDDocument.load

Posted by Tilman Hausherr <TH...@t-online.de>.

On 20.04.2017 at 19:46, Gabriel Pessoa wrote:
> Thank you very much for your reply, Tilman.
>
> Is there any estimate of when that new parser will be available? I
> ask because if that would still take some time, I think my company
> will want me to try to make the basic parser I mentioned as my first
> suggestion in a local fork of PDFBox until there is an official
> release.

I don't know. We all work in our free time, so nobody has time targets.

Tilman

Re: Memory Consumption in PDDocument.load

Posted by Gabriel Pessoa <ga...@bry.com.br>.

Thank you very much for your reply, Tilman.

Is there any estimate of when that new parser will be available? I ask
because if that would still take some time, I think my company will want
me to try to make the basic parser I mentioned as my first suggestion in
a local fork of PDFBox until there is an official release.

Re: Memory Consumption in PDDocument.load

Posted by Tilman Hausherr <TH...@t-online.de>.

On 20.04.2017 at 15:01, Gabriel Pessoa wrote:
> Hello.
>
> Recently at our company we started to worry about how much memory was
> being used during our PDF signing process. We are currently using
> 1.8.13, mostly because the loading time got longer in 2.0.x (I asked
> about it some six months ago and Tilman explained the reason why).
>
> I think this question on Stack Overflow cleared up some doubts I had
> about how PDFBox works:
> http://stackoverflow.com/questions/22340674/performance-itext-vs-pdfbox
>
> The main point being: PDFBox parses ALL the objects in the PDF and
> keeps them loaded. So, complex objects will use a lot of memory. Am I
> correct?

Yes

>
> If that is the case, I understand that it is necessary for PDF
> manipulation, but is it necessary for PDF signing? Looking at the
> structure of a signed PDF, it looks like only the Root entry (to
> update the AcroForm entry) and the signed page's entry (to update the
> Annots entry) are really needed for signing.

And the acroform field tree
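
To make the set of objects discussed here concrete, the following is a
rough sketch in plain Java, not PDFBox code. It models a PDF as a map of
object numbers to dictionaries and collects only the catalog, the
AcroForm dictionary, the field tree below it, and the signed page.
ObjRef, SigningObjectSet, the object numbers, and the list of followed
keys are all assumptions made for illustration, and the sketch assumes
the page's Annots array is stored directly in the page dictionary:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for an indirect reference such as "5 0 R". */
class ObjRef {
    final int number;

    ObjRef(int number) {
        this.number = number;
    }
}

class SigningObjectSet {
    // Only these keys are followed (assumption): the form and its field tree.
    private static final Set<String> FOLLOWED_KEYS =
            new HashSet<>(Arrays.asList("AcroForm", "Fields", "Kids"));

    /** Collects the object numbers reachable from the catalog and the signed page. */
    static Set<Integer> collect(Map<Integer, Map<String, Object>> objects,
                                int root, int signedPage) {
        Set<Integer> needed = new TreeSet<>();
        Deque<Integer> pending = new ArrayDeque<>();
        pending.push(root);
        pending.push(signedPage);
        while (!pending.isEmpty()) {
            int number = pending.pop();
            if (!needed.add(number)) {
                continue; // already visited
            }
            Map<String, Object> dict = objects.get(number);
            if (dict == null) {
                continue; // dangling reference, nothing to follow
            }
            for (String key : FOLLOWED_KEYS) {
                enqueue(dict.get(key), pending);
            }
        }
        return needed;
    }

    // Queues a single reference or every reference inside an array value.
    private static void enqueue(Object value, Deque<Integer> pending) {
        if (value instanceof ObjRef) {
            pending.push(((ObjRef) value).number);
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                enqueue(item, pending);
            }
        }
    }
}

class SigningObjectSetDemo {
    private static Map<String, Object> dict(Object... keysAndValues) {
        Map<String, Object> d = new HashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            d.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return d;
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Object>> objects = new HashMap<>();
        objects.put(1, dict("Pages", new ObjRef(2), "AcroForm", new ObjRef(5))); // catalog
        objects.put(2, dict("Kids", Arrays.asList(new ObjRef(3))));              // page tree
        objects.put(3, dict("Resources", new ObjRef(4)));                        // signed page
        objects.put(4, dict());                                                  // fonts, images
        objects.put(5, dict("Fields", Arrays.asList(new ObjRef(6))));            // AcroForm
        objects.put(6, dict("Kids", Arrays.asList(new ObjRef(7))));              // field
        objects.put(7, dict());                                                  // widget
        System.out.println(SigningObjectSet.collect(objects, 1, 3)); // prints [1, 3, 5, 6, 7]
    }
}
```

In this toy document the page tree node and the page resources are
never visited, which is the saving the original question is after.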

>
> So would I be too wrong in suggesting a new load method that would be
> used only for signing, and that would load only those necessary
> entries and not things like images, fonts, tables, etc.?
>
> If not that, could something akin to "lazy loading" be done, with the
> PDF objects only being parsed and loaded when they are accessed? The
> load would only map all the references in that case.
>
> If either of those two options is possible but you don't have anyone
> currently available to work on it, I could try to develop that
> solution. I would only need to know whether it would be better to use
> the 2.0.6 branch or the 3.0.0 trunk.
>
> Thank you very much for your time.
>

Andreas wrote that he's working on an on-demand parser.

Tilman