
Posted to users@pdfbox.apache.org by Maison Mo <mo...@yahoo.fr.INVALID> on 2022/07/27 14:10:40 UTC

High memory usage with pdfbox 3

Hello,
We parse random PDF files, some of which contain large images (5000x8000) with filters, and I noticed a regression in our CI with this test. This seems related to [PDFBOX-4836] Reduce the usage of ScatchFileBuffer when parsing a pdf (ASF JIRA)

and in particular this commit: PDFBOX-4836: don't use ScratchFile within COSInputStream any more (apache/pdfbox@6b9dd61)

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/trunk@1881870 13f79535-47bb-0310-9956-ffa450edef68

PDFBox 2 used a scratch file for this (heavy) processing; this is no longer the case, hence our OutOfMemoryError.

This is quite surprising, given that the PDDocument was opened with:

Loader.loadPDF( pdfInputStream, MemoryUsageSetting.setupTempFileOnly() );

Looking at the code, it seems that the InputStream is always read completely into memory by this Loader; is that correct? If so, what is the purpose of defining a MemoryUsageSetting if it is ignored in the lower layers?

This looks like a blocker for us: we need to cap PDFBox memory usage somehow. Is there a workaround for this?
Thank you in advance for your responses.

M.

Re: High memory usage with pdfbox 3

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Hi,

Thanks for the feedback on the upcoming new major version.

First of all, we changed a lot of stuff.

We targeted different goals. Let me name a couple of them:

* implement an on-demand parser
* support compressed object streams
* lower the overall usage of resources
* simplify the code to make it easier to maintain and hopefully 
attract more devs
* remove deprecated stuff
* lots of minor improvements/fixes
* ...

Some of those goals include others or depend on them.

I removed the usage of ScratchFileBuffer when reading a pdf because:
* it increased the complexity of the code
* we had some issues in the past with unclosed streams and/or unclosed buffers
* the on-demand parser doesn't load the whole pdf and in the long run the loaded 
resources shall be released if they aren't needed anymore
* it is always a bad idea to combine too many complex features in one code layer, 
such as parsing a complex file format and implementing some fancy caching. We can't 
solve every issue a user might have, especially those which are not part of our 
main goal.

BTW, the complete removal of ScratchFileBuffer is on my TODO list for 4.x

As I already said, we changed a lot. There are breaking changes and changes in 
the behaviour of the lib, and you are facing one of them.

The parser needs random access to the pdf and therefore an InputStream is always 
copied to something which provides random access.
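
To illustrate why random access matters, here is a minimal sketch that is not PDFBox code: a PDF stores a "startxref" keyword near the end of the file, followed by the byte offset of the cross-reference data, so a reader has to jump to the end first. The class name StartXrefFinder and the 1024-byte tail size are assumptions made for this example only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of PDFBox.
final class StartXrefFinder {
    private StartXrefFinder() {}

    /**
     * Returns the offset written after the last "startxref" keyword
     * in the final bytes of the file, or -1 if none is found there.
     */
    static long findStartXref(Path pdf) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf.toFile(), "r")) {
            // Assumption: the keyword sits within the last 1024 bytes.
            int tailSize = (int) Math.min(1024, raf.length());
            byte[] tail = new byte[tailSize];
            // Jump straight to the tail; a plain InputStream cannot do this.
            raf.seek(raf.length() - tailSize);
            raf.readFully(tail);

            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int idx = text.lastIndexOf("startxref");
            if (idx < 0) {
                return -1;
            }
            // What follows looks like "\n12345\n%%EOF"; keep the leading digits.
            String rest = text.substring(idx + "startxref".length()).trim();
            int end = 0;
            while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
                end++;
            }
            return end == 0 ? -1 : Long.parseLong(rest.substring(0, end));
        }
    }
}
```

With only a forward-reading InputStream, the whole stream would have to be consumed and buffered somewhere before this offset could be read and followed, which is why the data gets copied first.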

PDFBox 2.0.x copies the data of the InputStream to a file and/or to memory, 
depending on the MemoryUsageSetting.

PDFBox 3.0.0 always copies the data of the InputStream into memory. That might 
be suitable for small files but is a bad idea if memory is limited. Instead, one 
should save the data to a temp file and use that file as input.
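
A minimal sketch of that workaround, using only the JDK: the stream is spooled to a temp file, and the file (not the stream) is then handed to the loader. The class name TempFileSpool and the file prefix are made up for this example; the PDFBox call appears only in a comment and assumes PDFBox 3.x on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of PDFBox.
final class TempFileSpool {
    private TempFileSpool() {}

    /** Copies the stream to a new temp file and returns its path. */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdfbox-input-", ".pdf");
        try {
            // Files.copy streams the data through a small buffer,
            // so heap usage does not grow with the size of the PDF.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            // Do not leave a partial file behind on failure.
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    // Intended usage (assumes PDFBox 3.x, hence left as a comment):
    //
    //   Path tmp = TempFileSpool.spoolToTempFile(pdfInputStream);
    //   try (PDDocument doc = Loader.loadPDF(tmp.toFile())) {
    //       // work with doc
    //   } finally {
    //       Files.deleteIfExists(tmp);
    //   }
}
```

The caller owns the temp file and should delete it once the document is closed. Whether this is enough to cap memory for documents with very large images is not something this sketch can promise; it only avoids buffering the raw input in memory.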

I'm going to document the changes of the io stuff in the migration guide.

Any possible new caching stuff (I don't have any in mind) should be 
implemented in the io package or somewhere else, but not inside the parser.

Cheers
Andreas
