Posted to dev@pdfbox.apache.org by "Mark A. Claassen" <MC...@ocie.net> on 2021/06/10 17:07:13 UTC

RE: [Possible Spam] Re: PDF Memory issue

Thanks for the reply.

> Why should the list not be kept? We need it for when the file is saved.

I need to study that code a bit more; there is a lot going on there that I don't yet understand.  What I was wondering is whether there might be an alternative to keeping the stream object in memory, like storing the necessary metadata for it in a smaller structure.
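
To make the "smaller structure" idea concrete, here is a minimal sketch of a buffer that holds only a file offset while its page is not in use and reloads the bytes on demand. This is not PDFBox's ScratchFileBuffer: the class ReleasablePage, its methods, and the helper openScratchFile() are hypothetical names invented for illustration, and the sketch ignores multi-page buffers and thread safety.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Hypothetical page buffer that keeps only a file offset while released. */
class ReleasablePage {
    static final int PAGE_SIZE = 4096;

    private final RandomAccessFile scratch; // scratch file shared by all pages
    private final long offset;              // where this page lives in the file
    private byte[] current;                 // null while the page is released

    ReleasablePage(RandomAccessFile scratch, long offset) {
        this.scratch = scratch;
        this.offset = offset;
    }

    /** Opens a temporary scratch file (hypothetical helper for the demo). */
    static RandomAccessFile openScratchFile() {
        try {
            File tmp = File.createTempFile("scratch", ".bin");
            tmp.deleteOnExit();
            return new RandomAccessFile(tmp, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void put(int index, byte value) {
        load();
        current[index] = value;
    }

    byte get(int index) {
        load();
        return current[index];
    }

    /** Writes the page to the scratch file and drops the in-memory copy. */
    void release() {
        if (current == null) {
            return;
        }
        try {
            scratch.seek(offset);
            scratch.write(current);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        current = null;
    }

    boolean isResident() {
        return current != null;
    }

    /** Reloads the page from the file, or starts with zeros if never written. */
    private void load() {
        if (current != null) {
            return;
        }
        byte[] page = new byte[PAGE_SIZE];
        try {
            if (scratch.length() >= offset + PAGE_SIZE) {
                scratch.seek(offset);
                scratch.readFully(page);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        current = page;
    }

    public static void main(String[] args) {
        ReleasablePage page = new ReleasablePage(openScratchFile(), 0);
        page.put(0, (byte) 42);
        page.release();
        System.out.println(page.isResident()); // only the offset is held now
        System.out.println(page.get(0));       // reloaded from the scratch file
    }
}
```

While released, such an object costs a reference and a long instead of a 4096-byte array; the price is an extra file read the next time the page is touched.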

Maybe the stream is the perfect object for this.  However, at 4K or more a piece, and one per page, this scales at least linearly with the number of pages.  When dealing with "normal" documents, this is not an issue.  But when the number of pages gets large, this overhead is significant. 

We had someone try to create a PDF from a 25,000 page text source.  25,000 * 4K is about 100 megabytes.  If it were possible to not maintain any data in the ScratchFileBuffer, it would scale a bit better.
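
The arithmetic behind these figures can be checked with a few lines of Java. This is only a sketch: it assumes exactly one 4096-byte buffer is retained per page, as described in this thread, and the class and method names are hypothetical. Under that assumption 25,000 pages hold 102,400,000 bytes and 40,000 pages hold 163,840,000 bytes.

```java
/** Hypothetical helper for estimating heap held by per-page buffers. */
class BufferOverhead {

    /** Bytes held on the heap if every page keeps one buffer of pageSize bytes. */
    static long retainedBytes(long pages, int pageSize) {
        return pages * pageSize;
    }

    public static void main(String[] args) {
        int pageSize = 4096; // page size quoted in the thread
        for (long pages : new long[] {25_000, 40_000}) {
            long bytes = retainedBytes(pages, pageSize);
            System.out.printf("%,d pages -> %,d bytes (%.1f MiB)%n",
                    pages, bytes, bytes / (1024.0 * 1024.0));
        }
    }
}
```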

Thanks again,

Mark Claassen
Senior Software Engineer

Donnell Systems, Inc.
130 South Main Street
Leighton Plaza Suite 375
South Bend, IN  46601
E-mail: mailto:mclaassen@ocie.net
Voice: (574)232-3784
Fax: (574)232-4014

Disclaimer:
The opinions provided herein do not necessarily state or reflect 
those of Donnell Systems, Inc.(DSI). DSI makes no warranty for and 
assumes no legal liability or responsibility for the posting. 


-----Original Message-----
From: Tilman Hausherr <TH...@t-online.de> 
Sent: Thursday, June 10, 2021 12:02 PM
To: dev@pdfbox.apache.org
Subject: [Possible Spam] Re: PDF Memory issue
Importance: Low

Why should the list not be kept? We need it for when the file is saved.

Tilman

Am 10.06.2021 um 03:07 schrieb Mark A. Claassen:
> (This was started on the users list, but I am switching over to the 
> dev list.)
>
> I found the issue.  I have a bunch of small pages.  The COSDocument keeps a list of the streams that have been created.  The problem is that the currentPage in each ScratchFileBuffer is always in memory.  If there are 40,000 pages, this adds up to 40,000 * the page size (4096 bytes), which is over 160,000,000 bytes.
>
> So, now I am not sure how to deal with this.  Each page has a PDPageContentStream, which creates a ScratchFileBuffer.
> This ScratchFileBuffer is kept in the list of streams.  I could recompile with a smaller page size, but that would only shrink the problem by a constant factor.  Does anyone think it may be possible to change this to not maintain the list of streams?  Or maybe clear the currentPage byte array for the items in the list?
>
> I am willing to do some work on this, but a little guidance (or realism) would be helpful before I get too deep into this.
>
> Thanks,
>
> -----Original Message-----
> From: Mark A. Claassen <MC...@ocie.net>
> Sent: Wednesday, June 9, 2021 4:53 PM
> To: users@pdfbox.apache.org
> Subject: [Possible Spam] RE: PDF Memory issue
> Importance: Low
>
> In looking at this further, it seems that the ScratchFileBuffer.close method is only called when the document is closed, and ScratchFileBuffer.clear is never called.
>
> These two methods are the only places where pageHandler.markPagesAsFree is called.  I believe this is the issue: since markPagesAsFree is not called until the document is closed, this content just keeps building up.
>
> Any guidance would be greatly appreciated.  I can't seem to find a configuration workaround for this issue.
>
>
>
> -----Original Message-----
> From: Mark A. Claassen <MC...@ocie.net>
> Sent: Wednesday, June 9, 2021 1:39 PM
> To: users@pdfbox.apache.org
> Subject: [Possible Spam] PDF Memory issue
> Importance: Low
>
> Hi.  Thanks for your time.
>
> I am using PDFBox and am having trouble creating large PDFs (50,000+ pages).  The heap size of the process is capped, but with the temp file active (which I can see being created) I didn't think this would matter.
>
> Here is what I am doing in a very condensed form:
> 	MEMORY_SETTING = MemoryUsageSetting.setupTempFileOnly();
> 	PDDocument doc = new PDDocument(MEMORY_SETTING);
> 	
> 	for (...) {
> 		String text = [generate page text]
> 		PDPage page = new PDPage(PDRectangle.LETTER);
> 		try (PDPageContentStream contentStream = new PDPageContentStream(doc, page,
> 				PDPageContentStream.AppendMode.OVERWRITE, false)) {
> 			[...]
> 			contentStream.endText();
> 			doc.addPage(page);
> 		}
> 	}
>
> When I do a heap dump, I see over 100 MB of memory taken by 42,000 
> instances of ScratchFileBuffer.currentPage.
>
> Is there something I am doing wrong here?  Or is this a bug?  It seems like I must be doing something wrong / forgetting to do something, since this is a problem in 2 and 3-RC1.
>
> Thanks again,
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org



Re: [Possible Spam] Re: PDF Memory issue

Posted by "Martinez, Mel - 0441 - MITLL" <m....@ll.mit.edu>.

Ah.  Too bad.

Note that, if the byte arrays are immutable (or at least treated as such) and wrapped in an object (such as ByteBuffer) with, as I indicated, proper .equals() and .hashCode() implementations, object pooling can still be effective.
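
A minimal sketch of that idea follows. It swaps ByteBuffer for a small custom wrapper, because a wrapper with a fixed hash is easier to keep safe as a map key; ByteBlock and ByteBlockPool are hypothetical names, not PDFBox classes. Pooling only saves memory when many blocks have identical content, which may or may not be true of the page buffers discussed in this thread.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;

/** Immutable wrapper that gives a byte array value-based equals() and hashCode(). */
final class ByteBlock {
    private final byte[] bytes;
    private final int hash;

    ByteBlock(byte[] source) {
        this.bytes = source.clone();          // defensive copy, content never changes
        this.hash = Arrays.hashCode(bytes);   // computed once, safe because immutable
    }

    int length() {
        return bytes.length;
    }

    byte get(int index) {
        return bytes[index];
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof ByteBlock
                && Arrays.equals(bytes, ((ByteBlock) other).bytes);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}

/** Pool that hands back one shared ByteBlock per distinct content. */
class ByteBlockPool {
    private final ConcurrentHashMap<ByteBlock, ByteBlock> pool = new ConcurrentHashMap<>();

    ByteBlock intern(byte[] data) {
        ByteBlock candidate = new ByteBlock(data);
        // putIfAbsent returns the earlier entry, or null if candidate was just added
        ByteBlock existing = pool.putIfAbsent(candidate, candidate);
        return existing != null ? existing : candidate;
    }

    int size() {
        return pool.size();
    }
}
```

Two calls to intern() with equal content return the same ByteBlock instance, so only one copy of the bytes stays reachable through the pool.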

Good luck!

Mel

Dr. Mel Martinez
m.martinez@ll.mit.edu



> On Jun 10, 2021, at 4:57 PM, Mark A. Claassen <MC...@ocie.net> wrote:
> 
> Thanks for the tips.  I don't think they will help here, however.  The 4K object that is being held is a byte array.
> 

RE: [Possible Spam] Re: PDF Memory issue

Posted by "Mark A. Claassen" <MC...@ocie.net>.

Thanks for the tips.  I don't think they will help here, however.  The 4K object that is being held is a byte array.

Thanks again,

  
-------------------------------------------
Confidentiality Notice: OCIESERVICE
-------------------------------------------
The contents of this e-mail message and any attachments are intended solely for the addressee(s) named in this message. This communication is intended to be and to remain confidential. If you are not the intended recipient of this message, or if this message has been addressed to you in error, please immediately alert the sender by reply e-mail and then delete this message and its attachments. Do not deliver, distribute, copy, disclose the contents or take any action in reliance upon the information contained in the communication or any attachments.


-----Original Message-----
From: Martinez, Mel - 0441 - MITLL <m....@ll.mit.edu> 
Sent: Thursday, June 10, 2021 4:28 PM
To: dev@pdfbox.apache.org
Cc: Martinez, Mel - 0441 - MITLL <m....@ll.mit.edu>
Subject: Re: [Possible Spam] Re: PDF Memory issue
Importance: Low

I haven’t looked at this particular code at all, but I’m guessing that a LOT of the objects being referenced are strings — possibly identical strings?


Re: [Possible Spam] Re: PDF Memory issue

Posted by "Martinez, Mel - 0441 - MITLL" <m....@ll.mit.edu>.

I haven’t looked at this particular code at all, but I’m guessing that a LOT of the objects being referenced are strings — possibly identical strings?

It may be useful to implement object (string) pooling.    That can save a ton of memory.

Do not use the built-in String.intern() function for this, though.   That is limited and slow.    It’s better to build the string pool around something like ConcurrentHashMap.putIfAbsent().

You then need to rewrite the code to do a pool check whenever new strings are created / input.


ConcurrentHashMap<String, String> stringPool = new ConcurrentHashMap<>();  // do this once and make it available to all your code, whether as a singleton or static.

String s = someStepThatCreatesOrInputsAString();

String pooled = stringPool.putIfAbsent(s, s);   // add this step everywhere; returns the earlier copy, or null if s was just added
if (pooled != null) s = pooled;


This imposes a small lookup cost with every putIfAbsent() call, but benchmarks you can find on the 'net show it's still way faster than String.intern(), especially for a large number of strings.  The putIfAbsent() call is atomic, so this is thread safe.

The end result is that you can enforce that you will have only one copy of any string in memory, regardless of how many references you might have of it.
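
For anyone who wants to try this, here is a self-contained sketch of such a pool. StringPool and its intern() method are hypothetical names. One detail worth keeping in mind is that putIfAbsent() returns null when the key was not yet present, so the sketch falls back to the argument in that case.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical application-level string pool built on ConcurrentHashMap. */
class StringPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to s, adding s if none exists yet. */
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();
        // new String(...) forces two distinct objects with equal content
        String a = pool.intern(new String("invoice"));
        String b = pool.intern(new String("invoice"));
        System.out.println(a == b); // true: both references share one instance
    }
}
```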

For non-String objects, this can also be used, but it's important that the object have proper .equals() and .hashCode() methods implemented.

I hope this suggestion is helpful.  This pattern saved me massive amounts of memory in clients pulling data from cloud databases. 

If it doesn’t make sense to apply in this particular code, then hopefully it will still prove a useful tip for someone somewhere else.

Cheers,

Mel

Dr. Mel Martinez
m.martinez@ll.mit.edu



> On Jun 10, 2021, at 1:07 PM, Mark A. Claassen <MC...@ocie.net> wrote:
> 
> Thanks for the reply.
> 
>> Why should the list not be kept? We need it for when the file is saved.
> 
> I need to study that code a bit more, there is a lot going on there that I don't yet understand.  What I was thinking was if there might be an alternative to keeping the stream object in memory, like storing the necessary metadata for it in a smaller structure.  
> 
> Maybe the stream is the perfect object for this.  However, at 4K or more a piece, and one per page, this scales at least linearly with the number of pages.  When dealing with "normal" documents, this is not an issue.  But when the number of pages gets large, this overhead is significant. 
> 
> We had someone try to create a PDF from a 25,000 page text source.  25,000 * 4K is 100 megabytes.  If it was possible to not maintain any data in the ScratchFileBuffer, it would scale a bit better.
> 
> Thanks again,
> 
> Mark Claassen
> Senior Software Engineer
> 
> Donnell Systems, Inc.
> 130 South Main Street
> Leighton Plaza Suite 375
> South Bend, IN  46601
> E-mail: mailto:mclaassen@ocie.net
> Voice: (574)232-3784
> Fax: (574)232-4014
> 
> Disclaimer:
> The opinions provided herein do not necessarily state or reflect 
> those of Donnell Systems, Inc.(DSI). DSI makes no warranty for and 
> assumes no legal liability or responsibility for the posting. 
> 
> 
> -----Original Message-----
> From: Tilman Hausherr <TH...@t-online.de> 
> Sent: Thursday, June 10, 2021 12:02 PM
> To: dev@pdfbox.apache.org
> Subject: [Possible Spam] Re: PDF Memory issue
> Importance: Low
> 
> Why should the list not be kept? We need it for when the file is saved.
> 
> Tilman
> 
> Am 10.06.2021 um 03:07 schrieb Mark A. Claassen:
>> (This was started on the users list, but I am switching over to the 
>> dev list.)
>> 
>> I found the issue.  I have a bunch of small pages.  The COSDocument keeps a list of the streams that have been created.  The problem is that the currentPage in the ScratchFileBuffer is always in memory.  If there are 40,000 pages, then this will add up to 40,000 * the page size (4096) which is over 160,000,000.
>> 
>> So, now I am not sure how to deal with this.  Each page has a PDFPageContentStream, which creates a ScratchFileBuffer.
>> This ScratchFileBuffer is kept in the list of streams.  I could recompile with a smaller page size, but that would only shrink the problem by a constant factor.  Does anyone think it may be possible to change this to not maintain the list of streams?  Or maybe clear the currentPage byte array for the items in the list?
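
One way to picture the "clear the currentPage byte array" idea is the sketch below. It is only a toy model under stated assumptions: each buffer holds at most one page and owns a fixed offset in a shared scratch file. SpillableBuffer and all of its methods are hypothetical names; the real ScratchFileBuffer spans multiple pages through its page handler, and none of that is modeled here.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; this is NOT PDFBox's ScratchFileBuffer.
class SpillableBuffer {
    private static final int PAGE_SIZE = 4096;

    private final RandomAccessFile scratch; // shared backing file
    private final long offset;              // where this buffer's page lives in the file
    private byte[] currentPage;             // null while the page is only on disk
    private int length;                     // number of valid bytes in the page

    SpillableBuffer(RandomAccessFile scratch, long offset) {
        this.scratch = scratch;
        this.offset = offset;
    }

    void write(byte[] data) throws IOException {
        if (data.length > PAGE_SIZE - length) {
            throw new IOException("this sketch handles a single page only");
        }
        load();
        System.arraycopy(data, 0, currentPage, length, data.length);
        length += data.length;
    }

    // Flush the page to the scratch file and drop the in-memory copy.
    void release() throws IOException {
        if (currentPage == null) {
            return;
        }
        scratch.seek(offset);
        scratch.write(currentPage, 0, length);
        currentPage = null;
    }

    byte[] read() throws IOException {
        load();
        return Arrays.copyOf(currentPage, length);
    }

    boolean isResident() {
        return currentPage != null;
    }

    // Re-create the in-memory page on demand, reading back any spilled bytes.
    private void load() throws IOException {
        if (currentPage != null) {
            return;
        }
        currentPage = new byte[PAGE_SIZE];
        if (length > 0) {
            scratch.seek(offset);
            scratch.readFully(currentPage, 0, length);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("scratch", ".tmp");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            SpillableBuffer buffer = new SpillableBuffer(raf, 0);
            buffer.write("page content".getBytes(StandardCharsets.US_ASCII));
            buffer.release(); // the 4096-byte array is now eligible for garbage collection
            System.out.println("resident after release: " + buffer.isResident());
            System.out.println(new String(buffer.read(), StandardCharsets.US_ASCII));
        }
    }
}
```

In this model a released buffer keeps only a file offset and a length on the heap, which is the kind of "smaller structure" discussed in this thread. Whether the actual stream classes can be changed this way without breaking saving is exactly the open question here.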
>> 
>> I am willing to do some work on this, but a little guidance (or realism) would be helpful before I get too deep into this.
>> 
>> Thanks,
>> 
>> Mark Claassen
>> Senior Software Engineer
>> 
>> -----Original Message-----
>> From: Mark A. Claassen <MC...@ocie.net>
>> Sent: Wednesday, June 9, 2021 4:53 PM
>> To: users@pdfbox.apache.org
>> Subject: [Possible Spam] RE: PDF Memory issue
>> Importance: Low
>> 
>> In looking at this further, it seems that the ScratchFileBuffer.close method is only called when the document is closed.  ScratchFileBuffer.clear is never called.
>> 
>> These are the only places where pageHandler.markPagesAsFree is called.  I believe this is the issue: since markPagesAsFree is not called until the document is closed, this content just keeps building up.
>> 
>> Any guidance would be greatly appreciated.  I can't seem to find a configuration workaround for this issue.
>> 
>> Mark Claassen
>> Senior Software Engineer
>> 
>> 
>> 
>> -----Original Message-----
>> From: Mark A. Claassen <MC...@ocie.net>
>> Sent: Wednesday, June 9, 2021 1:39 PM
>> To: users@pdfbox.apache.org
>> Subject: [Possible Spam] PDF Memory issue
>> Importance: Low
>> 
>> Hi.  Thanks for your time.
>> 
>> I am using PDFBox and am having trouble creating large PDFs (50,000+ pages).  The heap size of the process is capped, but with the temp file active (which I can see being created) I didn't think this would matter.
>> 
>> Here is what I am doing in a very condensed form:
>> 	MemoryUsageSetting MEMORY_SETTING = MemoryUsageSetting.setupTempFileOnly();
>> 	PDDocument doc = new PDDocument(MEMORY_SETTING);
>> 	
>> 	for (...) {
>> 		String text = [generate page text]
>> 		PDPage page = new PDPage(PDRectangle.LETTER);
>> 		// One content stream per page; try-with-resources closes it at the end of the block.
>> 		try (PDPageContentStream contentStream = new PDPageContentStream(doc, page, PDPageContentStream.AppendMode.OVERWRITE, false)) {
>> 			// ... text drawing calls omitted in this condensed form ...
>> 			contentStream.endText();
>> 			doc.addPage(page);
>> 		}
>> 	}
>> 
>> When I do a heap dump, I see over 100 MB of memory taken by 42,000 instances of ScratchFileBuffer.currentPage.
>> 
>> Is there something I am doing wrong here?  Or is this a bug?  It seems like I must be doing something wrong or forgetting to do something, since this is a problem in both 2 and 3-RC1.
>> 
>> Thanks again,
>> 
>> Mark Claassen
>> Senior Software Engineer
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
>