Posted to users@pdfbox.apache.org by D....@aurenz.de on 2017/07/13 07:20:55 UTC

Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Hi dear contributors to pdfbox,

I would like to report that Splitter.createNewDocument() always creates its documents in main memory. It should be able to honor different MemoryUsageSetting configurations.

In version 2.0.6 this method is implemented as


protected PDDocument createNewDocument() throws IOException
{
    PDDocument document = new PDDocument();
    document.getDocument().setVersion(getSourceDocument().getVersion());
    document.setDocumentInformation(getSourceDocument().getDocumentInformation());
    document.getDocumentCatalog().setViewerPreferences(
            getSourceDocument().getDocumentCatalog().getViewerPreferences());
    return document;
}



I suggest introducing a member variable "MemoryUsageSetting memSetting" that can be set on each instance of "Splitter".

This way createNewDocument() could be implemented as


protected PDDocument createNewDocument() throws IOException
{
    PDDocument document = new PDDocument(this.memSetting);
    document.getDocument().setVersion(getSourceDocument().getVersion());
    document.setDocumentInformation(getSourceDocument().getDocumentInformation());
    document.getDocumentCatalog().setViewerPreferences(
            getSourceDocument().getDocumentCatalog().getViewerPreferences());
    return document;
}


Thankfully createNewDocument() is not private, so I could override this method in my child class (as I also did for "protected void processPage()", just FYI, to create process messages).
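
The pattern described above (a per-instance setting that an overridable factory method consults, with a main-memory fallback) can be sketched without PDFBox on the classpath. This is only an illustration under stated assumptions: SketchMemory, SketchDocument and SketchSplitter are hypothetical stand-ins, not the PDFBox classes MemoryUsageSetting, PDDocument and Splitter, and the fallback rule is an assumption rather than the library's documented behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a memory policy; NOT the PDFBox MemoryUsageSetting class. */
enum SketchMemory { MAIN_MEMORY, TEMP_FILE }

/** Stand-in for a document that records which policy it was created with. */
final class SketchDocument {
    final SketchMemory memory;

    SketchDocument(SketchMemory memory) {
        this.memory = memory;
    }
}

/** Stand-in for Splitter: every output document comes from one factory method. */
class SketchSplitter {
    // null means "the caller did not configure anything"
    private SketchMemory memSetting;

    void setMemSetting(SketchMemory memSetting) {
        this.memSetting = memSetting;
    }

    /** Falls back to main memory so that existing callers keep their behaviour. */
    protected SketchDocument createNewDocument() {
        SketchMemory effective = (memSetting != null) ? memSetting : SketchMemory.MAIN_MEMORY;
        return new SketchDocument(effective);
    }

    /** Creates one output document per chunk, always through the factory method. */
    List<SketchDocument> split(int chunks) {
        List<SketchDocument> result = new ArrayList<SketchDocument>();
        for (int i = 0; i < chunks; i++) {
            result.add(createNewDocument());
        }
        return result;
    }
}
```

Because split() only ever calls createNewDocument(), a subclass can still override that one method, while callers who merely need a different memory policy can use the setter instead of subclassing.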


Please have a look at "PDFMergerUtility.mergeDocuments()", which has been deprecated since MemoryUsageSetting was introduced. The use of "PDFMergerUtility.mergeDocuments(MemoryUsageSetting memUsageSetting)" is now encouraged.


By the way: The utility "PDFSplit" would have to be updated to pass a configured MemoryUsageSetting to "Splitter" - otherwise this tool relies on main memory only.

Perhaps it would be a good thing to be able to define a "pdfbox-wide" basic MemoryUsageSetting which could be used everywhere as a fallback. This way the default constructor of PDDocument could be changed from

its implementation in version 2.0.6

public PDDocument()
{
    this(MemoryUsageSetting.setupMainMemoryOnly());
}


to something like


public PDDocument()
{
    this(MemoryUsageSetting.asConfigured());
}
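
How such a library-wide fallback might resolve its value can be sketched roughly as follows. This is an illustration with stand-in types only: MemoryUsageSetting.asConfigured() is the proposal made here, not an existing method as far as this thread shows, and DefaultMemoryMode, its Mode enum and the system property name are hypothetical names invented for the example.

```java
/** Stand-in sketch of a process-wide default; NOT part of PDFBox. */
final class DefaultMemoryMode {
    enum Mode { MAIN_MEMORY, TEMP_FILE, MIXED }

    /** Hypothetical system property name, used only for this illustration. */
    static final String PROPERTY = "sketch.pdf.memoryMode";

    // Explicitly configured default; null until configure() is called.
    private static volatile Mode configured;

    private DefaultMemoryMode() {
    }

    /** Sets (or, with null, clears) the explicit process-wide default. */
    static void configure(Mode mode) {
        configured = mode;
    }

    /** Explicit configuration wins, then the system property, then main memory. */
    static Mode asConfigured() {
        Mode explicit = configured;
        if (explicit != null) {
            return explicit;
        }
        String fromProperty = System.getProperty(PROPERTY);
        if (fromProperty != null) {
            // Mode.valueOf throws IllegalArgumentException for unknown names.
            return Mode.valueOf(fromProperty);
        }
        return Mode.MAIN_MEMORY;
    }
}
```

The last fallback stays at main memory on purpose, so that code relying on the current default constructor behaviour would see no change unless someone configures something else.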



Regards,

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


AW: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by D....@aurenz.de.
Hi Tilman,

I really appreciated the introduction of MemoryUsageSetting a couple of versions ago, as it saved me a lot of time and headaches when we ran into an out-of-memory problem while merging single-page documents into large multi-page documents.

If the PDFBox tools currently do not make use of this configuration option, people using these tools do not benefit from the introduction of MemoryUsageSetting - that's all :-)

Daniel




-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Friday, 14 July 2017 17:08
To: users@pdfbox.apache.org
Subject: Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Hi,

No, I did the setter/getter solution which is what you wrote.

Re PDFSplit command line - is this a problem that actually happened to you or just an idea?  If I start putting a memory option there I may have to put it in every tool :-(

Tilman



Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

No, I did the setter/getter solution which is what you wrote.

Re PDFSplit command line - is this a problem that actually happened to 
you or just an idea?  If I start putting a memory option there I may 
have to put it in every tool :-(

Tilman
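
On the "every tool" concern: one hedged way to avoid repeating the parsing is a single shared helper that all command line tools call. The sketch below is not PDFBox code. The class MemoryOption, its Mode enum and the -maxMainMemoryBytes flag are hypothetical names made up for illustration; only the -memory values (tempFile, mainMemory, mixed) come from the proposal in this thread. A tool could then map the result to the existing MemoryUsageSetting factory methods such as setupMainMemoryOnly(), setupTempFileOnly() or setupMixed(long).

```java
/** Hypothetical shared parser for a "-memory" option; NOT part of PDFBox. */
final class MemoryOption {
    enum Mode { MAIN_MEMORY, TEMP_FILE, MIXED }

    final Mode mode;
    /** -1 means that no limit was given on the command line. */
    final long maxMainMemoryBytes;

    private MemoryOption(Mode mode, long maxMainMemoryBytes) {
        this.mode = mode;
        this.maxMainMemoryBytes = maxMainMemoryBytes;
    }

    /** Scans args for the two options; all other arguments are left to the tool. */
    static MemoryOption parse(String[] args) {
        Mode mode = Mode.MAIN_MEMORY; // same default as a tool without the option
        long maxMain = -1;
        for (int i = 0; i < args.length; i++) {
            if ("-memory".equals(args[i])) {
                mode = toMode(valueAfter(args, i));
                i++; // skip the value that was just consumed
            } else if ("-maxMainMemoryBytes".equals(args[i])) {
                maxMain = Long.parseLong(valueAfter(args, i));
                i++;
            }
        }
        return new MemoryOption(mode, maxMain);
    }

    /** Returns the argument following position i, or fails if it is missing. */
    private static String valueAfter(String[] args, int i) {
        if (i + 1 >= args.length) {
            throw new IllegalArgumentException(args[i] + " requires a value");
        }
        return args[i + 1];
    }

    private static Mode toMode(String value) {
        switch (value) {
            case "mainMemory": return Mode.MAIN_MEMORY;
            case "tempFile":   return Mode.TEMP_FILE;
            case "mixed":      return Mode.MIXED;
            default:
                throw new IllegalArgumentException("unknown -memory value: " + value);
        }
    }
}
```

Throwing on a missing or unknown value, rather than printing usage and continuing, keeps the helper from reading past the end of the argument array; each tool can catch the exception and print its own usage text.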

On 14.07.2017 at 10:39, D.Hamann@aurenz.de wrote:
> Hi Tilman,
>
> I used a decompiler to have a look at the sources.
>
> Perhaps it would be a good idea to mark Splitter() as deprecated
>
>     @Deprecated
>     public Splitter() {}
>
>     public Splitter(MemoryUsageSetting memoryUsageSetting) {
>         this.memoryUsageSetting = memoryUsageSetting;
>     }
>
>
> to point people to the improvement before they fall into the out-of-memory hole themselves.
>
>
> Please add a program argument to PDFSplit.split() like so:
>
>     if (args[i].equals("-memory")) {
>         if (++i >= args.length) {
>             PDFSplit.usage();
>         }
>         if (args[i].equals("tempFile")) {
>             memoryUsageSetting = .........
>         } else if (args[i].equals("mainMemory")) {
>             memoryUsageSetting = .........
>         } else if (args[i].equals("mixed")) {
>             memoryUsageSetting = .........
>         } else {
>             PDFSplit.usage();
>         }
>         continue;
>     }
>
> Perhaps it would be a good idea to make "maxMainMemoryBytes" and "maxStorageBytes" configurable, too.
>
> Thanks a lot - I really appreciate your great work and support!
>
> Cheers,
>
> Daniel
>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Thursday, 13 July 2017 21:21
> To: users@pdfbox.apache.org
> Subject: Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents
>
> See
> https://issues.apache.org/jira/browse/PDFBOX-3869
>
> and try a snapshot from
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
> (at the bottom)
>
> Please give feedback on whether this is what you wanted. Please do it quickly, because a new version will be built on Monday, so either I'd have to revert before then or we'll be stuck with this API.
>
> Re: a global configuration - maybe at a later time. I'm not THAT convinced that it is needed.
>
> Tilman
>
>


Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by Tilman Hausherr <TH...@t-online.de>.
On 14.07.2017 at 11:23, D.Hamann@aurenz.de wrote:
> Hi,
>
> I see - thank you. Your mail from Thu 13.07.2017 21:21 (pretty late evening... ;-)) said https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/

Yes, that was so that you could use the jar in your own project.

Tilman


>


AW: AW: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by D....@aurenz.de.
Hi,

I see - thank you. Your mail from Thu 13.07.2017 21:21 (pretty late evening... ;-)) said https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/

Because I'm not that familiar with the PDFBox project, this led to my confusion.

Did you notice my review? Thanks again!


-----Original Message-----
From: Andreas Lehmkühler [mailto:andreas@lehmi.de]
Sent: Friday, 14 July 2017 11:18
To: users@pdfbox.apache.org
Subject: Re: AW: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

You are looking in the wrong place. pdfbox-app is just a meta project to create a convenience binary of all relevant subprojects. It doesn't contain any source code.

The source code you are looking for is here:

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/2.0.7-SNAPSHOT/

Andreas



Re: AW: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by Andreas Lehmkühler <an...@lehmi.de>.
You are looking in the wrong place. pdfbox-app is just a meta project to create a convenience binary of all relevant subprojects. It doesn't contain any source code.

The source code you are looking for is here:

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/2.0.7-SNAPSHOT/

Andreas

> On 14 July 2017 at 11:05, D.Hamann@aurenz.de wrote:
>
>
> Hi,
>
> I am talking about the snapshot versions provided here:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
>
> Can you tell me where to download jars containing source files? The source jars there just contain the META-INF directory but nothing else.
> 
> Thank you!
> 


AW: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by D....@aurenz.de.
Hi,

I am talking about the snapshot versions provided here:

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/

Can you tell me where to download jars containing source files? The source jars there just contain the META-INF directory but nothing else.

Thank you!

-----Original Message-----
From: Gilad Denneboom [mailto:gilad.denneboom@gmail.com]
Sent: Friday, 14 July 2017 11:03
To: users@pdfbox.apache.org
Subject: Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

You don't need a decompiler... PDFBox is an open-source library. All the code is available online.


Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by Gilad Denneboom <gi...@gmail.com>.
You don't need a decompiler... PDFBox is an open-source library. All the
code is available online.

On Fri, Jul 14, 2017 at 10:39 AM, <D....@aurenz.de> wrote:

> Hi Tilman,
>
> I used a decompiler to have a look at the sources.
>
> Perhaps it would be a good idea to set Splitter() deprecated
>
>             @deprecated
>             public Splitter() {}
>
>             public Splitter(MemoryUsageSetting memoryUsageSetting) {
>                 this.memoryUsageSetting = memoryUsageSetting;
>             }
>
>
> to point people to the improvement before they fall into the out of memory
> hole themselves.
>
>
> Please add a program argument to PDFSplit.split() like so:
>
>            if (args[i].equals("-memory")) {
>                 if (++i >= args.length) {
>                     PDFSplit.usage();
>                 }
>                 if (args[i].equals("tempFile")) {
>                           memoryUsageSetting = .........
>                 } else if (args[i].equals("mainMemory")) {
>                           memoryUsageSetting = .........
>                 } else if (args[i].equals("mixed")) {
>                           memoryUsageSetting = .........
>                 } else {
>                       PDFSplit.usage();
>                 }
>                 continue;
>             }
>
> Perhaps it would be a good idea to even make "maxMainMemoryBytes" and
> "maxStorageBytes" configurable, too.
>
> Thanks a lot - I really appreciate your great work and support!
>
> Cheers,
>
> Daniel
>
>
> -----Ursprüngliche Nachricht-----
> Von: Tilman Hausherr [mailto:THausherr@t-online.de]
> Gesendet: Donnerstag, 13. Juli 2017 21:21
> An: users@pdfbox.apache.org
> Betreff: Re: Splitter.createNewDocument() always uses main memory only -
> this leads to out of memory when splitting large documents
>
> See
> https://issues.apache.org/jira/browse/PDFBOX-3869
>
> and try a snapshot from
> https://repository.apache.org/content/groups/snapshots/org/
> apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
> (at the bottom)
>
> Please give feedback whether this is what you wanted. Please do it quickly
> because a new version will be built on monday so either I'd have to revert
> before or we'll be stuck with this API.
>
> Re: a global configuration - maybe at a later time. I'm not THAT convinced
> that it is needed.
>
> Tilman
>
>

Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by D....@aurenz.de.
Hi Tilman,

I used a decompiler to have a look at the sources.

Perhaps it would be a good idea to mark the no-argument constructor Splitter() as deprecated and to add a constructor that takes the setting:

            @Deprecated
            public Splitter() {}

            public Splitter(MemoryUsageSetting memoryUsageSetting) {
                this.memoryUsageSetting = memoryUsageSetting;
            }


That would point people to the improvement before they fall into the out-of-memory hole themselves.
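
The deprecation pattern suggested above could look roughly like the following sketch. It uses stand-in types only: SketchSplitter and MemoryPolicy are hypothetical names invented for illustration and are not PDFBox classes, and nothing here is meant to describe what the actual fix looks like.

```java
/** Hypothetical stand-in for a memory configuration; not a PDFBox type. */
enum MemoryPolicy { MAIN_MEMORY, TEMP_FILE }

/** Hypothetical stand-in for Splitter, reduced to its constructors. */
class SketchSplitter {
    private final MemoryPolicy policy;

    /**
     * Keeps the old behaviour (main memory only) for existing callers.
     *
     * @deprecated use {@link #SketchSplitter(MemoryPolicy)} so the caller picks a policy.
     */
    @Deprecated
    SketchSplitter() {
        this(MemoryPolicy.MAIN_MEMORY);
    }

    /** Preferred constructor: the caller states explicitly where data is buffered. */
    SketchSplitter(MemoryPolicy policy) {
        if (policy == null) {
            throw new IllegalArgumentException("policy must not be null");
        }
        this.policy = policy;
    }

    MemoryPolicy policy() {
        return policy;
    }
}
```

Existing code that calls the no-argument constructor keeps compiling and keeps its old behaviour, but the compiler warning and the Javadoc note steer callers toward the explicit variant.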


Please add a program argument to PDFSplit.split() like so:

            if (args[i].equals("-memory")) {
                if (++i >= args.length) {
                    PDFSplit.usage();
                }
                if (args[i].equals("tempFile")) {
                    memoryUsageSetting = .........
                } else if (args[i].equals("mainMemory")) {
                    memoryUsageSetting = .........
                } else if (args[i].equals("mixed")) {
                    memoryUsageSetting = .........
                } else {
                    PDFSplit.usage();
                }
                continue;
            }
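
For illustration, here is a self-contained sketch of the same argument handling that can be compiled and tried without PDFBox. MemoryMode and MemoryOptionParser are hypothetical names, and the mapping from a mode to an actual MemoryUsageSetting is deliberately left out, just as it is elided in the snippet above. Where PDFSplit would call usage(), this sketch throws an IllegalArgumentException instead.

```java
/** Hypothetical stand-in for the three buffering modes named in the message. */
enum MemoryMode { MAIN_MEMORY, TEMP_FILE, MIXED }

final class MemoryOptionParser {
    private MemoryOptionParser() {}

    /**
     * Scans args for "-memory <mode>" and returns the selected mode.
     * Returns MAIN_MEMORY when the option is absent.
     * Throws IllegalArgumentException for a missing or unknown value.
     */
    static MemoryMode parse(String[] args) {
        MemoryMode mode = MemoryMode.MAIN_MEMORY;
        for (int i = 0; i < args.length; i++) {
            if (!args[i].equals("-memory")) {
                continue; // other options are ignored in this sketch
            }
            if (++i >= args.length) {
                throw new IllegalArgumentException("-memory requires a value");
            }
            switch (args[i]) {
                case "tempFile":
                    mode = MemoryMode.TEMP_FILE;
                    break;
                case "mainMemory":
                    mode = MemoryMode.MAIN_MEMORY;
                    break;
                case "mixed":
                    mode = MemoryMode.MIXED;
                    break;
                default:
                    throw new IllegalArgumentException("unknown -memory value: " + args[i]);
            }
        }
        return mode;
    }
}
```

Throwing on a missing value also avoids reading args[i] past the end of the array, which the original snippet only avoids if usage() terminates the program.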

It might even be worth making "maxMainMemoryBytes" and "maxStorageBytes" configurable as well.

Thanks a lot - I really appreciate your great work and support!

Cheers, 

Daniel


-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Thursday, 13 July 2017 21:21
To: users@pdfbox.apache.org
Subject: Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

See
https://issues.apache.org/jira/browse/PDFBOX-3869

and try a snapshot from
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
(at the bottom)

Please give feedback whether this is what you wanted. Please do it quickly because a new version will be built on Monday, so either I'd have to revert before then or we'll be stuck with this API.

Re: a global configuration - maybe at a later time. I'm not THAT convinced that it is needed.

Tilman




Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by D....@aurenz.de.
Hi Tilman,

thanks a lot for addressing this topic so incredibly fast. I wanted to do the review, but the source jars at

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/

are significantly smaller than the jars there that contain only class files. That is because they contain just the META-INF directory and nothing else, at least in the one I checked, "pdfbox-app-2.0.7-20170713.214057-144-sources".
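
A check like the one described can be scripted with the JDK's zip classes alone. The following is only a sketch: SourcesJarCheck and countPayloadEntries are hypothetical names, and the code simply counts the non-directory entries that live outside META-INF/.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class SourcesJarCheck {
    private SourcesJarCheck() {}

    /**
     * Counts the non-directory entries outside META-INF/ in a jar read from the stream.
     * A result of 0 means the jar carries metadata only and no sources or classes.
     */
    static int countPayloadEntries(InputStream jar) throws IOException {
        int count = 0;
        try (ZipInputStream zip = new ZipInputStream(jar)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && !entry.getName().startsWith("META-INF/")) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

To inspect a downloaded file, pass it as a stream, for example via java.nio.file.Files.newInputStream(...) on the path of the sources jar.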

I'm pretty sure, though, that your code change is exactly what I suggested. If you could point me to the source code of the 2.0.7-SNAPSHOT today before 16:00, I will definitely have a look at it.

Cheers,

Daniel


-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Thursday, 13 July 2017 21:21
To: users@pdfbox.apache.org
Subject: Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

See
https://issues.apache.org/jira/browse/PDFBOX-3869

and try a snapshot from
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
(at the bottom)

Please give feedback whether this is what you wanted. Please do it quickly because a new version will be built on Monday, so either I'd have to revert before then or we'll be stuck with this API.

Re: a global configuration - maybe at a later time. I'm not THAT convinced that it is needed.

Tilman




Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by Tilman Hausherr <TH...@t-online.de>.
See
https://issues.apache.org/jira/browse/PDFBOX-3869

and try a snapshot from
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.7-SNAPSHOT/
(at the bottom)

Please give feedback whether this is what you wanted. Please do it
quickly because a new version will be built on Monday, so either I'd have
to revert before then or we'll be stuck with this API.

Re: a global configuration - maybe at a later time. I'm not THAT 
convinced that it is needed.

Tilman




Re: Splitter.createNewDocument() always uses main memory only - this leads to out of memory when splitting large documents

Posted by Tilman Hausherr <TH...@t-online.de>.
This is definitely a good idea. I'll create an issue later today...

Tilman
