Posted to users@pdfbox.apache.org by Jesus Jr M Salvo <je...@gmail.com> on 2013/11/09 07:11:17 UTC

Memory usage when counting the number of pages and creating bookmarks for large PDFs

pdfbox-1.8.2
tika-app-1.4 ( I'm including Apache Tika as I just found out that
Apache Tika comes with pdfbox )

I have various existing PDFs that I need to merge into one PDF. The
number of PDFs to be merged varies, anywhere from 2 to 10, 20 or
more, and the number of pages in each PDF also varies. These PDFs are
mostly documents ( e.g. medical reports ) scanned via an EDRMS like
HP TRIM7, so each page of the PDF is an image instead of text.

Merging them into a single PDF is no problem using the PDFMergerUtility.

After I have merged them into a single PDF, I then need to add
bookmarks so that the person reading the PDF ( e.g. insurer, trustee )
can quickly jump to the section of the merged PDF that holds one of
the source PDFs.

The issue is the memory consumption. The merged PDF tends to be quite
large ( anywhere from 200MB to 1GB, again because each individual PDF
was scanned via an EDRMS like HP TRIM7, so each page tends to be an
image ). With several of these merges running in parallel, I can
easily consume the entire heap allocated to the JVM.

To create the bookmarks, I have to open the large / merged PDF.

So the question is: is there a better way of creating bookmarks, so
that the amount of memory consumed is minimal?

Note that I am making sure I am calling PDDocument.close() in a
finally clause. See snippets below.


1) To create the bookmarks, I have to find out the number of pages in
each PDF before they are merged. Something like this, in a loop:

PDDocument document = null;
try {
    document = PDDocument.load(aDownload.getLocalFile());
    aDownload.setNumberOfPages( document.getNumberOfPages() );
} finally {
    if( document != null ) {
        document.close();
    }
}

2) Then I have to open the large / merged PDF file and create the
bookmarks, using the page counts from above as the guide ( I also
have to set the meta-data on the PDF: the author, creation date and
title ):

private void finaliseDocument(
        final File pdfFile,
        final List<DocumentDownloadEntry> downloadEntries )
        throws Exception
{
    logger.log(Level.INFO, String.format("Finalising PDF document %s", pdfFile.toString()));
    PDDocument document = null;
    try {
        // Open the large / merged PDF.
        document = PDDocument.load(pdfFile);
        // Make the viewer show the bookmarks panel when the PDF is opened.
        document.getDocumentCatalog().setPageMode(PDDocumentCatalog.PAGE_MODE_USE_OUTLINES);
        // Set the meta-data: creation date, author and title.
        document.getDocumentInformation().setCreationDate(Calendar.getInstance());
        document.getDocumentInformation().setAuthor(getUserName());
        document.getDocumentInformation().setTitle(
                getClaimDocuments().getEquipId() + " - " + getSubmissionType());
        makeBookmarks( document, downloadEntries );
        // Write the result back over the merged file.
        document.save(pdfFile);
    } finally {
        if( document != null ) {
            document.close();
        }
    }
}

private void makeBookmarks(
        final PDDocument document,
        final List<DocumentDownloadEntry> downloadEntries)
        throws Exception
{
    // Root outline with a single top-level item named after the document title.
    PDDocumentOutline outline = new PDDocumentOutline();
    document.getDocumentCatalog().setDocumentOutline( outline );
    PDOutlineItem pagesOutline = new PDOutlineItem();
    pagesOutline.setTitle( document.getDocumentInformation().getTitle() );
    outline.appendChild( pagesOutline );

    @SuppressWarnings("rawtypes")
    List pages = document.getDocumentCatalog().getAllPages();
    // Zero-based index of the first page of the current source PDF in the
    // merged PDF (relies on the PDFs having been merged in this order).
    int pageIndex = 0;
    for( DocumentDownloadEntry aDownload : downloadEntries ) {
        if( aDownload.isDownload() && aDownload.isDownloaded() ) {
            PDPage page = (PDPage)pages.get( pageIndex );
            pageIndex += aDownload.getNumberOfPages();

            // One bookmark per source PDF, pointing at its first page.
            PDPageFitWidthDestination dest = new PDPageFitWidthDestination();
            dest.setPage( page );
            PDOutlineItem bookmark = new PDOutlineItem();
            bookmark.setDestination( dest );
            bookmark.setTitle( aDownload.getDocumentName() );
            pagesOutline.appendChild( bookmark );
        }
    }
    pagesOutline.openNode();
    outline.openNode();
}
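
For illustration, the page arithmetic that makeBookmarks() relies on
can be looked at without PDFBox at all. The sketch below is only one
way it might be factored out: BookmarkOffsets and startPages are
hypothetical names, not part of PDFBox or of the code above, and it
assumes the source PDFs are merged in the same order in which their
page counts are supplied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of PDFBox): works out where each source
// PDF starts inside the merged PDF, given only the per-PDF page counts.
final class BookmarkOffsets {

    private BookmarkOffsets() { }

    /**
     * Returns, for each source PDF, the zero-based index of its first page
     * in the merged PDF. Assumes the PDFs are merged in list order.
     */
    static List<Integer> startPages(List<Integer> pageCounts) {
        List<Integer> starts = new ArrayList<>(pageCounts.size());
        int next = 0; // index of the first page of the next source PDF
        for (int count : pageCounts) {
            if (count < 0) {
                throw new IllegalArgumentException("negative page count: " + count);
            }
            starts.add(next);
            next += count; // running total of pages seen so far
        }
        return starts;
    }

    public static void main(String[] args) {
        // Three source PDFs with 3, 5 and 2 pages.
        System.out.println(startPages(Arrays.asList(3, 5, 2))); // prints [0, 3, 8]
    }
}
```

Each returned index is the value that pageIndex holds in makeBookmarks()
at the moment pages.get( pageIndex ) is called for that source PDF.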

Re: Memory usage when counting the number of pages and creating bookmarks for large PDFs

Posted by Jesus Jr M Salvo <je...@gmail.com>.
It seems that PDFMergerUtility.appendDocument(PDDocument dest,
PDDocument src) is the solution for the merge step, as you can then
use a RandomAccessFile when loading the documents.

On 11 November 2013 20:31, Jesus Jr M Salvo <je...@gmail.com> wrote:
> Thanks.
>
> I tried adding the bookmark to the source PDFs upfront, but they are
> not merged into the merged PDF. However, using a scratch file /
> org.apache.pdfbox.io.RandomAccessFile worked pretty well to bring down
> memory usage. So am happy with that.
>
> Now the only thing left is the memory usage when actually merging. I
> was using PDFMergerUtility.addSource( File ) multiple times then doing
> a PDFMergerUtility.setDestinationStream() and
> PDFMergerUtility.mergeDocuments(). The memory usage when calling
> PDFMergerUtility.mergeDocuments() is the last bit where memory jumps
> quite high.

Re: Memory usage when counting the number of pages and creating bookmarks for large PDFs

Posted by Jesus Jr M Salvo <je...@gmail.com>.
Thanks.

I tried adding the bookmarks to the source PDFs upfront, but they are
not carried over into the merged PDF. However, using a scratch file /
org.apache.pdfbox.io.RandomAccessFile worked pretty well to bring down
memory usage, so I am happy with that.

Now the only thing left is the memory usage when actually merging. I
was calling PDFMergerUtility.addSource( File ) multiple times, then
PDFMergerUtility.setDestinationStream() and
PDFMergerUtility.mergeDocuments(). The call to
PDFMergerUtility.mergeDocuments() is the last place where memory
usage jumps quite high.

On 9 November 2013 19:29, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> Hi,
>
> There are some possible improvements:
>
> # add the bookmarks to the source files upfront - they will be merged into the target
> # use a scratch file when loading the PDFs, e.g. PDDocument.load(InputStream input, RandomAccess scratchFile), so temporary data is stored in a file instead of in memory, which lowers memory consumption at runtime
> # improve how the images are stored in the PDF, e.g. by using a different compression algorithm. This is more complicated, as you need to preprocess your PDFs, but it might help you produce smaller result files.
>
> BR
>
> Maruan Sahyoun

Re: Memory usage when counting the number of pages and creating bookmarks for large PDFs

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi,

There are some possible improvements:

# add the bookmarks to the source files upfront - they will be merged into the target
# use a scratch file when loading the PDFs, e.g. PDDocument.load(InputStream input, RandomAccess scratchFile), so temporary data is stored in a file instead of in memory, which lowers memory consumption at runtime
# improve how the images are stored in the PDF, e.g. by using a different compression algorithm. This is more complicated, as you need to preprocess your PDFs, but it might help you produce smaller result files.

BR

Maruan Sahyoun
