Posted to users@pdfbox.apache.org by Stefan Magnus Landrø <st...@gmail.com> on 2015/01/27 12:50:40 UTC

non sequential parser and 2.0

Hi there,

I reported an issue related to the non sequential parser in the 1.8 code
line last year (PDFBOX-1965) and was really happy to see that the issue was
recently fixed. Thanks a lot, Andreas!

I also noticed that the non sequential parser will become the default
parser in 2.0.

In my project we're using PDFBox to verify that all pages in a given PDF
can be printed by a third-party print service (all pages have to be A4, use
only standard fonts or embed any others, have certain margins, etc.).
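
As a rough illustration of the page-size part of such a check, here is a
minimal sketch in plain Java. It is not the code from the project described
above: the class name A4Check, the tolerance parameter and the decision to
accept portrait orientation only are all assumptions made for this example.
With PDFBox, the width and height would come from each page's media box.

```java
/** Hypothetical helper: tests whether a page size given in PDF points is A4. */
final class A4Check {
    // One PDF point is 1/72 inch, and one inch is 25.4 mm.
    private static final float POINTS_PER_MM = 72f / 25.4f;

    // A4 is 210 mm x 297 mm, i.e. roughly 595.28 x 841.89 points.
    static final float A4_WIDTH_PT = 210f * POINTS_PER_MM;
    static final float A4_HEIGHT_PT = 297f * POINTS_PER_MM;

    private A4Check() {
    }

    /**
     * Returns true if the given size matches portrait A4 within tolerancePt
     * points on both axes. Landscape pages are rejected (an assumption of
     * this sketch; a real print service may have different rules).
     */
    static boolean isA4(float widthPt, float heightPt, float tolerancePt) {
        return Math.abs(widthPt - A4_WIDTH_PT) <= tolerancePt
                && Math.abs(heightPt - A4_HEIGHT_PT) <= tolerancePt;
    }
}
```

A small tolerance (for example one point) is useful because many producers
round A4 to 595 x 842 points.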

We noticed that the document returned by getDocument() takes up more and
more memory as we iterate over all the pages in the PDF, especially if the
PDF is large and structurally complex
(http://no.mouser.com/catalog/English/103/dload/pdf/mouser.pdf demonstrates
the effect well). We free the memory up gradually by doing the following in
a subclass of NonSequentialParser / CosParser:

    @Override
    public PDPage getPage(int pageNr) throws IOException {
        // Free up memory regularly
        if (pageNr % 5 == 0) {
            Set<COSObjectKey> cosObjectKeys =
                    super.xrefTrailerResolver.getXrefTable().keySet();
            for (COSObjectKey cosObjectKey : cosObjectKeys) {
                super.getDocument().removeObject(cosObjectKey);
            }
        }
        return super.getPage(pageNr);
    }

This feels a bit like a hack. Is there any chance this kind of functionality
could be built into PDFBox?

And, BTW, any idea when the 2.0 release will be ready? Are you planning on
shipping release candidates too (which would save people from having to
rely on or distribute snapshot versions)?

Thanks

Stefan


-- 
BEKK Open
http://open.bekk.no

TesTcl - a unit test framework for iRules
http://testcl.com