Posted to users@pdfbox.apache.org by Stefan Magnus Landrø <st...@gmail.com> on 2014/03/06 15:39:24 UTC

Re: Stream parsing huge PDF document in order to prevent memory issues

Hi there,

So I tried using the NonSequentialParser, setting the
org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property to
true.

The memory footprint looks much better; however, I can't get the individual
pages due to an NPE in the getPage code.

It turns out the resDict below is mostly null, which in turn causes an NPE
in parseDictObjects.

Should I file a bug?

Stefan


    public PDPage getPage(int pageNr) throws IOException
    {
        getPagesObject();

        // ---- get list of top level pages
        COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);

        if (kids == null)
        {
            throw new IOException("Missing 'Kids' entry in pages dictionary.");
        }

        // ---- get page we are looking for (possibly going recursively into subpages)
        COSObject pageObj = getPageObject(pageNr, kids, 0);

        if (pageObj == null)
        {
            throw new IOException("Page " + pageNr + " not found.");
        }

        // ---- parse all objects necessary to load page.
        COSDictionary pageDict = (COSDictionary) pageObj.getObject();

        if (parseMinimalCatalog && (!allPagesParsed))
        {
            // parse page resources since we did not do this on start
            COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
            parseDictObjects(resDict);
        }

        return new PDPage(pageDict);
    }
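
For what it's worth, the NPE could be sidestepped with a null check before the
parseDictObjects call. A self-contained sketch of that guard follows; the
COSDictionary type and parseDictObjects method below are minimal stand-ins so
the example compiles on its own, NOT the real PDFBox classes:

```java
// Sketch: skip resource parsing when the Resources entry resolves to null.
// COSDictionary and parseDictObjects are stand-ins, not PDFBox code.
import java.util.HashMap;

public class ResourceGuard
{
    static class COSDictionary extends HashMap<String, Object>
    {
        Object getDictionaryObject(String name)
        {
            return get(name);
        }
    }

    static int parsedObjects = 0;

    // stand-in for the parser call that dereferences its argument
    static void parseDictObjects(COSDictionary dict)
    {
        parsedObjects += dict.size();
    }

    static void parsePageResources(COSDictionary pageDict)
    {
        COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject("Resources");
        if (resDict != null) // the missing null check: minimal parsing may leave this unset
        {
            parseDictObjects(resDict);
        }
    }

    public static void main(String[] args)
    {
        COSDictionary pageWithoutResources = new COSDictionary();
        parsePageResources(pageWithoutResources); // no NPE with the guard in place
        System.out.println(parsedObjects); // 0
    }
}
```

Of course a guard only hides the symptom here; the underlying question is why
resDict is null in the first place.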



2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:

> Hi,
>
> PDF is a random-access format: key information (the cross-reference table,
> which records where to find the objects) sits at the end of the file, and
> the PDF objects are spread around the file.
>
> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> instead of PDDocument.load and set the system property
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal, which does
> minimal parsing of the PDF. That could reduce the memory consumption a
> little. Unfortunately, once an object has been parsed its content stays in
> memory, so you would need to do low-level parsing yourself with the
> information available from the initial parsing stage.
>
> Maruan Sahyoun
>
> Am 14.02.2014 um 09:50 schrieb Stefan Magnus Landrø <
> stefan.landro@gmail.com>:
>
> > Hi there,
> >
> > I'm trying to validate random PDFs (potentially huge, 100s of MBs)
> > according to the following rule set:
> > - Dimensions of all pages should be A4 (297 mm * 210 mm)
> > - There should be no content within a certain rectangular area of a page
> > (left margin where the print shop inserts a bar code)
> > - Number of pages should be less than N
> > - PDF version used
> >
> > So far we've been using PDDocument.load with a scratch file, but with
> > huge documents (e.g. product catalogues), things explode.
> > Is there a way to stream parse a PDF similar to stream parsing an XML
> > document (e.g. using StAX) and validate one page at a time?
> >
> > Cheers
> >
> > Stefan
>
>
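
A self-contained sketch of the A4 rule from the quoted rule set: PDF media
boxes are measured in points (1 pt = 1/72 inch), so a nominal A4 page is about
595.28 x 841.89 pt. Plain Java with no PDFBox dependency; the names are made
up for illustration:

```java
// A4 is 210 mm x 297 mm; PDF user space is in points (1 pt = 1/72 inch),
// so a nominal A4 media box is roughly 595.28 x 841.89 pt.
public class A4Check
{
    static final double MM_PER_POINT = 25.4 / 72.0;

    static boolean isA4(double widthPt, double heightPt, double toleranceMm)
    {
        double w = widthPt * MM_PER_POINT;
        double h = heightPt * MM_PER_POINT;
        // accept both portrait and landscape orientation
        return (near(w, 210, toleranceMm) && near(h, 297, toleranceMm))
            || (near(w, 297, toleranceMm) && near(h, 210, toleranceMm));
    }

    static boolean near(double value, double target, double tolerance)
    {
        return Math.abs(value - target) <= tolerance;
    }

    public static void main(String[] args)
    {
        System.out.println(isA4(595.28, 841.89, 1.0)); // true  (nominal A4)
        System.out.println(isA4(612.0, 792.0, 1.0));   // false (US Letter)
    }
}
```

In practice one would feed this the page's media box width and height; a small
tolerance is needed because real-world A4 media boxes are rarely exact.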


-- 
BEKK Open
http://open.bekk.no

TesTcl - a unit test framework for iRules
http://testcl.com
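
For reference, the setup Maruan describes amounts to setting the system
property before the document is loaded. A minimal sketch of just the property
handling; the PDDocument.loadNonSeq call named in the thread is left as a
comment so the example compiles without PDFBox:

```java
// Setting the parseMinimal flag before the parser is created.
public class ParseMinimalFlag
{
    static final String PARSE_MINIMAL =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal";

    public static void main(String[] args)
    {
        // equivalent to passing
        // -Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal=true
        System.setProperty(PARSE_MINIMAL, "true");

        // Boolean.getBoolean reads a boolean-valued system property
        boolean minimal = Boolean.getBoolean(PARSE_MINIMAL);
        System.out.println(minimal); // true

        // with PDFBox on the classpath one would then call, per the thread:
        // PDDocument doc = PDDocument.loadNonSeq(new File("huge.pdf"), scratchFile);
    }
}
```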

Re: Stream parsing huge PDF document in order to prevent memory issues

Posted by Stefan Magnus Landrø <st...@gmail.com>.
Here it is: https://issues.apache.org/jira/browse/PDFBOX-1965

Thanks

Stefan


2014-03-07 12:47 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:

> Hi Stefan,
>
> unfortunately this seems to be a bug. When the parseMinimal property is
> set to true, indirect objects are not followed when the PDF is parsed. May
> I ask you to file an issue in Jira [
> https://issues.apache.org/jira/browse/PDFBOX/] and attach the PDF file in
> question?
>
> BR
> Maruan Sahyoun

Re: Stream parsing huge PDF document in order to prevent memory issues

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Stefan,

unfortunately this seems to be a bug. When the parseMinimal property is set to true, indirect objects are not followed when the PDF is parsed. May I ask you to file an issue in Jira [https://issues.apache.org/jira/browse/PDFBOX/] and attach the PDF file in question?

BR
Maruan Sahyoun
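
To illustrate what "indirect objects are not followed" means, a simplified
stand-in (NOT PDFBox code): a dictionary value may be an indirect reference
that has to be resolved through the pool of parsed objects built from the
cross-reference table, and if the referenced object was never parsed,
resolution yields null.

```java
// Simplified model of indirect-reference resolution; stand-in types only.
import java.util.HashMap;
import java.util.Map;

public class IndirectRefs
{
    // object number -> parsed object (stand-in for the parsed-object pool)
    static final Map<Integer, Object> objectPool = new HashMap<>();

    static Object resolve(Object value)
    {
        if (value instanceof Integer) // stand-in for an indirect reference like "5 0 R"
        {
            return objectPool.get(value); // null when the object was never parsed
        }
        return value; // direct object: usable as-is
    }

    public static void main(String[] args)
    {
        objectPool.put(5, "resources dictionary of page 1"); // object 5 was parsed

        System.out.println(resolve(5)); // resources dictionary of page 1
        System.out.println(resolve(7)); // null -> a later cast or use throws the NPE
    }
}
```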

Am 07.03.2014 um 07:11 schrieb Maruan Sahyoun <sa...@fileaffairs.de>:

> Hi Stefan,
> 
> just fine. If I need more information I’ll let you know.
> 
> BR
> Maruan Sahyoun

Re: Stream parsing huge PDF document in order to prevent memory issues

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Stefan,

just fine. If I need more information I’ll let you know.

BR
Maruan Sahyoun

Am 06.03.2014 um 23:53 schrieb Stefan Magnus Landrø <st...@gmail.com>:

> Hi Maruan,
> 
> So I created a small Maven project containing a PDF file I just generated
> on my Mac, and pushed it to https://github.com/landro/pdfboxbug
> I could create a zip and upload it to your bug tracker, but that feels
> kinda awkward.
> What do you prefer?
> 
> Stefan


Re: Stream parsing huge PDF document in order to prevent memory issues

Posted by Stefan Magnus Landrø <st...@gmail.com>.
Hi Maruan,

So I created a small Maven project containing a PDF file I just generated
on my Mac, and pushed it to https://github.com/landro/pdfboxbug
I could create a zip and upload it to your bug tracker, but that feels kinda
awkward.
What do you prefer?

Stefan



2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:

> Yes please, file a bug report together with a sample PDF and sample code
> to reproduce the issue. Which PDFBox version are you using?
>
> BR
> Maruan Sahyoun

Re: Stream parsing huge PDF document in order to prevent memory issues

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Yes please, file a bug report together with a sample PDF and sample code to reproduce the issue. Which PDFBox version are you using?

BR
Maruan Sahyoun
