Posted to users@pdfbox.apache.org by Stefan Magnus Landrø <st...@gmail.com> on 2014/03/06 15:39:24 UTC
Re: Stream parsing huge PDF document in order to prevent memory issues
Hi there,
So I tried using the NonSequentialParser, setting the
org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property to
true.
The memory footprint looks much better; however, I can't get the individual
pages due to an NPE in the getPage code.
It turns out the resDict below is mostly null, which in turn causes an NPE in
parseDictObjects.
Should I file a bug?
Stefan
public PDPage getPage(int pageNr) throws IOException
{
    getPagesObject();

    // ---- get list of top level pages
    COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);

    if (kids == null)
    {
        throw new IOException("Missing 'Kids' entry in pages dictionary.");
    }

    // ---- get page we are looking for (possibly going recursively into subpages)
    COSObject pageObj = getPageObject(pageNr, kids, 0);

    if (pageObj == null)
    {
        throw new IOException("Page " + pageNr + " not found.");
    }

    // ---- parse all objects necessary to load page.
    COSDictionary pageDict = (COSDictionary) pageObj.getObject();

    if (parseMinimalCatalog && (!allPagesParsed))
    {
        // parse page resources since we did not do this on start
        COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
        parseDictObjects(resDict);
    }

    return new PDPage(pageDict);
}
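
For reference, the loading setup described above (PDDocument.loadNonSeq plus the parseMinimal system property) can be sketched roughly as follows. This is an illustrative sketch against the PDFBox 1.8.x API only; the file names are placeholders, and page access is deliberately left out since that is exactly where the NPE occurs.

import java.io.File;

import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdmodel.PDDocument;

public class MinimalLoadSketch
{
    public static void main(String[] args) throws Exception
    {
        // Enable minimal parsing before the document is loaded.
        System.setProperty(
                "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal",
                "true");

        // A scratch file keeps parsed object data out of main memory.
        // Both paths below are placeholders.
        RandomAccessFile scratch = new RandomAccessFile(new File("scratch.tmp"), "rw");
        PDDocument doc = PDDocument.loadNonSeq(new File("huge.pdf"), scratch);
        try
        {
            // Only header-level information is touched here; fetching pages
            // under parseMinimal is what fails in the thread above.
            System.out.println("PDF version: " + doc.getDocument().getVersion());
        }
        finally
        {
            doc.close();
        }
    }
}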
2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
> Hi,
>
> PDF is a random access format with key information (the Cross Reference
> where to find the objects) being at the end of the file and the PDF objects
> spread around the file.
>
> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> instead of PDDocument.load and set the system property
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which does
> a minimal parsing of the PDF. That could reduce the memory consumption a
> little bit. Unfortunately once an object has been parsed its content
> stays in memory so you would need to do a low level parsing yourself with
> the information available from the initial parsing stage.
>
> Maruan Sahyoun
>
> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>
> > Hi there,
> >
> > I'm trying to validate random pdfs (potentially huge - 100s of MBs)
> > according to the following rule set:
> > - Dimensions of all pages should be A4 (297 mm * 210 mm)
> > - There should be no content within a certain rectangular area of a page
> > (left margin where the print shop inserts a bar code)
> > - Number of pages should be less than N
> > - PDF version used
> >
> > So far we've been using
> >
> > PDDocument.load with a scratch file, but with huge documents (e.g.
> product
> > catalogues), things explode.
> > Is there a way to stream parse a PDF similar to stream parsing an XML
> > document (e.g. using StAX) and validate one page at a time?
> >
> > Cheers
> >
> > Stefan
>
>
--
BEKK Open
http://open.bekk.no
TesTcl - a unit test framework for iRules
http://testcl.com
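
As an aside on the rule set quoted in the original question, the checks can be expressed roughly as below with the standard PDDocument.load path and the PDFBox 1.8 API. This is an illustrative sketch only: the file name, page limit, tolerance and the bar-code margin rectangle are made-up values, page dimensions are compared in PDF points (A4 is about 595 x 842 pt), and PDFTextStripperByArea only detects text, so images or vector graphics in the margin would need a content-stream level check instead.

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class PdfRuleCheckSketch
{
    // A4 in PDF points (210 x 297 mm at 72 points per inch), with a small tolerance.
    private static final float A4_WIDTH = 595f;
    private static final float A4_HEIGHT = 842f;
    private static final float TOLERANCE = 3f;
    private static final int MAX_PAGES = 1000; // placeholder for "N"

    public static void main(String[] args) throws Exception
    {
        PDDocument doc = PDDocument.load(new File("input.pdf"));
        try
        {
            System.out.println("PDF version: " + doc.getDocument().getVersion());

            List<?> pages = doc.getDocumentCatalog().getAllPages();
            if (pages.size() >= MAX_PAGES)
            {
                System.out.println("Too many pages: " + pages.size());
            }

            for (Object o : pages)
            {
                PDPage page = (PDPage) o;
                PDRectangle box = page.findMediaBox();
                if (Math.abs(box.getWidth() - A4_WIDTH) > TOLERANCE
                        || Math.abs(box.getHeight() - A4_HEIGHT) > TOLERANCE)
                {
                    System.out.println("Page is not A4: " + box);
                }

                // Text-only check of the left margin (placeholder rectangle,
                // roughly 20 mm wide); graphics would need a different check.
                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.addRegion("barcodeMargin",
                        new Rectangle2D.Float(0, 0, 57, box.getHeight()));
                stripper.extractRegions(page);
                if (stripper.getTextForRegion("barcodeMargin").trim().length() > 0)
                {
                    System.out.println("Text found in bar code margin");
                }
            }
        }
        finally
        {
            doc.close();
        }
    }
}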
Re: Stream parsing huge PDF document in order to prevent memory issues
Posted by Stefan Magnus Landrø <st...@gmail.com>.
Here it is: https://issues.apache.org/jira/browse/PDFBOX-1965
Thanks
Stefan
2014-03-07 12:47 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
> Hi Stefan,
>
> Unfortunately this seems to be a bug. When the parseMinimal property is
> set to true, indirect objects are not followed when the PDF is parsed. May I
> ask you to file an issue in Jira [
> https://issues.apache.org/jira/browse/PDFBOX/] and attach the PDF file in
> question?
>
> BR
> Maruan Sahyoun
>
> On 07.03.2014 at 07:11, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>
> > Hi Stefan,
> >
> > just fine. If I need more information I’ll let you know.
> >
> > BR
> > Maruan Sahyoun
> >
> > On 06.03.2014 at 23:53, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
> >
> >> Hi Maruan,
> >>
> >> So I created a small maven project containing a PDF-file I just
> generated
> >> on my mac, and pushed it to https://github.com/landro/pdfboxbug
> >> I could create a zip and upload to your bugtracker, but that feels kinda
> >> awkward.
> >> What do you prefer?
> >>
> >> Stefan
> >>
> >>
> >>
> >> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
> >>
> >>> Yes please, file a bug report together with a sample PDF and sample
> code
> >>> to reproduce the issue. Which PDFBox version are you using?
> >>>
> >>> BR
> >>> Maruan Sahyoun
> >>>
> >>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
> >>>
> >>>> Hi there,
> >>>>
> >>>> So I tried using the NonSequentialParser setting the
> >>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal
> property
> >>> to
> >>>> true.
> >>>>
> >>>> The memory footprint looks much better, however, I can't get the
> >>> individual
> >>>> pages due to a NPE in the getPage code.
> >>>>
> >>>> It turns out the resDict below is mostly null - which again causes a
> NPE
> >>> in
> >>>> parseDictObjects.
> >>>>
> >>>> Should I file a bug?
> >>>>
> >>>> Stefan
> >>>>
> >>>>
> >>>> public PDPage getPage(int pageNr) throws IOException
> >>>> {
> >>>> getPagesObject();
> >>>>
> >>>> // ---- get list of top level pages
> >>>> COSArray kids = (COSArray)
> >>>> pagesDictionary.getDictionaryObject(COSName.KIDS);
> >>>>
> >>>> if (kids == null)
> >>>> {
> >>>> throw new IOException("Missing 'Kids' entry in pages
> >>>> dictionary.");
> >>>> }
> >>>>
> >>>> // ---- get page we are looking for (possibly going recursively
> >>> into
> >>>> // subpages)
> >>>> COSObject pageObj = getPageObject(pageNr, kids, 0);
> >>>>
> >>>> if (pageObj == null)
> >>>> {
> >>>> throw new IOException("Page " + pageNr + " not found.");
> >>>> }
> >>>>
> >>>> // ---- parse all objects necessary to load page.
> >>>> COSDictionary pageDict = (COSDictionary) pageObj.getObject();
> >>>>
> >>>> if (parseMinimalCatalog && (!allPagesParsed))
> >>>> {
> >>>> // parse page resources since we did not do this on start
> >>>> COSDictionary resDict = (COSDictionary)
> >>>> pageDict.getDictionaryObject(COSName.RESOURCES);
> >>>> parseDictObjects(resDict);
> >>>> }
> >>>>
> >>>> return new PDPage(pageDict);
> >>>> }
> >>>>
> >>>>
> >>>>
> >>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> PDF is a random access format with key information (the Cross
> Reference
> >>>>> where to find the objects) being at the end of the file and the PDF
> >>> objects
> >>>>> spread around the file.
> >>>>>
> >>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> >>>>> instead of PDDocument.load and set the system property
> >>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which
> >>> does
> >>>>> a minimal parsing of the PDF. That could reduce the memory
> consumption a
> >>>>> little bit. Unfortunately once an object has been parsed its
> content
> >>>>> stays in memory so you would need to do a low level parsing yourself
> >>> with
> >>>>> the information available from the initial parsing stage.
> >>>>>
> >>>>> Maruan Sahyoun
> >>>>>
> >>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
> >>>>>
> >>>>>> Hi there,
> >>>>>>
> >>>>>> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
> >>>>>> according to the following rule set:
> >>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> >>>>>> - There should be no content within a certain rectangular area of a
> >>> page
> >>>>>> (left margin where the print shop inserts a bar code)
> >>>>>> - Number of pages should be less than N
> >>>>>> - PDF version used
> >>>>>>
> >>>>>> So far we've been using
> >>>>>>
> >>>>>> PDDocument.load with a scratch file, but with huge documents (e.g.
> >>>>> product
> >>>>>> catalogues), things explode.
> >>>>>> Is there a way to stream parse a PDF similar to stream parsing an
> XML
> >>>>>> document (e.g. using StAX) and validate one page at a time?
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>> Stefan
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> BEKK Open
> >>>> http://open.bekk.no
> >>>>
> >>>> TesTcl - a unit test framework for iRules
> >>>> http://testcl.com
> >>>
> >>>
> >>
> >>
> >> --
> >> BEKK Open
> >> http://open.bekk.no
> >>
> >> TesTcl - a unit test framework for iRules
> >> http://testcl.com
> >
>
>
--
BEKK Open
http://open.bekk.no
TesTcl - a unit test framework for iRules
http://testcl.com
Re: Stream parsing huge PDF document in order to prevent memory issues
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Stefan,
Unfortunately this seems to be a bug. When the parseMinimal property is set to true, indirect objects are not followed when the PDF is parsed. May I ask you to file an issue in Jira [https://issues.apache.org/jira/browse/PDFBOX/] and attach the PDF file in question?
BR
Maruan Sahyoun
On 07.03.2014 at 07:11, Maruan Sahyoun <sa...@fileaffairs.de> wrote:
> Hi Stefan,
>
> just fine. If I need more information I’ll let you know.
>
> BR
> Maruan Sahyoun
>
> On 06.03.2014 at 23:53, Stefan Magnus Landrø <st...@gmail.com> wrote:
>
>> Hi Maruan,
>>
>> So I created a small maven project containing a PDF-file I just generated
>> on my mac, and pushed it to https://github.com/landro/pdfboxbug
>> I could create a zip and upload to your bugtracker, but that feels kinda
>> awkward.
>> What do you prefer?
>>
>> Stefan
>>
>>
>>
>> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
>>
>>> Yes please, file a bug report together with a sample PDF and sample code
>>> to reproduce the issue. Which PDFBox version are you using?
>>>
>>> BR
>>> Maruan Sahyoun
>>>
>>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>>>
>>>> Hi there,
>>>>
>>>> So I tried using the NonSequentialParser setting the
>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
>>> to
>>>> true.
>>>>
>>>> The memory footprint looks much better, however, I can't get the
>>> individual
>>>> pages due to a NPE in the getPage code.
>>>>
>>>> It turns out the resDict below is mostly null - which again causes a NPE
>>> in
>>>> parseDictObjects.
>>>>
>>>> Should I file a bug?
>>>>
>>>> Stefan
>>>>
>>>>
>>>> public PDPage getPage(int pageNr) throws IOException
>>>> {
>>>> getPagesObject();
>>>>
>>>> // ---- get list of top level pages
>>>> COSArray kids = (COSArray)
>>>> pagesDictionary.getDictionaryObject(COSName.KIDS);
>>>>
>>>> if (kids == null)
>>>> {
>>>> throw new IOException("Missing 'Kids' entry in pages
>>>> dictionary.");
>>>> }
>>>>
>>>> // ---- get page we are looking for (possibly going recursively
>>> into
>>>> // subpages)
>>>> COSObject pageObj = getPageObject(pageNr, kids, 0);
>>>>
>>>> if (pageObj == null)
>>>> {
>>>> throw new IOException("Page " + pageNr + " not found.");
>>>> }
>>>>
>>>> // ---- parse all objects necessary to load page.
>>>> COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>>>>
>>>> if (parseMinimalCatalog && (!allPagesParsed))
>>>> {
>>>> // parse page resources since we did not do this on start
>>>> COSDictionary resDict = (COSDictionary)
>>>> pageDict.getDictionaryObject(COSName.RESOURCES);
>>>> parseDictObjects(resDict);
>>>> }
>>>>
>>>> return new PDPage(pageDict);
>>>> }
>>>>
>>>>
>>>>
>>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
>>>>
>>>>> Hi,
>>>>>
>>>>> PDF is a random access format with key information (the Cross Reference
>>>>> where to find the objects) being at the end of the file and the PDF
>>> objects
>>>>> spread around the file.
>>>>>
>>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>>>>> instead of PDDocument.load and set the system property
>>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which
>>> does
>>>>> a minimal parsing of the PDF. That could reduce the memory consumption a
>>>>> little bit. Unfortunately once an object has been parsed its content
>>>>> stays in memory so you would need to do a low level parsing yourself
>>> with
>>>>> the information available from the initial parsing stage.
>>>>>
>>>>> Maruan Sahyoun
>>>>>
>>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
>>>>>> according to the following rule set:
>>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>>>>> - There should be no content within a certain rectangular area of a
>>> page
>>>>>> (left margin where the print shop inserts a bar code)
>>>>>> - Number of pages should be less than N
>>>>>> - PDF version used
>>>>>>
>>>>>> So far we've been using
>>>>>>
>>>>>> PDDocument.load with a scratch file, but with huge documents (e.g.
>>>>> product
>>>>>> catalogues), things explode.
>>>>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>>>>> document (e.g. using StAX) and validate one page at a time?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> Stefan
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> BEKK Open
>>>> http://open.bekk.no
>>>>
>>>> TesTcl - a unit test framework for iRules
>>>> http://testcl.com
>>>
>>>
>>
>>
>> --
>> BEKK Open
>> http://open.bekk.no
>>
>> TesTcl - a unit test framework for iRules
>> http://testcl.com
>
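
A possible stop-gap, until the issue is fixed, is to guard the resource pre-parsing step in the getPage code quoted above so a missing Resources dictionary no longer triggers the NPE. This is only an illustrative sketch, not the actual fix for PDFBOX-1965; the page's resources would then simply not be pre-parsed, which may only move the problem elsewhere.

if (parseMinimalCatalog && (!allPagesParsed))
{
    // parse page resources since we did not do this on start;
    // skip quietly if the Resources dictionary could not be resolved
    COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
    if (resDict != null)
    {
        parseDictObjects(resDict);
    }
}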
Re: Stream parsing huge PDF document in order to prevent memory issues
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Hi Stefan,
just fine. If I need more information I’ll let you know.
BR
Maruan Sahyoun
On 06.03.2014 at 23:53, Stefan Magnus Landrø <st...@gmail.com> wrote:
> Hi Maruan,
>
> So I created a small maven project containing a PDF-file I just generated
> on my mac, and pushed it to https://github.com/landro/pdfboxbug
> I could create a zip and upload to your bugtracker, but that feels kinda
> awkward.
> What do you prefer?
>
> Stefan
>
>
>
> 2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
>
>> Yes please, file a bug report together with a sample PDF and sample code
>> to reproduce the issue. Which PDFBox version are you using?
>>
>> BR
>> Maruan Sahyoun
>>
>> On 06.03.2014 at 15:39, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> So I tried using the NonSequentialParser setting the
>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
>> to
>>> true.
>>>
>>> The memory footprint looks much better, however, I can't get the
>> individual
>>> pages due to a NPE in the getPage code.
>>>
>>> It turns out the resDict below is mostly null - which again causes a NPE
>> in
>>> parseDictObjects.
>>>
>>> Should I file a bug?
>>>
>>> Stefan
>>>
>>>
>>> public PDPage getPage(int pageNr) throws IOException
>>> {
>>> getPagesObject();
>>>
>>> // ---- get list of top level pages
>>> COSArray kids = (COSArray)
>>> pagesDictionary.getDictionaryObject(COSName.KIDS);
>>>
>>> if (kids == null)
>>> {
>>> throw new IOException("Missing 'Kids' entry in pages
>>> dictionary.");
>>> }
>>>
>>> // ---- get page we are looking for (possibly going recursively
>> into
>>> // subpages)
>>> COSObject pageObj = getPageObject(pageNr, kids, 0);
>>>
>>> if (pageObj == null)
>>> {
>>> throw new IOException("Page " + pageNr + " not found.");
>>> }
>>>
>>> // ---- parse all objects necessary to load page.
>>> COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>>>
>>> if (parseMinimalCatalog && (!allPagesParsed))
>>> {
>>> // parse page resources since we did not do this on start
>>> COSDictionary resDict = (COSDictionary)
>>> pageDict.getDictionaryObject(COSName.RESOURCES);
>>> parseDictObjects(resDict);
>>> }
>>>
>>> return new PDPage(pageDict);
>>> }
>>>
>>>
>>>
>>> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
>>>
>>>> Hi,
>>>>
>>>> PDF is a random access format with key information (the Cross Reference
>>>> where to find the objects) being at the end of the file and the PDF
>> objects
>>>> spread around the file.
>>>>
>>>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>>>> instead of PDDocument.load and set the system property
>>>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which
>> does
>>>> a minimal parsing of the PDF. That could reduce the memory consumption a
>>>> little bit. Unfortunately once an object has been parsed its content
>>>> stays in memory so you would need to do a low level parsing yourself
>> with
>>>> the information available from the initial parsing stage.
>>>>
>>>> Maruan Sahyoun
>>>>
>>>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
>>>>> according to the following rule set:
>>>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>>>> - There should be no content within a certain rectangular area of a
>> page
>>>>> (left margin where the print shop inserts a bar code)
>>>>> - Number of pages should be less than N
>>>>> - PDF version used
>>>>>
>>>>> So far we've been using
>>>>>
>>>>> PDDocument.load with a scratch file, but with huge documents (e.g.
>>>> product
>>>>> catalogues), things explode.
>>>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>>>> document (e.g. using StAX) and validate one page at a time?
>>>>>
>>>>> Cheers
>>>>>
>>>>> Stefan
>>>>
>>>>
>>>
>>>
>>> --
>>> BEKK Open
>>> http://open.bekk.no
>>>
>>> TesTcl - a unit test framework for iRules
>>> http://testcl.com
>>
>>
>
>
> --
> BEKK Open
> http://open.bekk.no
>
> TesTcl - a unit test framework for iRules
> http://testcl.com
Re: Stream parsing huge PDF document in order to prevent memory issues
Posted by Stefan Magnus Landrø <st...@gmail.com>.
Hi Maruan,
So I created a small Maven project containing a PDF file I just generated
on my Mac, and pushed it to https://github.com/landro/pdfboxbug
I could create a zip and upload it to your bug tracker, but that feels kinda
awkward.
What do you prefer?
Stefan
2014-03-06 15:47 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
> Yes please, file a bug report together with a sample PDF and sample code
> to reproduce the issue. Which PDFBox version are you using?
>
> BR
> Maruan Sahyoun
>
> On 06.03.2014 at 15:39, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>
> > Hi there,
> >
> > So I tried using the NonSequentialParser setting the
> > org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property
> to
> > true.
> >
> > The memory footprint looks much better, however, I can't get the
> individual
> > pages due to a NPE in the getPage code.
> >
> > It turns out the resDict below is mostly null - which again causes a NPE
> in
> > parseDictObjects.
> >
> > Should I file a bug?
> >
> > Stefan
> >
> >
> > public PDPage getPage(int pageNr) throws IOException
> > {
> > getPagesObject();
> >
> > // ---- get list of top level pages
> > COSArray kids = (COSArray)
> > pagesDictionary.getDictionaryObject(COSName.KIDS);
> >
> > if (kids == null)
> > {
> > throw new IOException("Missing 'Kids' entry in pages
> > dictionary.");
> > }
> >
> > // ---- get page we are looking for (possibly going recursively
> into
> > // subpages)
> > COSObject pageObj = getPageObject(pageNr, kids, 0);
> >
> > if (pageObj == null)
> > {
> > throw new IOException("Page " + pageNr + " not found.");
> > }
> >
> > // ---- parse all objects necessary to load page.
> > COSDictionary pageDict = (COSDictionary) pageObj.getObject();
> >
> > if (parseMinimalCatalog && (!allPagesParsed))
> > {
> > // parse page resources since we did not do this on start
> > COSDictionary resDict = (COSDictionary)
> > pageDict.getDictionaryObject(COSName.RESOURCES);
> > parseDictObjects(resDict);
> > }
> >
> > return new PDPage(pageDict);
> > }
> >
> >
> >
> > 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
> >
> >> Hi,
> >>
> >> PDF is a random access format with key information (the Cross Reference
> >> where to find the objects) being at the end of the file and the PDF
> objects
> >> spread around the file.
> >>
> >> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
> >> instead of PDDocument.load and set the system property
> >> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which
> does
> >> a minimal parsing of the PDF. That could reduce the memory consumption a
> >> little bit. Unfortunately once an object has been parsed its content
> >> stays in memory so you would need to do a low level parsing yourself
> with
> >> the information available from the initial parsing stage.
> >>
> >> Maruan Sahyoun
> >>
> >> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
> >>
> >>> Hi there,
> >>>
> >>> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
> >>> according to the following rule set:
> >>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
> >>> - There should be no content within a certain rectangular area of a
> page
> >>> (left margin where the print shop inserts a bar code)
> >>> - Number of pages should be less than N
> >>> - PDF version used
> >>>
> >>> So far we've been using
> >>>
> >>> PDDocument.load with a scratch file, but with huge documents (e.g.
> >> product
> >>> catalogues), things explode.
> >>> Is there a way to stream parse a PDF similar to stream parsing an XML
> >>> document (e.g. using StAX) and validate one page at a time?
> >>>
> >>> Cheers
> >>>
> >>> Stefan
> >>
> >>
> >
> >
> > --
> > BEKK Open
> > http://open.bekk.no
> >
> > TesTcl - a unit test framework for iRules
> > http://testcl.com
>
>
--
BEKK Open
http://open.bekk.no
TesTcl - a unit test framework for iRules
http://testcl.com
Re: Stream parsing huge PDF document in order to prevent memory issues
Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
Yes please, file a bug report together with a sample PDF and sample code to reproduce the issue. Which PDFBox version are you using?
BR
Maruan Sahyoun
On 06.03.2014 at 15:39, Stefan Magnus Landrø <st...@gmail.com> wrote:
> Hi there,
>
> So I tried using the NonSequentialParser setting the
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal property to
> true.
>
> The memory footprint looks much better, however, I can't get the individual
> pages due to a NPE in the getPage code.
>
> It turns out the resDict below is mostly null - which again causes a NPE in
> parseDictObjects.
>
> Should I file a bug?
>
> Stefan
>
>
> public PDPage getPage(int pageNr) throws IOException
> {
>     getPagesObject();
>
>     // ---- get list of top level pages
>     COSArray kids = (COSArray) pagesDictionary.getDictionaryObject(COSName.KIDS);
>
>     if (kids == null)
>     {
>         throw new IOException("Missing 'Kids' entry in pages dictionary.");
>     }
>
>     // ---- get page we are looking for (possibly going recursively into subpages)
>     COSObject pageObj = getPageObject(pageNr, kids, 0);
>
>     if (pageObj == null)
>     {
>         throw new IOException("Page " + pageNr + " not found.");
>     }
>
>     // ---- parse all objects necessary to load page.
>     COSDictionary pageDict = (COSDictionary) pageObj.getObject();
>
>     if (parseMinimalCatalog && (!allPagesParsed))
>     {
>         // parse page resources since we did not do this on start
>         COSDictionary resDict = (COSDictionary) pageDict.getDictionaryObject(COSName.RESOURCES);
>         parseDictObjects(resDict);
>     }
>
>     return new PDPage(pageDict);
> }
>
>
>
> 2014-02-14 10:35 GMT+01:00 Maruan Sahyoun <sa...@fileaffairs.de>:
>
>> Hi,
>>
>> PDF is a random access format with key information (the Cross Reference
>> where to find the objects) being at the end of the file and the PDF objects
>> spread around the file.
>>
>> You can use the NonSequentialParser by calling PDDocument.loadNonSeq
>> instead of PDDocument.load and set the system property
>> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal which does
>> a minimal parsing of the PDF. That could reduce the memory consumption a
>> little bit. Unfortunately once an object has been parsed its content
>> stays in memory so you would need to do a low level parsing yourself with
>> the information available from the initial parsing stage.
>>
>> Maruan Sahyoun
>>
>> On 14.02.2014 at 09:50, Stefan Magnus Landrø <stefan.landro@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I'm trying to validate random pdfs (potentially huge - 100s of MBs)
>>> according to the following rule set:
>>> - Dimensions of all pages should be A4 (297 mm * 210 mm)
>>> - There should be no content within a certain rectangular area of a page
>>> (left margin where the print shop inserts a bar code)
>>> - Number of pages should be less than N
>>> - PDF version used
>>>
>>> So far we've been using
>>>
>>> PDDocument.load with a scratch file, but with huge documents (e.g.
>> product
>>> catalogues), things explode.
>>> Is there a way to stream parse a PDF similar to stream parsing an XML
>>> document (e.g. using StAX) and validate one page at a time?
>>>
>>> Cheers
>>>
>>> Stefan
>>
>>
>
>
> --
> BEKK Open
> http://open.bekk.no
>
> TesTcl - a unit test framework for iRules
> http://testcl.com