
Posted to users@pdfbox.apache.org by Pierre Huttin <pi...@huttin.com> on 2013/03/21 09:41:28 UTC

Open in ReadOnly very large file.

Hello,

I'm trying to work with a very large PDF file (21 GB) and want to extract 
some pages. The problem is that when I load the file into a PDDocument, it 
creates a scratch file of about the same size as the file, and yesterday 
evening it took 3 h 30 min just to load the file.

I'm loading it with the PDDocument.loadNonSeq method.

Is it possible to open the file read-only and read everything directly 
from disk? I don't really understand why the complete file has to be 
copied into a scratch file just for reading.

Thanks for your answers, comments, or ideas on how to solve this.

Pierre Huttin

Re: Open in ReadOnly very large file.

Posted by Pierre Huttin <pi...@huttin.com>.

Hi Maruan,

Thanks for the ideas; I will test them.

From this huge file I need to extract some specific pages (the 
references come from an external system) and create a small PDF 
from them (in the current context, 1 to 5 pages at most per 
PDF created).

Pierre Huttin

On 21.03.2013 09:49, Maruan Sahyoun wrote:
> Hi Pierre,
>
> If you load from an input stream, a temporary file will be created.
> Try loading from a java.io.File or passing the file name instead.
> Also, you do not have to provide a scratch file, but in that case
> your memory consumption will be much higher.
>
> In addition, the NonSequentialPDFParser supports the system property
> org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal.
> When it is set to 'true', object references in the catalog are not
> followed. That might help (I have never used it myself, though; I only
> looked it up in the sources). It depends on your use case.
>
> What are you trying to do with the file? Which information are you
> looking for?
>
> Maruan Sahyoun

Re: Open in ReadOnly very large file.

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.

Hi Pierre,

If you load from an input stream, a temporary file will be created. Try loading from a java.io.File or passing the file name instead. Also, you do not have to provide a scratch file, but in that case your memory consumption will be much higher.
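
To illustrate why a java.io.File helps: a file on disk can be read at any position, while an input stream can only be consumed front to back and therefore has to be buffered somewhere first. The sketch below uses only the JDK, not PDFBox. It reads just the tail of a file to find the offset after the "startxref" keyword that PDF files carry near their end. The class name TailReader, the method names, and the stand-in file content are hypothetical, and this is not how PDFBox itself is implemented.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class TailReader {

    /**
     * Reads at most maxTail bytes from the end of the file and returns the
     * number that follows the last "startxref" keyword, or -1 if none is found.
     */
    public static long findStartXref(File file, int maxTail) {
        // Mode "r" opens the file read-only.
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long length = raf.length();
            int tailSize = (int) Math.min(length, maxTail);
            byte[] tail = new byte[tailSize];
            raf.seek(length - tailSize); // jump to the tail; earlier bytes are never read
            raf.readFully(tail);

            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int idx = text.lastIndexOf("startxref");
            if (idx < 0) {
                return -1;
            }
            // Collect the decimal digits that follow the keyword.
            String rest = text.substring(idx + "startxref".length()).trim();
            int end = 0;
            while (end < rest.length() && Character.isDigit(rest.charAt(end))) {
                end++;
            }
            return end == 0 ? -1 : Long.parseLong(rest.substring(0, end));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a small stand-in file (not a valid PDF) and scans its last 1024 bytes. */
    public static long demo() {
        try {
            File tmp = File.createTempFile("tail-demo", ".pdf");
            tmp.deleteOnExit();
            StringBuilder sb = new StringBuilder("%PDF-1.4\n");
            for (int i = 0; i < 5000; i++) {
                sb.append('x'); // filler standing in for the document body
            }
            sb.append("\nstartxref\n4242\n%%EOF\n");
            Files.write(tmp.toPath(), sb.toString().getBytes(StandardCharsets.ISO_8859_1));
            return findStartXref(tmp, 1024);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

However large the file is, only the last 1024 bytes are read here. That kind of access is only possible with a file on disk, not with a plain input stream.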

In addition, the NonSequentialPDFParser supports the system property org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal. When it is set to 'true', object references in the catalog are not followed. That might help (I have never used it myself, though; I only looked it up in the sources). It depends on your use case.
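
A minimal sketch of setting that property from code, using only the JDK. The property name is the one quoted above; the class name ParseMinimalFlag and its methods are hypothetical. It assumes the property has to be set before the parser is created, which I have not verified.

```java
public class ParseMinimalFlag {

    // Property name as quoted in the message above.
    public static final String PARSE_MINIMAL =
            "org.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal";

    /** Sets the flag for the current JVM; call this before loading the PDF. */
    public static void enable() {
        System.setProperty(PARSE_MINIMAL, "true");
    }

    /** Returns true if the flag is currently set to "true". */
    public static boolean isEnabled() {
        return Boolean.getBoolean(PARSE_MINIMAL);
    }

    public static void main(String[] args) {
        enable();
        System.out.println(PARSE_MINIMAL + " = " + System.getProperty(PARSE_MINIMAL));
        // The document would be loaded after this point,
        // for example with PDDocument.loadNonSeq.
    }
}
```

The same flag can be passed on the command line instead, as -Dorg.apache.pdfbox.pdfparser.nonSequentialPDFParser.parseMinimal=true, which avoids any question of ordering inside the program.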

What are you trying to do with the file? Which information are you looking for?

Maruan Sahyoun
