Posted to users@pdfbox.apache.org by Lars Juel Jensen <la...@gmail.com> on 2024/01/31 08:50:39 UTC

Loading a PDF using InputStream

In PDFBox2 I could do:

PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())

But there is no equivalent to this in PDFBox 3. How do I read a PDF from an
InputStream?

Re: Loading a PDF using InputStream

Posted by Tilman Hausherr <TH...@t-online.de>.
P.S.: thank you for having investigated and reported this!

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: Loading a PDF using InputStream

Posted by Tilman Hausherr <TH...@t-online.de>.
Oh. I had looked at the trunk and not at 3.0. That was likely a mistake 
in refactoring. Fixed in

  https://issues.apache.org/jira/browse/PDFBOX-5757

and you can get a snapshot here
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/3.0.2-SNAPSHOT/

Tilman


On 01.02.2024 15:25, Lars Juel Jensen wrote:
> That is weird.. The source file I am looking at for version 3.0.1 does not
> pass it:
> -->
> https://github.com/apache/pdfbox/blob/3.0.1/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java#L91


Re: Loading a PDF using InputStream

Posted by Lars Juel Jensen <la...@gmail.com>.
That is weird. The source file I am looking at for version 3.0.1 does not
pass it:
-->
https://github.com/apache/pdfbox/blob/3.0.1/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java#L91

On Wed, Jan 31, 2024 at 4:57 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> On 31.01.2024 16:19, Lars Juel Jensen wrote:
> > Well that's my problem.. It works with PDFBox2 with reasonable sized
> files.
> > When it comes to the big ones it crashes.. So reading the migration guide
> > for PDFBox3.0 I thought I saw some light in the tunnel as it says I can
> > create my own reader and stream cache. I see that I can provide my own
> > RandomAccessReader when I call Loader.loadPDF, but the loadPDF method
> that
> > takes a StreamCacheCreate function does not work as promised as the
> > StreamCacheCreateFunction is not passed from PDFParser to COSParser in
> the
> > PDFParser constructor. This works in v3.0.0, but not in v3.0.1. I guess
> > this is a bug?
>
> I don't know if there is a bug, but it is passed:
>
>      public PDFParser(RandomAccessRead source, String decryptionPassword,
>              InputStream keyStore, String alias,
>              StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
>      {
>          super(source, decryptionPassword, keyStore, alias, streamCacheCreateFunction);
>      }
>
> and here's COSParser:
>
>      public COSParser(RandomAccessRead source, String password,
>              InputStream keyStore, String keyAlias,
>              StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
>      {
>          super(source);
>          this.password = password;
>          this.keyAlias = keyAlias;
>          fileLen = source.length();
>          keyStoreInputStream = keyStore;
>          init(streamCacheCreateFunction);
>      }
>
> If you think 3.0.1 has a bigger memory footprint than 3.0.0, can you
> create a scenario to reproduce this? Preferably without using a container.
>
> Tilman

Re: Loading a PDF using InputStream

Posted by Tilman Hausherr <TH...@t-online.de>.
On 31.01.2024 16:19, Lars Juel Jensen wrote:
> Well that's my problem.. It works with PDFBox2 with reasonable sized files.
> When it comes to the big ones it crashes.. So reading the migration guide
> for PDFBox3.0 I thought I saw some light in the tunnel as it says I can
> create my own reader and stream cache. I see that I can provide my own
> RandomAccessReader when I call Loader.loadPDF, but the loadPDF method that
> takes a StreamCacheCreate function does not work as promised as the
> StreamCacheCreateFunction is not passed from PDFParser to COSParser in the
> PDFParser constructor. This works in v3.0.0, but not in v3.0.1. I guess
> this is a bug?

I don't know if there is a bug, but it is passed:

     public PDFParser(RandomAccessRead source, String decryptionPassword,
             InputStream keyStore, String alias,
             StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
     {
         super(source, decryptionPassword, keyStore, alias, streamCacheCreateFunction);
     }

and here's COSParser:

     public COSParser(RandomAccessRead source, String password,
             InputStream keyStore, String keyAlias,
             StreamCacheCreateFunction streamCacheCreateFunction) throws IOException
     {
         super(source);
         this.password = password;
         this.keyAlias = keyAlias;
         fileLen = source.length();
         keyStoreInputStream = keyStore;
         init(streamCacheCreateFunction);
     }

If you think 3.0.1 has a bigger memory footprint than 3.0.0, can you 
create a scenario to reproduce this? Preferably without using a container.

Tilman



Re: Loading a PDF using InputStream

Posted by Lars Juel Jensen <la...@gmail.com>.
Well, that's my problem. It works with PDFBox 2 for reasonably sized files,
but it crashes on the big ones. Reading the migration guide for PDFBox 3.0,
I thought I saw some light at the end of the tunnel, as it says I can
create my own reader and stream cache. I see that I can provide my own
RandomAccessRead when I call Loader.loadPDF, but the loadPDF method that
takes a StreamCacheCreateFunction does not work as promised: the
StreamCacheCreateFunction is not passed from PDFParser to COSParser in the
PDFParser constructor. This works in v3.0.0, but not in v3.0.1. I guess
this is a bug?
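
For readers unfamiliar with this kind of regression, here is a toy
illustration of a constructor that accepts an argument but does not forward
it to its superclass. It is not the PDFBox source; the class names
(BaseParser, DroppingParser, ForwardingParser) and the String-producing
factory are hypothetical stand-ins chosen only to show the effect.

```java
import java.util.function.Supplier;

public class ForwardingSketch {
    /** Hypothetical stand-in for a base parser that owns the cache factory. */
    static class BaseParser {
        final Supplier<String> cacheFactory;

        BaseParser() {
            // Default that is used when the caller supplies no factory.
            this(() -> "default-cache");
        }

        BaseParser(Supplier<String> cacheFactory) {
            this.cacheFactory = cacheFactory;
        }
    }

    /** Subclass whose constructor forgets to forward its argument. */
    static class DroppingParser extends BaseParser {
        DroppingParser(Supplier<String> cacheFactory) {
            super(); // the caller's factory is silently ignored here
        }
    }

    /** Subclass that forwards the argument as intended. */
    static class ForwardingParser extends BaseParser {
        ForwardingParser(Supplier<String> cacheFactory) {
            super(cacheFactory);
        }
    }

    public static void main(String[] args) {
        Supplier<String> custom = () -> "temp-file-cache";
        // prints "default-cache": the custom factory never reached the base class
        System.out.println(new DroppingParser(custom).cacheFactory.get());
        // prints "temp-file-cache"
        System.out.println(new ForwardingParser(custom).cacheFactory.get());
    }
}
```

The point of the sketch is that such a mistake compiles cleanly and only
shows up as the default behaviour being used at runtime.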

On Wed, Jan 31, 2024 at 3:46 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> On 31.01.2024 14:48, Lars Juel Jensen wrote:
> > This creates another problem for me. I am running PDFBox in a kubernetes
> > cluster on premises with limited resources. I can not setup persistent
> > volume claims nor ephemeral volumes, and I can not change how my pods are
> > started. I have limited resources and an emptyDir that is mounted on /tmp
> > where the temporary files go. The emptyDir is mapped to a portion of the
> > kubernetes node's memory, and this memory is shared with many other
> > services. All in all - I need to keep a very low memory and tempFile
> > footprint, hence the InputStream. Using RandomAccessReadBuffer with an
> > InputStream loads the entire PDF into memory, and I can encounter PDF
> > documents that can be over 1GB in size. So loading everything into memory
> > is not an option.
>
> You can try to create your own class extending RandomAccessRead.
>
> If your /tmp is mapped on main memory, then it doesn't make sense to use
> a temp file at all, you're just wasting time.
>
> Btw PDFBox 2 was also loading the whole PDF file into memory (or into a
> scratch file) and had an even bigger footprint because it was also
> parsing the complete PDF. So if your project was working with PDFBox 2
> then it should work with PDFBox 3.
>
> Tilman

Re: Loading a PDF using InputStream

Posted by Tilman Hausherr <TH...@t-online.de>.
On 31.01.2024 14:48, Lars Juel Jensen wrote:
> This creates another problem for me. I am running PDFBox in a kubernetes
> cluster on premises with limited resources. I can not setup persistent
> volume claims nor ephemeral volumes, and I can not change how my pods are
> started. I have limited resources and an emptyDir that is mounted on /tmp
> where the temporary files go. The emptyDir is mapped to a portion of the
> kubernetes node's memory, and this memory is shared with many other
> services. All in all - I need to keep a very low memory and tempFile
> footprint, hence the InputStream. Using RandomAccessReadBuffer with an
> InputStream loads the entire PDF into memory, and I can encounter PDF
> documents that can be over 1GB in size. So loading everything into memory
> is not an option.

You can try to create your own class implementing RandomAccessRead.

If your /tmp is mapped to main memory, then it doesn't make sense to use
a temp file at all; you're just wasting time.

Btw, PDFBox 2 was also loading the whole PDF file into memory (or into a
scratch file) and had an even bigger footprint because it was also
parsing the complete PDF. So if your project was working with PDFBox 2,
then it should work with PDFBox 3.

Tilman
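
As a rough idea of what a custom random-access reader could look like, the
sketch below keeps only one fixed-size window of a file in memory and
refills it on demand. It uses only the JDK and does not implement PDFBox's
RandomAccessRead; the class name WindowedFileReader and its method names
are assumptions for illustration. It also assumes a seekable source (a
file), so it does not by itself solve the pure InputStream case.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical reader: random access over a file with one fixed-size buffer. */
public class WindowedFileReader implements Closeable {
    private final FileChannel channel;
    private final ByteBuffer window;
    private long windowStart = -1; // file offset of the first buffered byte, -1 if empty
    private int windowLength = 0;  // number of valid bytes in the window
    private long position = 0;     // current read position in the file

    public WindowedFileReader(Path file, int windowSize) throws IOException {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
        this.window = ByteBuffer.allocate(windowSize);
    }

    public long length() throws IOException {
        return channel.size();
    }

    public long getPosition() {
        return position;
    }

    public void seek(long newPosition) {
        if (newPosition < 0) {
            throw new IllegalArgumentException("negative position");
        }
        position = newPosition;
    }

    /** Returns the next byte as 0..255, or -1 at end of file. */
    public int read() throws IOException {
        if (position >= channel.size()) {
            return -1;
        }
        boolean outside = windowStart < 0
                || position < windowStart
                || position >= windowStart + windowLength;
        if (outside) {
            fill(position);
            if (windowLength == 0) {
                return -1;
            }
        }
        int value = window.get((int) (position - windowStart)) & 0xFF;
        position++;
        return value;
    }

    /** Reloads the window so that it starts at the given file offset. */
    private void fill(long start) throws IOException {
        window.clear();
        int total = 0;
        // Positional reads do not move the channel's own position.
        while (window.hasRemaining()) {
            int n = channel.read(window, start + total);
            if (n <= 0) {
                break;
            }
            total += n;
        }
        windowStart = start;
        windowLength = total;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Heap use here is bounded by windowSize regardless of the file size, at the
cost of a refill whenever a read lands outside the current window.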





Re: Loading a PDF using InputStream

Posted by Lars Juel Jensen <la...@gmail.com>.
This creates another problem for me. I am running PDFBox in a Kubernetes
cluster on premises with limited resources. I cannot set up persistent
volume claims or ephemeral volumes, and I cannot change how my pods are
started. I have limited resources and an emptyDir that is mounted on /tmp
where the temporary files go. The emptyDir is mapped to a portion of the
Kubernetes node's memory, and this memory is shared with many other
services. All in all, I need to keep a very low memory and temp-file
footprint, hence the InputStream. Using RandomAccessReadBuffer with an
InputStream loads the entire PDF into memory, and I can encounter PDF
documents that are over 1 GB in size. So loading everything into memory
is not an option.

On Wed, Jan 31, 2024 at 10:10 AM Tilman Hausherr <TH...@t-online.de>
wrote:

> On 31.01.2024 09:50, Lars Juel Jensen wrote:
> > In PDFBox2 I could do:
> >
> > PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())
> >
> > But there is no equivalent to this in PDFBox3. How do I read a PDF from an
> > inputstream?
> >
>
> Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
>         IOUtils.createTempFileOnlyStreamCache());
>

Re: Loading a PDF using InputStream

Posted by Tilman Hausherr <TH...@t-online.de>.
On 31.01.2024 09:50, Lars Juel Jensen wrote:
> In PDFBox2 I could do:
>
> PDDocument.load(inputStream, MemoryUsageSetting.setupTempFileOnly())
>
> But there is no equivalent to this in PDFBox3. How do I read a PDF from an
> inputstream?
>

Loader.loadPDF(new RandomAccessReadBuffer(inputStream),
        IOUtils.createTempFileOnlyStreamCache());
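
As reported elsewhere in this thread, wrapping an InputStream in
RandomAccessReadBuffer holds the stream's content in memory. One possible
alternative, assuming a disk-backed directory is available (which is not
the case for every deployment), is to copy the stream to a file first and
open the file instead. The sketch below shows only the JDK part; the class
name SpoolToFile and the method spool are hypothetical, and the PDFBox call
is mentioned in a comment rather than executed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToFile {
    /**
     * Copies the stream into a new file inside targetDir. Files.copy streams
     * the data through a small internal buffer, so heap use does not grow
     * with the size of the input.
     */
    public static Path spool(InputStream in, Path targetDir) throws IOException {
        Path file = Files.createTempFile(targetDir, "spooled-", ".pdf");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING);
            return file;
        } catch (IOException e) {
            Files.deleteIfExists(file);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spool-demo");
        byte[] placeholder = "%PDF-1.7 placeholder".getBytes(StandardCharsets.US_ASCII);
        Path file = spool(new ByteArrayInputStream(placeholder), dir);
        System.out.println("spooled " + Files.size(file) + " bytes to " + file);
        // With PDFBox 3 on the classpath, the file could then be opened with
        // Loader.loadPDF(file.toFile()) instead of buffering the stream.
        Files.delete(file);
        Files.delete(dir);
    }
}
```

The caller remains responsible for deleting the spooled file once the
document has been closed.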