Posted to dev@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2022/08/01 18:20:40 UTC

Re: Replace methods using an InputStream from Loader.loadPDF

+1 but
- the explanation below (when to use which class) should be in the javadoc
- the removal should be in the migration guide

Tilman

On 31.07.2022 at 15:18, Andreas Lehmkuehler wrote:
> Hi fellow devs,
>
>
> there was a discussion on JIRA [1] about the changed behaviour of the 
> parser due to the removal of the ScratchFileBuffer when reading a pdf.
>
> Additionally, there was the post "High memory usage with pdfbox 3" on
> users@pdfbox targeting the very same topic.
>
> After explaining myself and my changes twice, I came to the conclusion
> that I will have to do so again and again in the future if we don't
> change the API of Loader.loadPDF.
>
> People simply notice that all methods for loading a pdf have been
> moved from PDDocument to Loader. They expect the very same behaviour
> when using a similar API, and that is understandable from a user's
> point of view.
>
> We have to remove the loadPDF variants using InputStream and replace 
> them with RandomAccessRead.
>
> When it comes to InputStreams, users have to decide how to proceed:
> * copy the InputStream to memory by using RandomAccessReadBuffer
> * copy the InputStream to a file and use RandomAccessReadBufferedFile
> or RandomAccessReadMemoryMappedFile
>
> This would make it more transparent what happens under the hood when 
> using the different kinds of loadPDF methods:
>
> * a byte array as source is already in memory, so the obvious choice
> is to use RandomAccessReadBuffer as a wrapper
> * a file as source targets a local file, and the most obvious choice is
> to use RandomAccessReadBufferedFile as a wrapper. We should document
> that RandomAccessReadMemoryMappedFile is offered as an alternative in
> this case
> * RandomAccessRead as source is the most obvious one, and the user
> decides how to create it. Additionally, it is possible to implement
> a custom caching and/or loading mechanism
>
> I know this will lead to some changes in the codebase of our users,
> but they have to change their code in any case because the method was
> moved, so why not change the data type as well?
>
>
> WDYT? Am I missing something?
>
> Andreas
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-5462
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
>
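
The two InputStream options named in the quoted proposal (copy to memory, or copy to a file) can be sketched with the JDK alone. This is only an illustration under stated assumptions: the class name StreamSpooling and both method names are hypothetical, nothing here is PDFBox API, and the comments only point to the wrapper classes that the message itself names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox: it only prepares the data that
// one of the RandomAccessRead wrappers named in the message would consume.
final class StreamSpooling {
    private StreamSpooling() {
    }

    // Option 1: hold the whole stream in memory. Per the proposal, the
    // result would be wrapped in RandomAccessReadBuffer. Memory use grows
    // with the size of the document.
    static byte[] copyToMemory(InputStream in) throws IOException {
        return in.readAllBytes(); // requires Java 9 or later
    }

    // Option 2: spool the stream to a temporary file. Per the proposal, the
    // file would be wrapped in RandomAccessReadBufferedFile or
    // RandomAccessReadMemoryMappedFile. The caller deletes the file later.
    static Path copyToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdf-input-", ".pdf");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
        return tmp;
    }
}
```

Either way the copy becomes an explicit step in user code, which is the transparency the proposal argues for: the caller decides whether to pay in heap or in disk space.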



Re: Replace methods using an InputStream from Loader.loadPDF

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 01.08.22 at 20:20, Tilman Hausherr wrote:
> +1 but
> - the explanation below (when to use which class) should be in the javadoc
> - the removal should be in the migration guide
It is already on my TODO list

Andreas
