Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2022/10/30 12:18:00 UTC

[jira] [Commented] (PDFBOX-5483) Replace methods using an InputStream from Loader.loadPDF

    [ https://issues.apache.org/jira/browse/PDFBOX-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626221#comment-17626221 ] 

Andreas Lehmkühler commented on PDFBOX-5483:
--------------------------------------------

[~mkl] I totally understand your point of view, but sorry, I don't like it. Let me explain why.

The code of {{org.apache.pdfbox.io}} is still a work in progress. Most likely the following features will be added in the near future:
* setting the buffer size for {{org.apache.pdfbox.io.RandomAccessReadBuffer}}
* setting the buffer size for {{org.apache.pdfbox.io.RandomAccessReadBufferedFile}}
* paging support for memory-mapped files, so that we might want to set the buffer size for {{org.apache.pdfbox.io.RandomAccessReadMemoryMappedFile}} as well
* I'm thinking of a replacement for the current implementation and usage of {{org.apache.pdfbox.io.ScratchFile}}. Something that isn't buried somewhere in {{org.apache.pdfbox.cos}}

Maybe there will be some other implementations of {{org.apache.pdfbox.io.RandomAccessRead}} and I'm pretty sure there are other things I can't imagine now.

However, if the code is located somewhere in the parser and/or loader, all of those modifications require changes within the parser/loader code and, depending on the kind of change, different method signatures. IMHO that code should not be responsible for managing the source of the data. That stuff belongs to {{org.apache.pdfbox.io}}.

That said, if someone wants to provide some convenience code, it should be added somewhere within {{org.apache.pdfbox.io}}.
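
The separation argued for above could be sketched roughly as follows. This is only an illustration using the plain JDK: {{Source}}, {{InMemorySource}} and {{HeaderCheck}} are hypothetical names, with {{Source}} standing in for {{org.apache.pdfbox.io.RandomAccessRead}}; none of it is PDFBox code. The point is that the consuming side sees only the abstraction, so adding, say, a buffer-size option to an implementation does not change any signature on the parser/loader side.

```java
import java.io.IOException;

// Hypothetical stand-in for org.apache.pdfbox.io.RandomAccessRead,
// reduced to the two methods this illustration needs.
interface Source
{
    long length() throws IOException;

    // Returns the unsigned byte at the given position, or -1 outside the data.
    int read(long position) throws IOException;
}

// I/O side: everything about where the bytes live is decided here.
// New options (buffer size, paging, ...) would only touch this class.
class InMemorySource implements Source
{
    private final byte[] data;

    InMemorySource(byte[] data)
    {
        this.data = data.clone();
    }

    @Override
    public long length()
    {
        return data.length;
    }

    @Override
    public int read(long position)
    {
        if (position < 0 || position >= data.length)
        {
            return -1;
        }
        return data[(int) position] & 0xFF;
    }
}

// Parser/loader side: depends on the abstraction only, never on how
// the source was created or configured.
class HeaderCheck
{
    static boolean looksLikePdf(Source source) throws IOException
    {
        byte[] magic = { '%', 'P', 'D', 'F', '-' };
        if (source.length() < magic.length)
        {
            return false;
        }
        for (int i = 0; i < magic.length; i++)
        {
            if (source.read(i) != magic[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

A caller would build the source first and hand it over, for example {{HeaderCheck.looksLikePdf(new InMemorySource(bytes))}}; a file-backed implementation could be swapped in without touching {{HeaderCheck}}.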



> Replace methods using an InputStream from Loader.loadPDF
> --------------------------------------------------------
>
>                 Key: PDFBOX-5483
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5483
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>             Fix For: 3.0.0 PDFBox
>
>
> As discussed on dev@pdfbox
> {quote}
> We have to remove the loadPDF variants using InputStream and replace them with RandomAccessRead.
> When it comes to InputStreams, users have to decide how to proceed:
> * copy the InputStream to memory by using RandomAccessReadBuffer
> * copy the InputStream to a file and use RandomAccessReadBufferedFile or RandomAccessReadMemoryMappedFile
> This would make it more transparent what happens under the hood when using the different kinds of loadPDF methods:
> * a byte array as source is already in memory and the obvious choice is to use RandomAccessReadBuffer as a wrapper
> * a file as source targets a local file, and the most obvious choice is to use RandomAccessReadBufferedFile as a wrapper. We should document that RandomAccessReadMemoryMappedFile is offered as an alternative in this case
> * RandomAccessRead as source is the most obvious one, and the user decides how to create it. Additionally, it is possible to implement a custom loading and/or caching mechanism
> {quote}
> see PDFBOX-5462 and [High memory usage with pdfbox 3|https://lists.apache.org/thread/6mmgp23v8b2yztj4hghkgkd14s1gzs8g] as well
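
The two InputStream options from the quoted description could look roughly like this on the caller's side. It is a minimal sketch that uses only the JDK so it runs without PDFBox on the classpath; the class and method names are hypothetical and not part of PDFBox. The comments indicate where, assuming the PDFBox 3.x API, the result would be wrapped in a RandomAccessRead implementation before being handed to Loader.loadPDF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical caller-side helper, not part of PDFBox.
class InputStreamSources
{
    // Option 1: copy the whole stream into memory.
    // Assumption: with PDFBox 3.x the bytes would then be wrapped in a
    // RandomAccessReadBuffer before calling Loader.loadPDF.
    static byte[] copyToMemory(InputStream in) throws IOException
    {
        return in.readAllBytes(); // requires Java 9 or later
    }

    // Option 2: copy the stream to a temporary file.
    // Assumption: with PDFBox 3.x the file would then be wrapped in a
    // RandomAccessReadBufferedFile or RandomAccessReadMemoryMappedFile.
    // The caller is responsible for deleting the file afterwards.
    static Path copyToTempFile(InputStream in) throws IOException
    {
        Path tmp = Files.createTempFile("pdf-source-", ".pdf");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

Option 1 trades heap usage for speed and simplicity; option 2 keeps the heap small at the cost of disk I/O and temp-file cleanup. Making that choice visible to the caller is the transparency the description asks for.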


