Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Jira)" <ji...@apache.org> on 2022/10/17 06:28:00 UTC

[jira] [Commented] (PDFBOX-4542) Suggestion: Don't load large streams completely into memory, reference them instead

    [ https://issues.apache.org/jira/browse/PDFBOX-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618676#comment-17618676 ] 

Andreas Lehmkühler commented on PDFBOX-4542:
--------------------------------------------

PDFBOX-4659 implements an on-demand parser, which should help. It is part of the upcoming 3.0.0 release, and several alpha versions are already available for testing.

> Suggestion: Don't load large streams completely into memory, reference them instead
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4542
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4542
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Priority: Minor
>              Labels: memory, performance
>
> As we processed large PDF files, many of which contain large image streams, we wanted to avoid loading the entire streams into memory. Instead, we implemented a mechanism that merely references their location on disk.
> We eventually did this by subclassing COSStream and then overriding COSParser.parseCOSStream(COSDictionary) to conditionally create our stream. Here is the code; it is still a work in progress, as I have just refactored the entire mechanism.
> {code:java}
> public class ReferencedCOSStream
>    extends COSStream
> {
>    //~ Instance members ------------------------------------------------------------------------------------------------------------------------------
>    boolean isReference = false;
>    File    reference   = null;
>    long    offset      = -1;
>    long    length      = -1;
>    //~ Constructors ----------------------------------------------------------------------------------------------------------------------------------
>    private ReferencedCOSStream(final ScratchFile scratchFile)
>    {
>       super(scratchFile);
>    }
>    //~ Methods ---------------------------------------------------------------------------------------------------------------------------------------
>    public static ReferencedCOSStream createFromCOSStream(final COSStream stream)
>    {
>       final ReferencedCOSStream out = new ReferencedCOSStream(stream.getScratchFile());
>       for (final Map.Entry<COSName, COSBase> entry : stream.entrySet())
>       {
>          out.setItem(entry.getKey(), entry.getValue());
>       }
>       return out;
>    }
>    @Override
>    public COSInputStream createInputStream(final DecodeOptions options)
>       throws IOException
>    {
>       if (this.isReference)
>       {
>          final InputStream in = new SlicedFileInputStream(this.reference, this.offset, this.length);
>          return COSInputStream.create(getFilterList(), this, in, this.getScratchFile(), options);
>       }
>       else
>       {
>          return super.createInputStream(options);
>       }
>    }
>    @Override
>    public InputStream createRawInputStream()
>       throws IOException
>    {
>       if (this.isReference)
>       {
>          return new SlicedFileInputStream(this.reference, this.offset, this.length);
>       }
>       else
>       {
>          return super.createRawInputStream();
>       }
>    }
>    @Override
>    public OutputStream createOutputStream(final COSBase filters)
>       throws IOException
>    {
>       this.isReference = false;
>       return super.createOutputStream(filters);
>    }
>    @Override
>    public OutputStream createRawOutputStream()
>       throws IOException
>    {
>       this.isReference = false;
>       return super.createRawOutputStream();
>    }
>    public void setReference(final File file,
>                             final long offset,
>                             final long length)
>    {
>       this.isReference = true;
>       this.reference   = file;
>       this.offset      = offset;
>       this.length      = length;
>       this.setLong(COSName.LENGTH, length);
>    }
>    //~ Inner Classes ---------------------------------------------------------------------------------------------------------------------------------
>    private class SlicedFileInputStream
>       extends FileInputStream
>    {
>       //~ Instance members ---------------------------------------------------------------------------------------------------------------------------
>       private long       index;
>       private final long length;
>       //~ Constructors -------------------------------------------------------------------------------------------------------------------------------
>       public SlicedFileInputStream(final File file,
>                                    final long offset,
>                                    final long length)
>          throws FileNotFoundException, IOException
>       {
>          super(file);
>          this.length = length;
>          this.skip(offset);
>          this.index = 0;
>       }
>       //~ Methods ------------------------------------------------------------------------------------------------------------------------------------
>       @Override
>       public int available()
>          throws IOException
>       {
>          final long remaining = length - index;
>          if (remaining < 0)
>          {
>             return 0;
>          }
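>          // Clamp so that slices larger than 2 GiB do not overflow the int cast.
>          return (int)Math.min(remaining, Integer.MAX_VALUE);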
>       }
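>       @Override
>       public int read()
>          throws IOException
>       {
>          // Without this override, single-byte reads would run past the end of the slice.
>          if (this.available() <= 0)
>          {
>             return -1;
>          }
>          final int value = super.read();
>          if (value >= 0)
>          {
>             index++;
>          }
>          return value;
>       }
>       @Override
>       public int read(final byte[] b)
>          throws IOException
>       {
>          // Delegate so that the slice bounds are enforced in one place.
>          return this.read(b, 0, b.length);
>       }
>       @Override
>       public int read(final byte[] b,
>                       final int    off,
>                       final int    len)
>          throws IOException
>       {
>          if (len == 0)
>          {
>             return 0;
>          }
>          final int toRead = Math.min(this.available(), len);
>          if (toRead <= 0)
>          {
>             return -1;
>          }
>          // Honour the caller's offset and count only the bytes actually read.
>          final int read = super.read(b, off, toRead);
>          if (read > 0)
>          {
>             index += read;
>          }
>          return read;
>       }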
>       @Override
>       public long skip(final long n)
>          throws IOException
>       {
>          // Count only the bytes that were actually skipped.
>          final long skipped = super.skip(n);
>          index += skipped;
>          return skipped;
>       }
>       @Override
>       public FileChannel getChannel()
>       {
>          throw new UnsupportedOperationException("Obtaining a FileChannel is not supported because a correct offset cannot be ensured.");
>       }
>    }
> }
> {code}
> {code:java}
>    @Override
>    protected COSStream parseCOSStream(final COSDictionary dic)
>       throws IOException
>    {
>       /*
>        * This needs to be dic.getItem because when we are parsing, the underlying object might still be null.
>        */
>       final COSNumber streamLengthObj = getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE));
>       COSStream       stream          = document.createCOSStream(dic);
>       // read 'stream'; this was already tested in parseObjectsDynamically()
>       readString();
>       skipWhiteSpaces();
>       if (streamLengthObj == null)
>       {
>          if (isLenient)
>          {
>             LOG.warn("The stream doesn't provide any stream length, using fallback readUntilEnd, at offset " + source.getPosition());
>          }
>          else
>          {
>             throw new IOException("Missing length for stream.");
>          }
>       }
>       if ((streamLengthObj != null) && (streamLengthObj.longValue() >= 1024))
>       {
>          final long                streamBegPos = source.getPosition();
>          final ReferencedCOSStream refStream    = ReferencedCOSStream.createFromCOSStream(stream);
>          try
>          {
>             readValidStream(null, streamLengthObj);
>          }
>          finally
>          {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>          }
>          refStream.setReference(new File(reference), streamBegPos, source.getPosition() - streamBegPos);
>          stream = refStream;
>       }
>       else
>       {
>          try(final OutputStream out = stream.createRawOutputStream())
>          {
>             if ((streamLengthObj != null) && validateStreamLength(streamLengthObj.longValue()))
>             {
>                readValidStream(out, streamLengthObj);
>             }
>             else
>             {
>                readUntilEndStream(new EndstreamOutputStream(out));
>             }
>          }
>          finally
>          {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>          }
>       }
>       final String endStream = readString();
>       if (endStream.equals("endobj") && isLenient)
>       {
>          LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset " + source.getPosition());
>          // avoid follow-up warning about missing endobj
>          source.rewind(ENDOBJ.length);
>       }
>       else if ((endStream.length() > 9) && isLenient && endStream.substring(0, 9).equals(ENDSTREAM_STRING))
>       {
>          LOG.warn("stream ends with '" + endStream + "' instead of 'endstream' at offset " + source.getPosition());
>          // unread the "extra" bytes
>          source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length);
>       }
>       else if (!endStream.equals(ENDSTREAM_STRING))
>       {
>          throw new IOException("Error reading stream, expected='endstream' actual='" + endStream + "' at offset " + source.getPosition());
>       }
>       return stream;
>    }
> {code}
> The class ReferencedCOSStream exposes the underlying data in exactly the same way as COSStream does, but instead of keeping the content in memory, it opens a FileInputStream each time the content is requested. SlicedFileInputStream wraps a FileInputStream and imitates the behaviour of an InputStream over just this chunk of the file.
> I needed to expose some APIs for these classes. The method ReferencedCOSStream.createFromCOSStream(COSStream) would be better located in PDDocument, where it could create the stream directly; I just didn't want to modify PDDocument as well.
> Encrypted streams are currently loaded into memory by the SecurityHandler directly after creation. If you want to accept this proposal, it might make sense to move the decryption handling into COSStream and ReferencedCOSStream as well and perform it on request.
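
The slicing idea from the quoted description can be illustrated without PDFBox. The following is only a minimal sketch, not the reporter's code and not part of the PDFBox API: the class name BoundedSliceInputStream is hypothetical, and it assumes a RandomAccessFile is acceptable in place of FileInputStream so that seek() can position exactly at the stream offset. It exposes the bytes [offset, offset + length) of a file as an InputStream and never reads past the end of that range.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper: exposes bytes [offset, offset + length) of a file as an InputStream. */
final class BoundedSliceInputStream extends InputStream
{
   private final RandomAccessFile file;
   private long remaining; // bytes of the slice that have not been consumed yet

   BoundedSliceInputStream(final File source, final long offset, final long length)
      throws IOException
   {
      this.file = new RandomAccessFile(source, "r");
      try
      {
         // seek() positions exactly, whereas skip() may skip fewer bytes than requested.
         this.file.seek(offset);
      }
      catch (final IOException e)
      {
         this.file.close();
         throw e;
      }
      this.remaining = length;
   }

   @Override
   public int read() throws IOException
   {
      if (remaining <= 0)
      {
         return -1;
      }
      final int value = file.read();
      if (value >= 0)
      {
         remaining--;
      }
      return value;
   }

   @Override
   public int read(final byte[] b, final int off, final int len) throws IOException
   {
      if (len == 0)
      {
         return 0;
      }
      if (remaining <= 0)
      {
         return -1;
      }
      // Never request more than what is left in the slice.
      final int read = file.read(b, off, (int) Math.min(len, remaining));
      if (read > 0)
      {
         remaining -= read;
      }
      return read;
   }

   @Override
   public long skip(final long n) throws IOException
   {
      if (n <= 0 || remaining <= 0)
      {
         return 0;
      }
      final long toSkip = Math.min(n, remaining);
      file.seek(file.getFilePointer() + toSkip);
      remaining -= toSkip;
      return toSkip;
   }

   @Override
   public int available()
   {
      return (int) Math.min(Math.max(remaining, 0), Integer.MAX_VALUE);
   }

   @Override
   public void close() throws IOException
   {
      file.close();
   }

   public static void main(final String[] args) throws IOException
   {
      final File tmp = File.createTempFile("slice", ".bin");
      tmp.deleteOnExit();
      Files.write(tmp.toPath(), "0123456789".getBytes(StandardCharsets.US_ASCII));
      try (InputStream in = new BoundedSliceInputStream(tmp, 3, 4))
      {
         // readAllBytes() requires Java 9 or later.
         System.out.println(new String(in.readAllBytes(), StandardCharsets.US_ASCII)); // prints "3456"
      }
   }
}
```

Tracking the remaining byte count rather than an index keeps every read path bounded by the same counter, and each open stream holds its own file handle, so callers have to close it.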


