Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2015/07/18 20:39:04 UTC

[jira] [Created] (PDFBOX-2893) Simplify COSStream encoding and decoding

John Hewson created PDFBOX-2893:
-----------------------------------

             Summary: Simplify COSStream encoding and decoding
                 Key: PDFBOX-2893
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
             Project: PDFBox
          Issue Type: Improvement
    Affects Versions: 2.0.0
            Reporter: John Hewson
            Assignee: John Hewson
            Priority: Blocker
             Fix For: 2.0.0


Performance and memory usage issues surrounding streams are among the few things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882, PDFBOX-2883).

Though we've managed to reduce some of the memory used by RandomAccessBuffer and to take advantage of buffering of scratch files, we still have problems with the amount of memory which COSStream holds onto. Changes introduced in 2.0 have left COSStream with a very complex relationship to classes which themselves hold a lot of memory in complex ways. Access to scratch file pages in particular does not seem to be well regulated, especially with regard to multithreading (an avenue we'd at least like to leave open).

Given recent flux, I'm doubtful that we can ship the current API for COSStream w.r.t. RandomAccess without shipping performance issues or flaws which will be unfixable without breaking changes.

One of the recent changes to COSStream is that it now exposes a RandomAccess. This is so that PDFStreamParser can parse content streams (and so that the other parser subclasses, which handle xref and object streams, can do the same). However, streams are fundamentally not random access: stream filters are sequential. While the consumer of a stream may wish to buffer the data (in memory or scratch) for random access, COSStream itself does not need to expose such an elaborate API. Many pieces of gymnastics are performed inside COSStream to present this illusion, at significant cost. We should remove that.

But what about providing a RandomAccess for PDFStreamParser, PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those classes don't actually perform random I/O. They perform sequential I/O with a buffer for peek/unread.
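To illustrate what "sequential I/O with a buffer for peek/unread" means in practice, here is a minimal sketch using the JDK's java.io.PushbackInputStream. The class PeekUnreadDemo and its methods are hypothetical and are not taken from PDFBox; the sketch only shows that one byte of lookahead is enough to tokenize without seeking.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
class PeekUnreadDemo {
    // Returns the next byte without consuming it, or -1 at end of stream.
    static int peek(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b != -1) {
            in.unread(b);
        }
        return b;
    }

    // Reads a run of ASCII digits, stopping before the first non-digit.
    static String readDigits(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b < '0' || b > '9') {
                in.unread(b); // push the delimiter back; no seek is needed
                break;
            }
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "12 0 obj".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 1);
        System.out.println(readDigits(in));
        System.out.println(peek(in) == ' ');
    }
}
```

The read position only ever moves forward, apart from the single byte handed back by unread. A real parser may need a larger pushback buffer than the one byte assumed here.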

We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I think we should do:

1. Split the interfaces for sequential and random I/O
- Introduce a new SequentialSource interface for sequential I/O, with wrappers for RandomAccessRead and InputStream.
- BaseParser will use SequentialSource rather than RandomAccessRead (this will be inherited by PDFStreamParser, PDFObjectStreamParser, and PDFXrefStreamParser).
- COSParser will use RandomAccessRead and pass a SequentialSource wrapper to its superclass, BaseParser.

2. Remove the RandomAccess APIs from COSStream and expose only InputStream and OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser using a wrapper which implements SequentialSource.
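
As a rough sketch of the proposal above, the interface and the InputStream wrapper might look something like the following. Only the name SequentialSource comes from this issue; the method names, the InputStreamSource class, and the pushback capacity of 64 are assumptions made for illustration, and the final API may well differ.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical shape for the proposed interface; method names are assumptions.
interface SequentialSource extends Closeable {
    int read() throws IOException;         // next byte, or -1 at end of data
    int peek() throws IOException;         // next byte without consuming it
    void unread(int b) throws IOException; // push one byte back
    boolean isEOF() throws IOException;    // true once no bytes remain
}

// Hypothetical wrapper adapting any InputStream to SequentialSource.
final class InputStreamSource implements SequentialSource {
    private final PushbackInputStream in;

    InputStreamSource(InputStream input) {
        // Assumed pushback capacity; a real parser would size this to its needs.
        this.in = new PushbackInputStream(input, 64);
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int peek() throws IOException {
        int b = in.read();
        if (b != -1) {
            in.unread(b);
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException {
        in.unread(b);
    }

    @Override
    public boolean isEOF() throws IOException {
        return peek() == -1;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

With something along these lines, a parser would depend only on the narrow sequential interface, and COSStream would need to hand out nothing more than a plain InputStream.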



