Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2015/07/18 21:54:04 UTC
[jira] [Commented] (PDFBOX-2893) Simplify COSStream encoding and decoding

    [ https://issues.apache.org/jira/browse/PDFBOX-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632584#comment-14632584 ] 

Tilman Hausherr commented on PDFBOX-2893:
-----------------------------------------

Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project pdfbox: Compilation failure: Compilation failure:
org/apache/pdfbox/pdfparser/BaseParser.java:[115,21] cannot find symbol
symbol:   class SequentialSource
location: class org.apache.pdfbox.pdfparser.BaseParser
org/apache/pdfbox/pdfparser/BaseParser.java:[125,23] cannot find symbol
symbol:   class SequentialSource
location: class org.apache.pdfbox.pdfparser.BaseParser
org/apache/pdfbox/pdfparser/PDFStreamParser.java:[343,43] cannot find symbol
symbol:   class SequentialSource
location: class org.apache.pdfbox.pdfparser.PDFStreamParser
org/apache/pdfbox/pdfparser/PDFObjectStreamParser.java:[55,19] cannot find symbol
symbol:   class InputStreamSource
location: class org.apache.pdfbox.pdfparser.PDFObjectStreamParser
org/apache/pdfbox/pdfparser/COSParser.java:[164,19] cannot find symbol
symbol:   class RandomAccessSource
location: class org.apache.pdfbox.pdfparser.COSParser
org/apache/pdfbox/pdfparser/PDFXrefStreamParser.java:[56,19] cannot find symbol
symbol:   class InputStreamSource
location: class org.apache.pdfbox.pdfparser.PDFXrefStreamParser
org/apache/pdfbox/pdfparser/PDFStreamParser.java:[75,19] cannot find symbol
symbol:   class InputStreamSource
location: class org.apache.pdfbox.pdfparser.PDFStreamParser
org/apache/pdfbox/pdfparser/PDFStreamParser.java:[87,19] cannot find symbol
symbol:   class InputStreamSource
location: class org.apache.pdfbox.pdfparser.PDFStreamParser

> Simplify COSStream encoding and decoding
> ----------------------------------------
>
>                 Key: PDFBOX-2893
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2893
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: John Hewson
>            Assignee: John Hewson
>            Priority: Blocker
>             Fix For: 2.0.0
>
>         Attachments: PDFBOX-2893-1.patch
>
>
> Performance and memory usage issues surrounding streams are among the few things blocking the release of 2.0 (see PDFBOX-2301, PDFBOX-2882, PDFBOX-2883).
> Though we've managed to reduce some of the memory used by RandomAccessBuffer and to take advantage of buffering of scratch files, we still have problems with the amount of memory that COSStream holds onto. Changes introduced in 2.0 have left COSStream with a very complex relationship to classes which hold a lot of memory in complex ways (e.g. the fields tempBuffer, filteredBuffer, unfilteredBuffer, filteredStream, unFilteredStream, scratchFile). Access to scratch file pages in particular does not seem to be well regulated, especially with regard to multithreading (an avenue we'd at least like to leave open).
> Given recent flux, I'm doubtful that we can ship the current API for COSStream w.r.t. RandomAccess without shipping performance issues or flaws which will be unfixable without breaking changes.
> One of the recent changes to COSStream is that it now exposes a RandomAccess so that PDFStreamParser can parse content streams (and so that the other subclasses can handle xref and object streams). However, streams are fundamentally not random access: stream filters are sequential. While the consumer of a stream may wish to buffer the data (in memory or scratch) for random access, COSStream itself does not need to expose such an elaborate API. Many pieces of gymnastics are performed inside COSStream to present this illusion, at significant cost. We should remove that.
> But what about providing a RandomAccess for PDFStreamParser, PDFObjectStreamParser, and PDFXrefStreamParser? It turns out that those classes don't actually perform random I/O. They perform sequential I/O with a buffer for peek/unread.
> We need to simplify to get 2.0 fast, lean, and maintainable. Here's what I think we should do:
> 1. Split the interfaces for sequential and random I/O
> - Introduce a new SequentialSource interface for sequential I/O, with thin wrappers for RandomAccessRead and InputStream.
> - BaseParser will use SequentialSource rather than RandomAccessRead (this will be inherited by PDFStreamParser, PDFObjectStreamParser, and PDFXrefStreamParser).
> - COSParser will use RandomAccessRead and pass a SequentialSource wrapper to its superclass, BaseParser.
> 2. Remove RandomAccess APIs from COSStream, expose only InputStream and OutputStream, as we used to do. We can pass an InputStream to PDFStreamParser using a wrapper which implements SequentialSource. This will remove tempBuffer, filteredBuffer, and unfilteredBuffer from COSStream, all of which hold memory.
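For readers following along without the patch applied, the proposal above can be sketched roughly as follows. This is only an illustration: the names SequentialSource and InputStreamSource come from the issue text and the compiler output, but the method set shown here (read, peek, unread, isEOF), the 32-byte pushback size, and the readToken helper are assumptions of mine and may not match PDFBOX-2893-1.patch. The wrapper leans on java.io.PushbackInputStream for the small peek/unread buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical method set; the interface in the attached patch may differ.
interface SequentialSource extends Closeable {
    int read() throws IOException;          // next byte, or -1 at end of data
    int peek() throws IOException;          // next byte without consuming it
    void unread(int b) throws IOException;  // push one byte back
    boolean isEOF() throws IOException;     // true when no bytes remain
}

// Thin wrapper over an InputStream. PushbackInputStream supplies the
// small buffer needed for peek/unread; no random access is involved.
final class InputStreamSource implements SequentialSource {
    private final PushbackInputStream in;

    InputStreamSource(InputStream input) {
        // 32 bytes of pushback is an arbitrary choice for this sketch
        this.in = new PushbackInputStream(input, 32);
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }

    @Override
    public int peek() throws IOException {
        int b = in.read();
        if (b != -1) {
            in.unread(b);
        }
        return b;
    }

    @Override
    public void unread(int b) throws IOException {
        in.unread(b);
    }

    @Override
    public boolean isEOF() throws IOException {
        return peek() == -1;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

public class SequentialSourceDemo {
    // Reads one space-delimited token using only sequential reads and peek.
    static String readToken(SequentialSource src) throws IOException {
        StringBuilder sb = new StringBuilder();
        while (src.peek() == ' ') {
            src.read();
        }
        while (!src.isEOF() && src.peek() != ' ') {
            sb.append((char) src.read());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "q 1 0 0 1 re".getBytes(StandardCharsets.US_ASCII);
        try (SequentialSource src = new InputStreamSource(new ByteArrayInputStream(data))) {
            System.out.println(readToken(src));
            System.out.println(readToken(src));
        }
    }
}
```

A parser written against an interface like this never needs to seek, so a COSStream that exposes only an InputStream could feed it directly. A second wrapper around RandomAccessRead would let COSParser hand the same interface to BaseParser.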



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org