Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2013/10/01 06:09:25 UTC

Re: [compress] Do we want 7z Archive*Stream-like classes

On 2013-09-30, Benedikt Ritter wrote:

> 2013/9/30 Stefan Bodewig <bo...@apache.org>

>> I'm in no way as familiar with the format as Damjan is, but IMHO it is
>> feasible - but likely pretty memory hungry.  Even more so for the
>> writing side.  Similar to zip, some information is stored in a central
>> place, but in this case at the front of the archive.

> Just out of curiosity: is this memory problem related to Java or to 7z in
> general?

What Bernd said.

Reading may be simpler: here you can store the meta-information from the
start of the file in memory and then read entries as you go.  ZipFile
inside the zip package does something like this.
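
In ZipFile terms a 7z read loop could look roughly like this (untested
sketch against the SevenZFile and SevenZArchiveEntry classes from
Damjan's sevenz package):

    import java.io.File;
    import java.io.IOException;
    import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
    import org.apache.commons.compress.archivers.sevenz.SevenZFile;

    public class SevenZReadSketch {
        public static void main(String[] args) throws IOException {
            // the constructor parses the archive's metadata up front
            SevenZFile archive = new SevenZFile(new File("test.7z"));
            try {
                SevenZArchiveEntry entry;
                while ((entry = archive.getNextEntry()) != null) {
                    if (entry.isDirectory()) {
                        continue;
                    }
                    // sizes are known from the metadata, so the buffer
                    // can be allocated before reading the content
                    byte[] content = new byte[(int) entry.getSize()];
                    int off = 0;
                    while (off < content.length) {
                        int n = archive.read(content, off,
                                             content.length - off);
                        if (n < 0) {
                            break;
                        }
                        off += n;
                    }
                    System.out.println(entry.getName() + ": " + off + " bytes");
                }
            } finally {
                archive.close();
            }
        }
    }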

When you consider writing, you'll have to write metadata about all
entries before you even start to write the first bytes of the first
entry.  Either you build up everything in memory or you use a temporary
output.  This is not without precedent in Compress: pack200 allows users
to choose between two strategies that provide exactly those two options.
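
For reference, the pack200 code exposes that choice like this (sketch;
Pack200Strategy with IN_MEMORY and TEMP_FILE is the existing enum):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.compress.compressors.pack200.Pack200CompressorInputStream;
    import org.apache.commons.compress.compressors.pack200.Pack200Strategy;

    public class Pack200StrategySketch {
        public static void main(String[] args) throws IOException {
            InputStream in = new FileInputStream("archive.pack");
            // IN_MEMORY buffers the intermediate result on the heap,
            // TEMP_FILE spools it to a temporary file instead
            Pack200CompressorInputStream unpacked =
                new Pack200CompressorInputStream(in, Pack200Strategy.TEMP_FILE);
            // ... read the unpacked JAR bytes from "unpacked" ...
            unpacked.close();
        }
    }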

Stefan


Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by "dam6923 ." <da...@gmail.com>.
> Since we now have multiple archivers that require seeking, I suggest
> we add a SeekableStream class or something along those lines. The
> Commons Imaging project also has the same problem to solve for images,
> and it uses ByteSources, which can be arrays, files, or an InputStream
> wrapper that caches what has been read (so seeking is efficient, while
> it only reads as much from the InputStream as is necessary).

I would also like to advocate for this approach.  I was looking into
writing a decompressor for Google's Snappy format, but was unable to
wrap it in an InputStream effectively.  Having a seekable stream would
make my efforts a better fit for this library.
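
To make the idea concrete, I imagine something along these lines
(hypothetical sketch, CachingSeekableStream is a made-up name, not an
existing Compress class):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // Hypothetical sketch: wraps an InputStream and caches everything
    // read so far, so earlier positions can be revisited cheaply.
    class CachingSeekableStream {
        private final InputStream in;
        private final ByteArrayOutputStream cache = new ByteArrayOutputStream();
        private long position;

        CachingSeekableStream(InputStream in) {
            this.in = in;
        }

        // reads one byte at the current position, pulling from the
        // underlying stream only when past the cached region
        int read() throws IOException {
            if (position < cache.size()) {
                // a real implementation would index the cache directly
                // instead of copying it on every call
                return cache.toByteArray()[(int) position++] & 0xff;
            }
            int b = in.read();
            if (b >= 0) {
                cache.write(b);
                position++;
            }
            return b;
        }

        // moves the read position; seeking forward past the cache
        // forces reads from the underlying stream
        void seek(long newPosition) throws IOException {
            while (position < newPosition) {
                if (read() < 0) {
                    throw new IOException("seek past end of stream");
                }
            }
            position = newPosition;
        }
    }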


Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Stefan Bodewig <bo...@apache.org>.
On 2013-10-01, Damjan Jovanovic wrote:

> On Tue, Oct 1, 2013 at 6:09 AM, Stefan Bodewig <bo...@apache.org> wrote:

>> Reading may be simpler: here you can store the meta-information from the
>> start of the file in memory and then read entries as you go.  ZipFile
>> inside the zip package does something like this.

> From what I remember:

> The "meta-information" can be anywhere in the file, as can the
> compressed files themselves. The 7zip tool seems to write the
> meta-information at the end of the 7z file when multi-file archives
> are created.

Oh yes, my understanding has been pretty much wrong and re-reading your
implementation has helped me to see more clearly.  Right now I think the
important metadata actually is at the end, but there is a smaller part at
the front - in particular a pointer to the Header holding the metadata.
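
If I read the format right, the front part boils down to something like
this (untested sketch; the 32-byte signature header stores a
little-endian offset/size pair, the offset being relative to the
header's end):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class StartHeaderSketch {
        public static void main(String[] args) throws IOException {
            RandomAccessFile file = new RandomAccessFile("test.7z", "r");
            try {
                byte[] startHeader = new byte[32];
                file.readFully(startHeader);
                ByteBuffer buf = ByteBuffer.wrap(startHeader);
                buf.order(ByteOrder.LITTLE_ENDIAN);
                buf.position(12); // skip signature, version and header CRC
                long nextHeaderOffset = buf.getLong();
                long nextHeaderSize = buf.getLong();
                // the offset is relative to the end of the start header
                file.seek(32 + nextHeaderOffset);
                byte[] header = new byte[(int) nextHeaderSize];
                file.readFully(header); // this is where the metadata lives
            } finally {
                file.close();
            }
        }
    }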

> Compressed file codecs, positions, lengths, and solid compression
> details are only stored in the meta-information, so it's not possible
> to write a streaming reader without O(n) memory in the worst case.

I agree.

> Writing also requires seeking or O(n) memory, as the initial header at
> the beginning of the file contains the offset to the next header, and
> we only know the size/contents/location of the next header once all
> the files have been written.

or a temporary file to which the first header could be prepended - but
if you have that, you could use seeking as well.  So yes, I agree again.
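
With seeking it becomes the usual placeholder-then-patch dance, roughly
(hypothetical sketch; writeStartHeader is a made-up helper, not real
code from the sevenz package):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SeekBackWriteSketch {
        public static void main(String[] args) throws IOException {
            RandomAccessFile file = new RandomAccessFile("out.7z", "rw");
            try {
                file.write(new byte[32]); // placeholder start header
                // ... write the compressed entry data here ...
                long headerOffset = file.getFilePointer();
                // ... write the metadata header here ...
                long headerSize = file.getFilePointer() - headerOffset;
                // now that offset and size are known, go back and
                // fill in the real start header
                file.seek(0);
                writeStartHeader(file, headerOffset - 32, headerSize);
            } finally {
                file.close();
            }
        }

        // made-up helper: would write signature, version, CRCs and the
        // little-endian offset/size pair
        private static void writeStartHeader(RandomAccessFile file,
                long nextHeaderOffset, long nextHeaderSize)
                throws IOException {
            // left out here
        }
    }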

> Since we now have multiple archivers that require seeking, I suggest
> we add a SeekableStream class or something along those lines. The
> Commons Imaging project also has the same problem to solve for images,
> and it uses ByteSources, which can be arrays, files, or an InputStream
> wrapper that caches what has been read (so seeking is efficient, while
> it only reads as much from the InputStream as is necessary).

Interesting idea.

Right now I'm willing to postpone a streaming API for 7z and rather
cut a release with a files-only API.

Stefan


Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Damjan Jovanovic <da...@gmail.com>.
On Tue, Oct 1, 2013 at 6:09 AM, Stefan Bodewig <bo...@apache.org> wrote:
> On 2013-09-30, Benedikt Ritter wrote:
>
>> 2013/9/30 Stefan Bodewig <bo...@apache.org>
>
>> I'm in no way as familiar with the format as Damjan is, but IMHO it is
>> feasible - but likely pretty memory hungry.  Even more so for the
>> writing side.  Similar to zip, some information is stored in a central
>> place, but in this case at the front of the archive.
>
>> Just out of curiosity: is this memory problem related to Java or to 7z in
>> general?
>
> What Bernd said.
>
> Reading may be simpler: here you can store the meta-information from the
> start of the file in memory and then read entries as you go.  ZipFile
> inside the zip package does something like this.

From what I remember:

The "meta-information" can be anywhere in the file, as can the
compressed files themselves. The 7zip tool seems to write the
meta-information at the end of the 7z file when multi-file archives
are created. Compressed file codecs, positions, lengths, and solid
compression details are only stored in the meta-information, so it's
not possible to write a streaming reader without O(n) memory in the
worst case.

> When you consider writing, you'll have to write metadata about all
> entries before you even start to write the first bytes of the first
> entry.  Either you build up everything in memory or you use a temporary
> output.  This is not without precedent in Compress: pack200 allows users
> to choose between two strategies that provide exactly those two options.

Writing also requires seeking or O(n) memory, as the initial header at
the beginning of the file contains the offset to the next header, and
we only know the size/contents/location of the next header once all
the files have been written.

Since we now have multiple archivers that require seeking, I suggest
we add a SeekableStream class or something along those lines. The
Commons Imaging project also has the same problem to solve for images,
and it uses ByteSources, which can be arrays, files, or an InputStream
wrapper that caches what has been read (so seeking is efficient, while
it only reads as much from the InputStream as is necessary).
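
The shared abstraction could be as small as this (hypothetical sketch,
loosely modelled on Imaging's ByteSource; SeekableSource is a made-up
name):

    import java.io.IOException;

    // Hypothetical sketch of a shared seekable abstraction; concrete
    // implementations could be backed by a byte array, a File, or a
    // caching InputStream wrapper.
    interface SeekableSource {
        // moves the read position to an absolute offset
        void seek(long position) throws IOException;

        // reads up to len bytes at the current position; returns the
        // number of bytes read, or -1 at end of input
        int read(byte[] buffer, int offset, int len) throws IOException;

        long length() throws IOException;
    }

A File-backed implementation would just delegate to RandomAccessFile,
and the InputStream-backed one would cache what it has read, the way
Imaging does.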

> Stefan
>

Damjan
