Posted to dev@commons.apache.org by Torsten Curdt <tc...@vafer.org> on 2013/09/29 16:11:57 UTC

Re: [compress] Do we want 7z Archive*Stream-like classes

Hm - it is indeed a little misleading. So I am +0 for an inclusion.
Is a stream based implementation of 7z somewhat feasible - at least in
theory?

cheers,
Torsten


On Sun, Sep 29, 2013 at 8:09 AM, Stefan Bodewig <bo...@apache.org> wrote:

> Hi all,
>
> over this weekend I added 7z support to the compress antlib which I also
> like to use as a second testbed for Commons Compress - I even found a
> bug for archives that only contain empty directories.
>
> The antlib is based on the interface provided by Archive*Stream even
> when it is not using any streams at all, so I added
> SevenZ(In|Out)putStreams that only work on files and delegate all calls
> to the corresponding SevenZ(Out)File[1].  They are no streams at all.
>
> Would those classes be useful inside of Commons Compress or should they
> better be kept out as they'd promise more than they can hold?
>
> [1]
> http://svn.apache.org/repos/asf/ant/antlibs/compress/trunk/src/main/org/apache/ant/compress/util/SevenZStreamFactory.java
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Bernd Eckenfels <ec...@zusammenkunft.net>.

Hello,

I think it is not related to Java, but is a general problem of some file
formats with regard to streaming access.

If a format needs seeking/random access there are basically three options
(with the Java classes, but also in other languages). The first is having a
random access file (which in this context means you write the stream to a
temp file and work on that). The second is doing the buffering in memory
(mark/reset style). This might be a problem if you have to read from the
end of the file, as you need to keep everything in between in memory. The
third option would be to allow opening the provided input stream multiple
times (either by providing some form of "data source" or by supporting
clone/reset on the input stream). (Another option would be a random
access-like buffer, but the amount of work to do that might not be worth
it as you can easily use a temp file.)

For the 7z stream I guess the minimum that can be done is working with a
temp file. But a general idea for this (and other compressors) is an "if
you can provide multiple input streams you can use ..." API.
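
The temp-file option could be sketched roughly as below. This is only an
illustration with made-up names (TempFileSpool, spool, readAt, demo are
hypothetical and not part of Commons Compress): the stream is copied to a
temporary file once, after which any offset, including the end of the data,
can be read without holding everything in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not a Commons Compress class.
final class TempFileSpool {

    /** Copies the whole stream into a temporary file and returns its path. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Reads exactly len bytes starting at the given absolute offset. */
    static byte[] readAt(Path file, long offset, int len) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(len);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new EOFException("offset " + offset + " + " + len + " is past end of file");
                }
                pos += n;
            }
            return buf.array();
        }
    }

    /** Small demonstration: read the trailer first, then the header. */
    static String demo() {
        try {
            byte[] data = "HEADERpayloadTRAILER".getBytes(StandardCharsets.US_ASCII);
            Path tmp = spool(new ByteArrayInputStream(data));
            try {
                long size = Files.size(tmp);
                String tail = new String(readAt(tmp, size - 7, 7), StandardCharsets.US_ASCII);
                String head = new String(readAt(tmp, 0, 6), StandardCharsets.US_ASCII);
                return tail + "|" + head;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost is one full copy of the input to disk before the first entry can
be read, which is the trade-off against the in-memory buffering option.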

Greetings
Bernd

On 30.09.2013 at 18:47, Benedikt Ritter <br...@apache.org> wrote:

> 2013/9/30 Stefan Bodewig <bo...@apache.org>
>
>> On 2013-09-29, Torsten Curdt wrote:
>>
>> > Hm - it is indeed a little misleading. So I am +0 for an inclusion.
>>
>> This is what I feel as well.
>>
>> > Is a stream based implementation of 7z somewhat feasible - at least in
>> > theory?
>>
>> I'm in no way as familiar with the format as Damjan is but IMHO it is
>> feasible - but likely pretty memory hungry.  Even more so for the
>> writing side.  Similar to zip some information is stored in a central
>> place but in this case at the front of the archive.
>>
>
> Hi Stefan,
>
> just out of curiosity: is this memory problem related to Java or to 7z in
> general?
>
> Benedikt
>
>
>>
>> Stefan

-- 
http://www.zusammenkunft.net

Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by "dam6923 ." <da...@gmail.com>.

> Since we now have multiple archivers that require seeking, I suggest
> we add a SeekableStream class or something along those lines. The
> Commons Imaging project also has the same problem to solve for images,
> and it uses ByteSources, which can be arrays, files, or an InputStream
> wrapper that caches what has been read (so seeking is efficient, while
> it only reads as much from the InputStream as is necessary).

I would also like to advocate for this approach.  I was looking into
writing an implementation of a Google Snappy decompressor, but was
unable to wrap it effectively in an InputStream.  Having a seekable
stream would make my efforts a better fit for this library.

On Sun, Oct 6, 2013 at 9:25 AM, Stefan Bodewig <bo...@apache.org> wrote:
> On 2013-10-01, Damjan Jovanovic wrote:
>
>> On Tue, Oct 1, 2013 at 6:09 AM, Stefan Bodewig <bo...@apache.org> wrote:
>
>>> Reading may be simpler, here you can store the meta-information from the
>>> start of the file in memory and then read entries as you go, ZipFile
>>> inside the zip package does something like this.
>
>> From what I remember:
>
>> The "meta-information" can be anywhere in the file, as can the
>> compressed files themselves. The 7zip tool seems to write the
>> meta-information at the end of the 7z file when multi-file archives
>> are created.
>
> Oh yes, my understanding has been pretty much wrong and re-reading your
> implementation has helped me to see clearer.  Right now I think the
> important metadata actually is at the end but there is a smaller part at
> the front - in particular a pointer to the Header holding the metadata.
>
>> Compressed file codecs, positions, lengths, and solid compression
>> details are only stored in the meta-information, so it's not possible
>> to write a streaming reader without O(n) memory in the worst case.
>
> I agree.
>
>> Writing also requires seeking or O(n) memory, as the initial header at
>> the beginning of the file contains the offset to the next header, and
>> we only know the size/contents/location of the next header once all
>> the files have been written.
>
> or a temporary file to which the first header could be prepended - but
> if you have that, you could use seeking as well.  So yes, I agree again.
>
>> Since we now have multiple archivers that require seeking, I suggest
>> we add a SeekableStream class or something along those lines. The
>> Commons Imaging project also has the same problem to solve for images,
>> and it uses ByteSources, which can be arrays, files, or an InputStream
>> wrapper that caches what has been read (so seeking is efficient, while
>> it only reads as much from the InputStream as is necessary).
>
> Interesting idea.
>
> Right now I'm willing to postpone any streaming API for 7z and rather
> cut a release with a files-only API.
>
> Stefan

Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Stefan Bodewig <bo...@apache.org>.

On 2013-10-01, Damjan Jovanovic wrote:

> On Tue, Oct 1, 2013 at 6:09 AM, Stefan Bodewig <bo...@apache.org> wrote:

>> Reading may be simpler, here you can store the meta-information from the
>> start of the file in memory and then read entries as you go, ZipFile
>> inside the zip package does something like this.

> From what I remember:

> The "meta-information" can be anywhere in the file, as can the
> compressed files themselves. The 7zip tool seems to write the
> meta-information at the end of the 7z file when multi-file archives
> are created.

Oh yes, my understanding has been pretty much wrong and re-reading your
implementation has helped me to see clearer.  Right now I think the
important metadata actually is at the end but there is a smaller part at
the front - in particular a pointer to the Header holding the metadata.

> Compressed file codecs, positions, lengths, and solid compression
> details are only stored in the meta-information, so it's not possible
> to write a streaming reader without O(n) memory in the worst case.

I agree.

> Writing also requires seeking or O(n) memory, as the initial header at
> the beginning of the file contains the offset to the next header, and
> we only know the size/contents/location of the next header once all
> the files have been written.

or a temporary file to which the first header could be prepended - but
if you have that, you could use seeking as well.  So yes, I agree again.

> Since we now have multiple archivers that require seeking, I suggest
> we add a SeekableStream class or something along those lines. The
> Commons Imaging project also has the same problem to solve for images,
> and it uses ByteSources, which can be arrays, files, or an InputStream
> wrapper that caches what has been read (so seeking is efficient, while
> it only reads as much from the InputStream as is necessary).

Interesting idea.

Right now I'm willing to postpone any streaming API for 7z and rather
cut a release with a files-only API.

Stefan

Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Damjan Jovanovic <da...@gmail.com>.

On Tue, Oct 1, 2013 at 6:09 AM, Stefan Bodewig <bo...@apache.org> wrote:
> On 2013-09-30, Benedikt Ritter wrote:
>
>> 2013/9/30 Stefan Bodewig <bo...@apache.org>
>
>>> I'm in no way as familiar with the format as Damjan is but IMHO it is
>>> feasible - but likely pretty memory hungry.  Even more so for the
>>> writing side.  Similar to zip some information is stored in a central
>>> place but in this case at the front of the archive.
>
>> just out of curiosity: is this memory problem related to Java or to 7z in
>> general?
>
> What Bernd said.
>
> Reading may be simpler, here you can store the meta-information from the
> start of the file in memory and then read entries as you go, ZipFile
> inside the zip package does something like this.

From what I remember:

The "meta-information" can be anywhere in the file, as can the
compressed files themselves. The 7zip tool seems to write the
meta-information at the end of the 7z file when multi-file archives
are created. Compressed file codecs, positions, lengths, and solid
compression details are only stored in the meta-information, so it's
not possible to write a streaming reader without O(n) memory in the
worst case.

> When you consider writing you'll have to write metadata about all
> entries before you even start to write the first bytes of the first
> entry.  Either you build up everything in memory or you use a temporary
> output.  This is not without precedent in Compress, pack200 allows users
> to choose between two strategies that provide exactly those two options.

Writing also requires seeking or O(n) memory, as the initial header at
the beginning of the file contains the offset to the next header, and
we only know the size/contents/location of the next header once all
the files have been written.
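
To illustrate why seeking helps on the writing side, here is a sketch using
a toy layout, which is NOT the real 7z format: an 8-byte slot at the start
holds the offset of the metadata, the entries follow, and the metadata (an
entry count plus one length per entry) comes last. The class and method
names (SeekBackWriter, write, readEntryLengths, demo) are invented for this
example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Toy layout for illustration only; it does not match the 7z specification.
final class SeekBackWriter {

    /** Writes the entries and returns the offset at which the metadata starts. */
    static long write(Path file, byte[][] entries) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.setLength(0);
            raf.writeLong(0L); // placeholder: metadata offset is not known yet
            for (byte[] entry : entries) {
                raf.write(entry);
            }
            long metaOffset = raf.getFilePointer();
            raf.writeInt(entries.length);
            for (byte[] entry : entries) {
                raf.writeInt(entry.length);
            }
            raf.seek(0); // go back and patch the placeholder
            raf.writeLong(metaOffset);
            return metaOffset;
        }
    }

    /** Follows the pointer at the start of the file and reads the entry lengths. */
    static int[] readEntryLengths(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long metaOffset = raf.readLong();
            raf.seek(metaOffset);
            int[] lengths = new int[raf.readInt()];
            for (int i = 0; i < lengths.length; i++) {
                lengths[i] = raf.readInt();
            }
            return lengths;
        }
    }

    /** Small demonstration with two entries of 3 and 5 bytes. */
    static String demo() {
        try {
            Path tmp = Files.createTempFile("seekback-", ".bin");
            try {
                byte[][] entries = {
                    "abc".getBytes(StandardCharsets.US_ASCII),
                    "hello".getBytes(StandardCharsets.US_ASCII)
                };
                long metaOffset = write(tmp, entries);
                return metaOffset + ":" + Arrays.toString(readEntryLengths(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Without seek(0) the writer would have to know the metadata offset before
writing the first byte, which means buffering all entries first.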

Since we now have multiple archivers that require seeking, I suggest
we add a SeekableStream class or something along those lines. The
Commons Imaging project also has the same problem to solve for images,
and it uses ByteSources, which can be arrays, files, or an InputStream
wrapper that caches what has been read (so seeking is efficient, while
it only reads as much from the InputStream as is necessary).
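
A rough sketch of such a caching wrapper follows. It is a hypothetical
class written for this mail (CachingSeekableInput is not the Commons
Imaging ByteSource API and not an existing Commons Compress class), and it
uses int positions for brevity, so it is limited to inputs under 2 GB.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical wrapper: caches what has been read so earlier positions can be revisited.
final class CachingSeekableInput {
    private final InputStream in;
    private byte[] cache = new byte[16];
    private int cached;   // number of valid bytes in cache
    private boolean eof;

    CachingSeekableInput(InputStream in) {
        this.in = in;
    }

    /** Pulls bytes from the underlying stream until 'end' bytes are cached or EOF is hit. */
    private void fillTo(int end) throws IOException {
        if (end > cache.length) {
            cache = Arrays.copyOf(cache, Math.max(end, cache.length * 2));
        }
        while (cached < end && !eof) {
            int n = in.read(cache, cached, end - cached);
            if (n < 0) {
                eof = true;
            } else {
                cached += n;
            }
        }
    }

    /** Reads up to len bytes at absolute position pos; returns -1 if pos is at or past EOF. */
    int readAt(int pos, byte[] dst, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        fillTo(pos + len);
        int available = Math.min(len, cached - pos);
        if (available <= 0) {
            return -1;
        }
        System.arraycopy(cache, pos, dst, off, available);
        return available;
    }

    int cachedBytes() {
        return cached;
    }

    /** Small demonstration: a forward read, then a backward read served from the cache. */
    static String demo() {
        try {
            byte[] data = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
            CachingSeekableInput src = new CachingSeekableInput(new ByteArrayInputStream(data));
            byte[] buf = new byte[3];
            int n1 = src.readAt(4, buf, 0, 3);
            String first = new String(buf, 0, n1, StandardCharsets.US_ASCII);
            int cachedAfterFirst = src.cachedBytes();
            int n2 = src.readAt(0, buf, 0, 2);
            String second = new String(buf, 0, n2, StandardCharsets.US_ASCII);
            return first + "," + cachedAfterFirst + "," + second + "," + src.cachedBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Seeking backwards costs nothing once the bytes are cached, but reading near
the end of the input still pulls everything before it into memory, which is
the O(n) worst case mentioned above.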

> Stefan
>

Damjan

Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Stefan Bodewig <bo...@apache.org>.

On 2013-09-30, Benedikt Ritter wrote:

> 2013/9/30 Stefan Bodewig <bo...@apache.org>

>> I'm in no way as familiar with the format as Damjan is but IMHO it is
>> feasible - but likely pretty memory hungry.  Even more so for the
>> writing side.  Similar to zip some information is stored in a central
>> place but in this case at the front of the archive.

> just out of curiosity: is this memory problem related to Java or to 7z in
> general?

What Bernd said.

Reading may be simpler, here you can store the meta-information from the
start of the file in memory and then read entries as you go, ZipFile
inside the zip package does something like this.

When you consider writing you'll have to write metadata about all
entries before you even start to write the first bytes of the first
entry.  Either you build up everything in memory or you use a temporary
output.  This is not without precedent in Compress, pack200 allows users
to choose between two strategies that provide exactly those two options.

Stefan

Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Benedikt Ritter <br...@apache.org>.

2013/9/30 Stefan Bodewig <bo...@apache.org>

> On 2013-09-29, Torsten Curdt wrote:
>
> > Hm - it is indeed a little misleading. So I am +0 for an inclusion.
>
> This is what I feel as well.
>
> > Is a stream based implementation of 7z somewhat feasible - at least in
> > theory?
>
> I'm in no way as familiar with the format as Damjan is but IMHO it is
> feasible - but likely pretty memory hungry.  Even more so for the
> writing side.  Similar to zip some information is stored in a central
> place but in this case at the front of the archive.
>

Hi Stefan,

just out of curiosity: is this memory problem related to Java or to 7z in
general?

Benedikt


>
> Stefan


-- 
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [compress] Do we want 7z Archive*Stream-like classes

Posted by Stefan Bodewig <bo...@apache.org>.

On 2013-09-29, Torsten Curdt wrote:

> Hm - it is indeed a little misleading. So I am +0 for an inclusion.

This is what I feel as well.

> Is a stream based implementation of 7z somewhat feasible - at least in
> theory?

I'm in no way as familiar with the format as Damjan is but IMHO it is
feasible - but likely pretty memory hungry.  Even more so for the
writing side.  Similar to zip some information is stored in a central
place but in this case at the front of the archive.

Stefan
