Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2013/01/03 17:22:29 UTC

[compress] not reading archive stream completely

Hi all,

COMPRESS-202 and COMPRESS-206 only talk about TAR but something similar
applies at least to ZIP as well: once we detect that an archive doesn't
contain any more entries, we stop reading the input stream, even if it
still contains data that is part of the archive.  This causes problems
for use cases where the stream holds interesting data after the
archive.

I'm a bit torn between two approaches - and it is quite possible I'm
overlooking even more alternatives.

(1) As soon as we detect there are no more entries, we immediately read
the remainder and consume all of the stream that made up the archive.
At least for ZIP and TAR this is possible as getNextEntry "knows" when
it has seen the last entry.

(2) Add an additional "readRemainderOfArchive() throws IOException" method
to ArchiveInputStream (or just to the streams of the affected formats) that
could be invoked at any time and consume as much of the stream as belongs
to the archive.

Alternative (1) somewhat breaks backwards compatibility - but only
for some contrived cases AFAICT.  Alternative (2) would be useful in a
case where the user isn't interested in the rest of an archive's content
after finding an entry but still wants to consume it completely.

I realize the two alternatives could be implemented at the same time,
where the most naive implementation of readRemainderOfArchive simply
reads entries until getNextEntry returns null.
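
For illustration, here is a minimal sketch of that naive variant.  Only the
method name readRemainderOfArchive is taken from the proposal above; the
body is merely what "read entries until getNextEntry returns null" would
look like inside ArchiveInputStream, relying on (1) to consume the archive
trailer once the last entry has been seen:

    /**
     * Sketch only: consume whatever part of the underlying stream still
     * belongs to the archive by skipping over all remaining entries.
     */
    public void readRemainderOfArchive() throws IOException {
        // getNextEntry() skips the data of the current entry before it
        // reads the next header, so this loop drains the archive; once it
        // returns null, (1) would have consumed the archive trailer too.
        while (getNextEntry() != null) {
            // nothing to do, getNextEntry() already advanced the stream
        }
    }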

Any thoughts?

    Stefan



Re: [compress] not reading archive stream completely

Posted by Stefan Bodewig <bo...@apache.org>.
On 2013-01-17, Bear Giles wrote:

> I think a number of applications use a concatenation of a standard archive
> format and custom data.

Absolutely, and I think we should support that use-case.

I'll try to free up some coding time this weekend.

Stefan



Re: [compress] not reading archive stream completely

Posted by Bear Giles <bg...@coyotesong.com>.
I think a number of applications use a concatenation of a standard archive
format and custom data. The best known is probably .rpm, which is/was a
cpio stream immediately followed by additional information (iirc - it might
go the other way). In any case a developer might expect to have the input
stream placed at the end of the archive, not at the end of the input stream.
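
A rough sketch of that use case (assuming the stream really does end up
positioned right after the archive once the last entry has been returned -
which is the behaviour under discussion here, not what the code does today;
the helper name readArchiveThenTrailer is made up):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

    public class ArchivePlusTrailer {
        // Read a tar archive that is immediately followed by custom data on
        // the same stream, then hand the trailer bytes to the application.
        static void readArchiveThenTrailer(InputStream in) throws IOException {
            TarArchiveInputStream tar = new TarArchiveInputStream(in);
            while (tar.getNextTarEntry() != null) {
                // consume or skip the entry's data here as needed
            }
            // if getNextTarEntry() left "in" at the end of the archive,
            // everything that remains is the custom trailer
            int b;
            while ((b = in.read()) != -1) {
                // process one trailer byte
            }
        }
    }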

On the zip 'central directory' - one of the big 'wins' of the zip format is
that the central directory allows you to seek directly to a file instead of
having to scan the archive sequentially. For various reasons (e.g., the need
to support streaming modes) it has to go at the end of the archive. The unix
backup format has the directory at the top of the archive, but it was
optimized for backups that spanned multiple tapes, so the cost of
precomputing the values was worth it.

Bear


On Thu, Jan 17, 2013 at 7:42 AM, Torsten Curdt <tc...@vafer.org> wrote:

> > For tar it would be one block (usually 512 bytes), for zip the full
> > central directory has to be read, which could be quite a bit.
> >
>
> Urgh. Because that's at the end for zips? That's not so good then.
>
>
> >
> > I currently plan to implement it inside getNextEntry as it is cleaner.
> > In the tar case I vaguely recall some implementations only write one EOF
> > marker, so a more careful approach is needed in order to not read beyond
> > the end of the archive (likely mark and reset if the stream supports
> > mark).
> >
>
> Hm. That suddenly makes (2) much more interesting again.
> I can see the back and forth on this :)
>

Re: [compress] not reading archive stream completely

Posted by Torsten Curdt <tc...@vafer.org>.
> For tar it would be one block (usually 512 bytes), for zip the full
> central directory has to be read, which could be quite a bit.
>

Urgh. Because that's at the end for zips? That's not so good then.


>
> I currently plan to implement it inside getNextEntry as it is cleaner.
> In the tar case I vaguely recall some implementations only write one EOF
> marker, so a more careful approach is needed in order to not read beyond
> the end of the archive (likely mark and reset if the stream supports
> mark).
>

Hm. That suddenly makes (2) much more interesting again.
I can see the back and forth on this :)

Re: [compress] not reading archive stream completely

Posted by Stefan Bodewig <bo...@apache.org>.
On 2013-01-17, Torsten Curdt wrote:

> If we see `getNextEntry` return null we should position the stream at
> the end of the archive.  I think that's your (1). Sounds simpler and
> more straightforward from an API POV.  IIUC that reading should only
> be a few bytes. A second EOF marker for TAR for example.

For tar it would be one block (usually 512 bytes), for zip the full
central directory has to be read, which could be quite a bit.

I currently plan to implement it inside getNextEntry as it is cleaner.
In the tar case I vaguely recall some implementations only write one EOF
marker, so a more careful approach is needed in order to not read beyond
the end of the archive (likely mark and reset if the stream supports
mark).
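
Roughly what I have in mind for the tar case - just a sketch, the method
name and the details are made up and not the actual TarArchiveInputStream
code (java.io.InputStream and IOException assumed imported): after the
first all-zero record, peek at the next record and only consume it when it
is a second all-zero record.

    private void tryToConsumeSecondEOFRecord(InputStream in) throws IOException {
        final int recordSize = 512;
        if (!in.markSupported()) {
            // we can't look ahead safely, so leave the stream where it is
            return;
        }
        in.mark(recordSize);
        byte[] record = new byte[recordSize];
        int off = 0, n;
        while (off < recordSize && (n = in.read(record, off, recordSize - off)) != -1) {
            off += n;
        }
        boolean secondEOFRecord = off == recordSize;
        for (int i = 0; secondEOFRecord && i < recordSize; i++) {
            secondEOFRecord = record[i] == 0;
        }
        if (!secondEOFRecord) {
            // whatever follows is not part of the archive, push it back
            in.reset();
        }
    }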

Thanks

        Stefan



Re: [compress] not reading archive stream completely

Posted by Torsten Curdt <tc...@vafer.org>.
Late reply but anyway.

If we see `getNextEntry` return null we should position the stream at the
end of the archive.
I think that's your (1). Sounds simpler and more straightforward from an
API POV.
IIUC that reading should only be a few bytes. A second EOF marker for TAR
for example.

Did I get that right?

My 2 cents
--
Torsten


On Thu, Jan 3, 2013 at 5:22 PM, Stefan Bodewig <bo...@apache.org> wrote:

> Hi all,
>
> COMPRESS-202 and COMPRESS-206 only talk about TAR but something similar
> applies at least to ZIP as well: once we detect that an archive doesn't
> contain any more entries, we stop reading the input stream, even if it
> still contains data that is part of the archive.  This causes problems
> for use cases where the stream holds interesting data after the
> archive.
>
> I'm a bit torn between two approaches - and it is quite possible I'm
> overlooking even more alternatives.
>
> (1) As soon as we detect there are no more entries, we immediately read
> the remainder and consume all of the stream that made up the archive.
> At least for ZIP and TAR this is possible as getNextEntry "knows" when
> it has seen the last entry.
>
> (2) Add an additional "readRemainderOfArchive() throws IOException" method
> to ArchiveInputStream (or just to the streams of the affected formats) that
> could be invoked at any time and consume as much of the stream as belongs
> to the archive.
>
> Alternative (1) somewhat breaks backwards compatibility - but only
> for some contrived cases AFAICT.  Alternative (2) would be useful in a
> case where the user isn't interested in the rest of an archive's content
> after finding an entry but still wants to consume it completely.
>
> I realize the two alternatives could be implemented at the same time,
> where the most naive implementation of readRemainderOfArchive simply
> reads entries until getNextEntry returns null.
>
> Any thoughts?
>
>     Stefan
>