Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2021/06/04 19:43:56 UTC

[compress] 7z and Recovering Corrupt Archives

Hi all

7z archives provide CRCs for the metadata section so you can quickly
identify a wide range of broken archives - which is far better than what
you get for ZIP for example.

It is possible to recover from one particular kind of broken archive:
one that has been written almost completely, with only the CRC and the
locator of the metadata missing. The docs talk about disks/drives being
removed prematurely.

The basic idea is to search backwards from the end of the file for the
metadata and try to parse it. This is what SevenZFile does and has
always done. It is also the root cause of
https://issues.apache.org/jira/browse/COMPRESS-542 - the file ends with
something that looks like the metadata of an archive with lots and lots
of files in it, and allocating the arrays for them leads to an OOM.
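
To illustrate the backwards search, here is a minimal Java sketch. It is
not the SevenZFile code: the class and method names are made up, and the
two marker values (0x01 for a plain header, 0x17 for an encoded header)
are taken to be the property IDs from the 7z format description, so
treat them as assumptions. The sketch only collects the offsets that
would have to be tried; each one would still need a real parse attempt.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not the SevenZFile implementation. */
class TailScanSketch {
    // Assumed property IDs that can start a metadata block.
    static final int NID_HEADER = 0x01;
    static final int NID_ENCODED_HEADER = 0x17;

    /**
     * Walks backwards through the tail of a file and collects every offset
     * whose byte could be the first byte of a metadata block.
     */
    static List<Integer> findCandidates(byte[] tail) {
        List<Integer> candidates = new ArrayList<>();
        for (int pos = tail.length - 1; pos >= 0; pos--) {
            int b = tail[pos] & 0xFF; // read the byte as unsigned
            if (b == NID_HEADER || b == NID_ENCODED_HEADER) {
                candidates.add(pos);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        byte[] tail = {0x00, 0x17, 0x42, 0x7F, 0x01, 0x00};
        // Offsets are reported from the end of the tail towards its start.
        System.out.println(findCandidates(tail)); // prints [4, 1]
    }
}
```

With two marker values, each byte of random noise has a 2 in 256 chance
of matching, so a megabyte of noise can be expected to yield thousands
of offsets that each need to be parsed and rejected.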

Current master will detect corrupt archives more quickly - in particular
without excessive allocations - but it may still take quite some time to
reject thousands of candidates for "this could be the first byte of
proper metadata". We are scanning the last megabyte of the file and
there is ample chance this last megabyte may contain random noise that
looks promising.
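
One way to reject a bogus candidate without a large allocation is to
sanity-check a claimed count against the bytes that are actually left
before sizing any array from it. The following Java sketch shows the
idea only. The class, the method and the "at least one byte per entry"
rule are assumptions made for illustration, not a description of what
master does.

```java
import java.io.IOException;

/** Illustrative only; names and the plausibility rule are assumptions. */
class AllocationGuardSketch {
    /**
     * Rejects a claimed entry count that could not possibly be backed by
     * the remaining bytes, before any array is allocated for it.
     */
    static int checkedEntryCount(long claimedCount, long bytesRemaining) throws IOException {
        // Assumption: every entry needs at least one byte of metadata.
        if (claimedCount < 0
                || claimedCount > bytesRemaining
                || claimedCount > Integer.MAX_VALUE) {
            throw new IOException("Implausible entry count " + claimedCount
                    + " for " + bytesRemaining + " remaining bytes");
        }
        return (int) claimedCount;
    }
}
```

A caller would run this check first and only then allocate arrays of the
returned size, so noise that claims a billion entries fails with an
IOException instead of an OutOfMemoryError.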

Personally I believe that almost nobody actually needs this mode of
recovery.

Therefore I think we might want to introduce an option that enables
the recovery mode. If it were disabled and we found the CRC was missing,
we'd throw a new, specific exception that says "you may want to try with
recovery enabled instead".
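
The proposed behaviour could look roughly like the Java sketch below.
All names here (RecoveryOptionSketch, RecoveryDisabledException,
shouldScanForMetadata and the two boolean parameters) are hypothetical
and exist only for this illustration. Nothing in it is existing Commons
Compress API.

```java
import java.io.IOException;

/** Illustrative only; every name in this class is hypothetical. */
class RecoveryOptionSketch {
    /** Hypothetical exception that tells the caller about the option. */
    static class RecoveryDisabledException extends IOException {
        RecoveryDisabledException(String message) {
            super(message);
        }
    }

    /**
     * Decides what to do once the start of the archive has been read.
     * Returns true if the caller should scan the tail for metadata.
     */
    static boolean shouldScanForMetadata(boolean crcPresent, boolean recoveryEnabled)
            throws IOException {
        if (crcPresent) {
            return false; // normal path, the metadata locator can be trusted
        }
        if (!recoveryEnabled) {
            // Fail fast with a message that points at the opt-in.
            throw new RecoveryDisabledException(
                    "CRC is missing; you may want to try with recovery enabled instead");
        }
        return true; // caller opted in to the slow backwards search
    }
}
```

Making the exception a subclass of IOException is a design choice of the
sketch: callers that already catch IOException keep working, while
callers that care can catch the specific type and retry with recovery
enabled.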

Making this new option default to disabling recovery would break
backwards compatibility but it is tempting to think this could be
fine. I'm a bit torn here. What do you think?


Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] 7z and Recovering Corrupt Archives

Posted by Peter Lee <pe...@apache.org>.
+1 for the new option.

Failing fast on corrupted archives could help a lot.
Lee
On 6 5 2021, at 4:32, Gary Gregory <ga...@gmail.com> wrote:
> In general, I think fail fast is ok with a clear exception message.
>
> Gary


Re: [compress] 7z and Recovering Corrupt Archives

Posted by Bernd Eckenfels <ec...@zusammenkunft.net>.
Hello,

I would agree: fail fast and be strict by default. That should also help us get fewer (fuzzing-discovered) DoS security reports, and it would prevent file type confusion, which is a very real attack, especially for archives.

Bernd
--
http://bernd.eckenfels.net
________________________________
From: Gary Gregory <ga...@gmail.com>
Sent: Friday, June 4, 2021 10:32:22 PM
To: Commons Developers List <de...@commons.apache.org>
Subject: Re: [compress] 7z and Recovering Corrupt Archives

In general, I think fail fast is ok with a clear exception message.

Gary


Re: [compress] 7z and Recovering Corrupt Archives

Posted by Gary Gregory <ga...@gmail.com>.
In general, I think fail fast is ok with a clear exception message.

Gary
