Posted to issues@commons.apache.org by "Tijmen R (Jira)" <ji...@apache.org> on 2019/09/01 22:54:00 UTC
[jira] [Commented] (COMPRESS-450) Enable skipping past invalid tar header entries

    [ https://issues.apache.org/jira/browse/COMPRESS-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920527#comment-16920527 ] 

Tijmen R commented on COMPRESS-450:
-----------------------------------

Hi Stefan,

I finally had some time to look at the code. Two remarks:
 * Currently, headerErrorOccurred is initialized to false on construction of the TAIS, and then only read in line 288. I think it should be (re)set to false as well between lines 308-309 (when a new TAE was successfully created), and set to true between lines 309-310 (if the TAE creation failed, before throwing the error).
 * If the getRecord() call in line 286 returns null and headerErrorOccurred is true, I think this leads to a NullPointerException in the call to TarUtils.verifyChecksum(headerBuf)?
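
To make the two remarks concrete, here is a minimal sketch of the flag handling I have in mind. It is not the attached TarArchiveInputStream: HeaderStateSketch, RecordSource and HeaderParser are hypothetical stand-ins for the stream, getRecord() and the TarArchiveEntry construction, and I am assuming that a failed parse surfaces as an IllegalArgumentException that gets wrapped in an IOException.

```java
import java.io.IOException;

// Hypothetical stand-in for the patched stream; only the flag handling is modelled.
class HeaderStateSketch {
    private boolean headerErrorOccurred = false;

    // Hypothetical abstraction of getRecord(): returns null at end of stream.
    interface RecordSource {
        byte[] getRecord() throws IOException;
    }

    // Hypothetical abstraction of building an entry from a header block.
    interface HeaderParser {
        Object parse(byte[] header);
    }

    Object nextEntry(RecordSource source, HeaderParser parser) throws IOException {
        byte[] headerBuf = source.getRecord();
        if (headerBuf == null) {
            // Return before any checksum verification can dereference null.
            return null;
        }
        try {
            Object entry = parser.parse(headerBuf);
            // Entry created successfully: reset the flag.
            headerErrorOccurred = false;
            return entry;
        } catch (IllegalArgumentException e) {
            // Entry creation failed: remember that before throwing.
            headerErrorOccurred = true;
            throw new IOException("Error detected parsing the header", e);
        }
    }

    boolean headerErrorOccurred() {
        return headerErrorOccurred;
    }
}
```

With this shape the flag always reflects the outcome of the most recent header read, and a null record never reaches the checksum check.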

I'll try to get some corrupted archives and test later this week.

 

> Enable skipping past invalid tar header entries
> -----------------------------------------------
>
>                 Key: COMPRESS-450
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-450
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Archivers
>    Affects Versions: 1.16.1
>            Reporter: Tijmen R
>            Priority: Minor
>              Labels: newbie
>         Attachments: TarArchiveInputStream.java
>
>
> In TarArchiveInputStream::getNextTarEntry(), if reading and parsing the header fails, an IOException is thrown. State (e.g. currEntry) is not cleared, and trying to get any further entries/data from the archive is thus not possible.
> In our use case, we sometimes encounter corrupt tar archives where the data following a header (that specifies a non-zero data size) is completely or partly missing; for example, the data for hdr_b in the stream:
>  
> {noformat}
> ...[hdr_a][data_a1]...[data_an][hdr_b][hdr_c][data_c1][data_c2]...[data_cn]...{noformat}
>  
> We have no influence on how these archives are created, so we cannot fix it on that side. However, it would be nice to be able to at least pick up reading the tar file at the next valid header it finds, so that most of the data can be retrieved. In other words, similar to the behaviour of GNU tar:
>  * If reading/parsing the header fails, and no header was read successfully before, or the previous header read attempt failed as well, then fail completely
>  * Otherwise if reading/parsing the header fails, throw an error. A next call to getNextTarEntry will read blocks until it finds one that has a valid header checksum, and try to parse that as a header.
> The attached version of TarArchiveInputStream does this.
> Some issues with this approach:
>  * In the example stream given above, the hdr_c and subsequent blocks (depending on the data size specified in hdr_b) will already have been returned/read as data for b. However, that is also the case in the current version of TarArchiveInputStream.
>  * So (at least) file c is lost, and the next entry to be picked up will likely be hdr_d (or even later). Data blocks that look like a tar header but are not one in the current context might be misinterpreted as headers (this can occur, for example, with a tar archive stored inside the main tar archive).
>  * Currently, the code just throws an IOException with a different error message, as I didn't want to change the behaviour too much. But it would be a lot better to have a different exception (child of IOException) for a "header parse" error, to distinguish it from a general IO exception reading the underlying stream.
>  * I'm not too sure about what to do in case of a "fatal" error (skip to the end of file?)
> Still, the above has been useful for us, and maybe this benefits others as well.
>  
>  
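
The recovery step described in the quoted issue (read blocks until one has a valid header checksum) can be sketched roughly as follows. This is a standalone illustration, not the attached code: TarResyncSketch and TarHeaderParseException are hypothetical names, and the checksum is computed by hand on the assumption of the usual tar layout (512-byte records, an 8-byte octal checksum field at offset 148 that counts as spaces when summing).

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical IOException subclass so callers can tell a bad header from a failing stream.
class TarHeaderParseException extends IOException {
    TarHeaderParseException(String message) {
        super(message);
    }
}

class TarResyncSketch {
    static final int RECORD_SIZE = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LEN = 8;

    // Sum of all header bytes as unsigned values, counting the checksum field as spaces.
    static long computeChecksum(byte[] record) {
        long sum = 0;
        for (int i = 0; i < RECORD_SIZE; i++) {
            boolean inField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inField ? ' ' : (record[i] & 0xff);
        }
        return sum;
    }

    // Parses the stored octal checksum; returns -1 if the field holds no octal digits.
    static long storedChecksum(byte[] record) {
        long value = 0;
        int digits = 0;
        for (int i = CHKSUM_OFFSET; i < CHKSUM_OFFSET + CHKSUM_LEN; i++) {
            byte b = record[i];
            if (b >= '0' && b <= '7') {
                value = value * 8 + (b - '0');
                digits++;
            } else if (digits > 0 || b != ' ') {
                break; // terminator after the digits, or junk before any digit
            }
        }
        return digits == 0 ? -1 : value;
    }

    static boolean hasValidChecksum(byte[] record) {
        long stored = storedChecksum(record);
        return stored >= 0 && stored == computeChecksum(record);
    }

    // Normal path: read one record and fail with the dedicated exception if it is not a header.
    static byte[] readHeader(InputStream in) throws IOException {
        byte[] record = in.readNBytes(RECORD_SIZE);
        if (record.length < RECORD_SIZE) {
            return null; // end of stream or truncated trailing record
        }
        if (!hasValidChecksum(record)) {
            throw new TarHeaderParseException("invalid header checksum");
        }
        return record;
    }

    // Recovery path: skip records until one has a valid checksum; null at end of stream.
    static byte[] nextValidHeader(InputStream in) throws IOException {
        byte[] record = new byte[RECORD_SIZE];
        while (in.readNBytes(record, 0, RECORD_SIZE) == RECORD_SIZE) {
            if (hasValidChecksum(record)) {
                return record;
            }
        }
        return null;
    }
}
```

A caller would use readHeader() normally and switch to nextValidHeader() on the call that follows a TarHeaderParseException. As noted in the issue, a data block that happens to carry a valid checksum (for instance a header of a nested tar archive) would still be picked up as a header by this scan.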



--
This message was sent by Atlassian Jira
(v8.3.2#803003)