You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Stefan Bodewig (JIRA)" <ji...@apache.org> on 2018/04/26 16:06:00 UTC
[jira] [Commented] (COMPRESS-450) Enable skipping past invalid tar header entries

    [ https://issues.apache.org/jira/browse/COMPRESS-450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16454458#comment-16454458 ] 

Stefan Bodewig commented on COMPRESS-450:
-----------------------------------------

Many thanks.

IIUC {{getNextTarEntry}} still throws an exception if it encounters an invalid header but you've changed it so that {{getNextTarEntry}} can be invoked a second time and this time it tries to scan forward and if this fails again the stream fives up.

One alternative I see is moving the state machine out of {{TarArchiveInputStream}} into the calling code and provide the "scan forward" logic as a new method. In your case you'd call {{getNextTarEntry}}, catch the exception, call the new method which positions the stream correctly for the next call of  {{getNextTarEntry}} (if possible, and I'm glossing over implementation details here) and could then invoke {{getNextTarEntry}}, a second time. Do you see any drawbacks in this approach? This would allow scanning forward more than once, if anybody needed it and I think it might be easier to implement inside of the stream as well - I may certainly be wrong.

> Enable skipping past invalid tar header entries
> -----------------------------------------------
>
>                 Key: COMPRESS-450
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-450
>             Project: Commons Compress
>          Issue Type: Improvement
>          Components: Archivers
>    Affects Versions: 1.16.1
>            Reporter: Tijmen Ramakers
>            Priority: Minor
>              Labels: newbie
>         Attachments: TarArchiveInputStream.java
>
>
> In TarArchiveInputStream::getNextTarEntry(), if reading an parsing the header fails, an IOException is thrown. State (e.g. currEntry) is not cleared, and trying to get any further entries/data from the archive is thus not possible.
> In our use case, we sometimes encounter corrupt tar archives where the data following a header (that specifies a non-zero data size) is completely or partly missing; for example as for hdr_b in the stream:
>  
> {noformat}
> ...[hdr_a][data_a1]...[data_an][hdr_b][hdr_c][data_c1][data_c2]...[data_cn]...{noformat}
>  
> We have no influence on how these archives are created, so cannot fix it on that side. However, it would be nice to be able to at least pick up reading the tar file at the next valid header it finds, so at least most of the data can be retrieved. In other words, similar to the behaviour of gnu tar:
>  * If reading/parsing the header fails, and no header was read successfully before, or the previous header read attempt failed as well, then fail completely
>  * Otherwise if reading/parsing the header fails, throw an error. A next call to getNextTarEntry will read blocks until it finds one that has a valid header checksum, and try to parse that as a header.
> The attached version of TarArchiveInputStream does this.
> Some issues with this approach:
>  * In the example stream given above, the hdr_c and subsequent blocks (depending on the data size specified in hdr_b) will already have been returned/read as data for b. However, that is also the case in the current version of TarArchiveInputStream.
>  * So, (at least) file c is lost, and the next entry to be picked up will likely be hdr_d (or even later). Data blocks that look like a tar header at first sight but actually (in the current context) aren't, might be misinterpreted to be headers (this can occur for example with a tar archive stored inside a main tar archive).
>  * Currently, the code just throws an IOException with a different error message, as I didn't want to change the behaviour too much. But it would be a lot better to have a different exception (child of IOException) for a "header parse" error, to distinguish it from a general IO exception reading the underlying stream.
>  * I'm not too sure about what to do in case of a "fatal" error (skip to the end of file?)
> Still, the above has been useful for us, and maybe this benefits others as well.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)