Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2020/06/10 21:29:00 UTC

[jira] [Commented] (TIKA-3110) cannot extract metadata from 7z .tar archive

    [ https://issues.apache.org/jira/browse/TIKA-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132728#comment-17132728 ] 

Tim Allison commented on TIKA-3110:
-----------------------------------

{{noformat}}
Caused by: java.io.IOException: tried to skip 7168 but actually skipped: 0
	at org.apache.tika.io.TikaInputStream.skip(TikaInputStream.java:717)
	at org.apache.commons.io.input.ProxyInputStream.skip(ProxyInputStream.java:117)
	at org.apache.commons.compress.utils.IOUtils.skip(IOUtils.java:113)
	at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.consumeRemainderOfLastBlock(TarArchiveInputStream.java:987)
	at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getRecord(TarArchiveInputStream.java:487)
	at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextTarEntry(TarArchiveInputStream.java:360)
	at org.apache.commons.compress.archivers.tar.TarArchiveInputStream.getNextEntry(TarArchiveInputStream.java:799)

{{noformat}}

This is a regression (or new feature?) going from 1.24 -> 1.24.1.

For the sake of security, I changed TikaInputStream's skip() to require that the given number of bytes actually be skipped.  This prevents infinite loops in parsers that forget to check the return value of skip() and/or trust FileInputStream.skip(), which no one ever, ever should.
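
To illustrate the idea, here is a rough sketch of a "strict" skip. This is not the actual TikaInputStream code; the class name StrictSkip and the method skipFully are made up for this example, and the exception message simply mirrors the one in the stack trace above.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only (not Tika's implementation).
final class StrictSkip {

    // Skips exactly n bytes or throws; never silently returns a short count.
    static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                // skip() made no progress: read one byte to tell a lazy
                // skip() apart from a genuine end of stream.
                if (in.read() == -1) {
                    break;
                }
                skipped = 1;
            }
            remaining -= skipped;
        }
        if (remaining > 0) {
            throw new IOException("tried to skip " + n
                    + " but actually skipped: " + (n - remaining));
        }
        return n;
    }
}
```

With a check like this, a caller that loops until N bytes have been skipped fails fast at end of stream instead of spinning forever on a skip() that keeps returning 0.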

My sense was that there may be some mp4s out there that will cause problems (e.g. they can sometimes end mid-frame), and I'm now thinking we hit this earlier with .tar files.

[~bodewig] would you or a colleague on commons-compress know if we should expect this behavior for tar files, where they allege they have more data but actually don't?

In short, is this something we should throw an exception for or should we happily let the tar file allege it has more bytes than it does?

> cannot extract metadata from 7z .tar archive
> --------------------------------------------
>
>                 Key: TIKA-3110
>                 URL: https://issues.apache.org/jira/browse/TIKA-3110
>             Project: Tika
>          Issue Type: Bug
>          Components: mime, parser
>    Affects Versions: 1.24.1
>            Reporter: Alex
>            Priority: Major
>
> When I extract metadata from a .tar archive which was created by Linux bash, it works as I expect, but if the .tar archive was created by 7z I get an error:
>  TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@4d0f2471 
> I created a project on GitHub for your convenience. It includes 2 files and code to play around with: [https://github.com/AlexOkayJ/apache-tika-tar-issue.git]
>   


