Posted to dev@tika.apache.org by "HARI RAM (Jira)" <ji...@apache.org> on 2020/03/11 09:17:00 UTC

[jira] [Commented] (TIKA-2714) Tika Parse Errors for certain attachments

    [ https://issues.apache.org/jira/browse/TIKA-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056777#comment-17056777 ] 

HARI RAM commented on TIKA-2714:
--------------------------------

[~tallison], it turns out that files compressed with WinRAR's RAR5 format throw this exception, because junrar does not support RAR5 decompression. It also looks like Beothorn is not planning to add RAR5 support.

Is there any scope for adding RAR5 support to Tika by some other means?
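
In the meantime, a caller could check the archive signature before the file reaches RarParser, so that RAR5 input gets a clear "unsupported" result instead of the NullPointerException in the quoted trace. The following is only a minimal sketch using the JDK alone: the class and method names (RarVersionSniffer, detect) are hypothetical and not part of Tika or junrar, and it assumes the signature sits at offset 0, which is not the case for self-extracting archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika or junrar.
public class RarVersionSniffer {

    public enum RarVersion { RAR4, RAR5, NOT_RAR }

    // RAR 1.5 to 4.x signature: "Rar!" 0x1A 0x07 0x00
    private static final byte[] RAR4_MAGIC = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00};
    // RAR 5.0 signature: "Rar!" 0x1A 0x07 0x01 0x00
    private static final byte[] RAR5_MAGIC = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00};

    // Classify a buffer holding the first bytes of a file.
    public static RarVersion detect(byte[] header) {
        if (startsWith(header, RAR5_MAGIC)) {
            return RarVersion.RAR5;
        }
        if (startsWith(header, RAR4_MAGIC)) {
            return RarVersion.RAR4;
        }
        return RarVersion.NOT_RAR;
    }

    // Read up to 8 bytes from the start of the file and classify them.
    public static RarVersion detect(Path file) throws IOException {
        byte[] buf = new byte[RAR5_MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (filled < buf.length) {
                int n = in.read(buf, filled, buf.length - filled);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                filled += n;
            }
        }
        byte[] header = new byte[filled];
        System.arraycopy(buf, 0, header, 0, filled);
        return detect(header);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could skip parsing, or report the file as unsupported, whenever detect(...) returns RAR5. This only identifies the format; it does not decompress anything.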

> Tika Parse Errors for certain attachments
> -----------------------------------------
>
>                 Key: TIKA-2714
>                 URL: https://issues.apache.org/jira/browse/TIKA-2714
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: Suman Moorthy
>            Priority: Major
>
> Tika fails to parse certain attachments that our customers send to our application.
> We got a sample RAR file from our customer that fails parsing; it contains only simple PDF files, and we were able to reproduce the issue.
> However, if we create a new RAR file from the same contents (using WinRAR) and try to parse it, parsing succeeds.
> The RAR file that our customer used is not encrypted or corrupted. We are not sure why their RAR file fails parsing while a new RAR file with the same contents succeeds.
> Can you please provide a solution or feedback to this problem?
>  
> Below is the exception thrown when we try to parse the rar file attachment from our customer:
>  
> Aug 02, 2018 5:04:09 AM com.github.junrar.Archive setFile
> WARNING: exception in archive constructor maybe file is encrypted or currupt
> com.github.junrar.exception.RarException: badRarArchive
>      at com.github.junrar.Archive.readHeaders(Archive.java:250)
>      at com.github.junrar.Archive.setFile(Archive.java:136)
>      at com.github.junrar.Archive.setVolume(Archive.java:581)
>      at com.github.junrar.Archive.<init>(Archive.java:108)
>      at com.github.junrar.Archive.<init>(Archive.java:113)
>      at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:72)
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>      at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
>      at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pkg.RarParser@1372ed45
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
> 05:04:09.488 [main] DEBUG com.actiance.platform.commons.spi.FileReaderUtils - Deleted Temp File - 0a44423c-6fad-47e6-943b-7b56178b0b7f.tmp
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>      at com.actiance.platform.sfab.cis.etl.documentProcessor.internal.DocumentProcessorImpl.getExtractedContent(DocumentProcessorImpl.java:160)
>      at test.TikaParserAPIExample.main(TikaParserAPIExample.java:31)
> Caused by: java.lang.NullPointerException: mainheader is null
>      at com.github.junrar.Archive.isEncrypted(Archive.java:206)
>      at org.apache.tika.parser.pkg.RarParser.parse(RarParser.java:74)
>      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
>      ... 4 more
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)