Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2017/08/17 15:59:00 UTC

[jira] [Commented] (TIKA-2443) Plain text file identified as rfc822 and which can cause StackOverflowError

    [ https://issues.apache.org/jira/browse/TIKA-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130726#comment-16130726 ] 

Nick Burch commented on TIKA-2443:
----------------------------------

It doesn't matter what priority we put on the Date magic; it will still win out over text/plain, because text/plain has no magic at all.

With the structure of the match magic, we can't easily say "find one of these headers at the start, then another one of these after a newline in the next xxx bytes" without the magic having so much duplication as to be unmaintainable.

For now, I'd suggest you define a custom mime type for your log file, and create a custom mimetype file with an entry that matches on the date and on the log level. Pop that anywhere you want on your classpath (as long as the file naming is unchanged).
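
As a rough sketch of what such an entry might look like (not a tested definition): the type name text/x-app-log, the log level strings, the 5:512 offset range and the priority of 60 are all assumptions made for illustration, since the exact log format isn't given in the issue. The sketch relies on nested match elements being ANDed and sibling match elements being ORed, so the type is only chosen when "Date:" is at the start and one of the levels appears shortly after it. Check the priority against the rfc822 magic in the tika-mimetypes.xml shipped with your Tika version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <!-- Hypothetical type name for the application log; pick your own -->
  <mime-type type="text/x-app-log">
    <!-- Treat it as a kind of plain text so the text parser handles it -->
    <sub-class-of type="text/plain"/>
    <!-- Priority is an assumption; it needs to beat the rfc822 "Date:" magic -->
    <magic priority="60">
      <!-- "Date:" must be at the very start of the file ... -->
      <match value="Date:" type="string" offset="0">
        <!-- ... AND one of these (assumed) log levels must follow within
             the next few hundred bytes -->
        <match value="INFO" type="string" offset="5:512"/>
        <match value="WARN" type="string" offset="5:512"/>
        <match value="ERROR" type="string" offset="5:512"/>
      </match>
    </magic>
  </mime-type>
</mime-info>
```

As far as I know, Tika looks for this file as org/apache/tika/mime/custom-mimetypes.xml on the classpath, which is why the file name needs to stay the same.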

> Plain text file identified as rfc822 and which can cause StackOverflowError
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-2443
>                 URL: https://issues.apache.org/jira/browse/TIKA-2443
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.11, 1.16
>            Reporter: Viorica Visan
>
> I have a file called test.txt, containing only:
> Date:		06/25/2014 15:54:19
> And some more text I am writing. This will
> be detected as rfc822
> This file is detected and parsed as message/rfc822.
> I think the magic rule on "Date: " is too strong, and the file should have been detected only as text/plain. It looks to me like the reverse of https://issues.apache.org/jira/browse/TIKA-879
> We noticed this issue because we have a large log file with many lines containing Date, Log level and Message. It is parsed as message/rfc822 (only because it starts with "Date:") and eventually throws a StackOverflowError.
> Is there a workaround to make this rule weaker, for example through configuration?
> We use DefaultParser with everything at its defaults. We use Tika 1.11, but we also tried Tika 1.16 and saw the same StackOverflowError (probably again because the file was parsed as rfc822).
> The only workaround that I found was to add a custom-mimetypes.xml like this:
>   <mime-type type="text/plain">
>     <magic priority="70">
>       <match value="Date:" type="string" offset="0"/>
>     </magic>
>   </mime-type>
> Would you recommend some other workaround to make sure the file does not get parsed as rfc822?
> And I have another question: can this custom-mimetypes.xml be specified from an external location?
> Many thanks.


