Posted to dev@tika.apache.org by "Luís Filipe Nassif (Jira)" <ji...@apache.org> on 2019/11/19 04:15:00 UTC

[jira] [Comment Edited] (TIKA-2986) Edge case (?) in file type detection

    [ https://issues.apache.org/jira/browse/TIKA-2986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977050#comment-16977050 ] 

Luís Filipe Nassif edited comment on TIKA-2986 at 11/19/19 4:14 AM:
--------------------------------------------------------------------

Hi [~tallison],

It is not an edge case: I have run into this situation several times while processing our forensic corpus, which consists of deleted and recovered data that is often corrupted or totally overwritten. Like you, I think it makes sense to use extension detection only for mimetypes that have no registered magic, or to refine parent mimetypes previously detected via magic.

But it is a dramatic change in behaviour and has to be carefully tested...

[~nick], the .fdf found by Tim is probably just garbage data and will not have any magic, like the deleted and overwritten data my organization often needs to work with.


was (Author: lfcnassif):
Hi [~tallison],

It is not an edge case, I have found this situation several times processing our forensic corpus, formed by deleted and recovered data that is often corrupted or totally overwritten. I think exactly like you, that it makes sense to just use extension detection for mimetypes that do not have registered magics, or to refine parent mimetypes previously detected based on magics.

But it is a dramatic change in behaviour and has to be carefully tested...

[~nick], probably the .fdf found by Tim is nothing, just garbage data, and will not have any magic, like the deleted and overwritten data my organization often needs to work with.

> Edge case (?) in file type detection
> ------------------------------------
>
>                 Key: TIKA-2986
>                 URL: https://issues.apache.org/jira/browse/TIKA-2986
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>
> I recently came across a file that was identified as an Acrobat fdf file.  The particular file was some kind of binary file with a ".fdf" extension, but not an Acrobat fdf.  
> Our current MimeTypes algorithm runs magic detection first, and then it tries to use the file extension.  If the file extension suggests a child mime type of what was found via magic, the child type is used.  The problem with this file was that the magic {{%FDF-}} was not found, so the magic step yielded {{application/octet-stream}}, and then the type for the file extension, which was ".fdf", was selected because {{application/vnd.fdf}} is a child of {{application/octet-stream}}.
> It feels like we might want to add a rule that if a mime definition has a defined magic and that magic is not found, we should not then fall back to the file extension. Or, is there a better way to prevent this from happening? Or, is this just an edge case that we should ignore?
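
The behaviour described in the issue, and the rule being proposed, can be illustrated with a small self-contained sketch. This is not Tika's MimeTypes code: the class DetectionSketch, the TypeDef holder, the ".xyz" type and the requireMagicIfDefined flag are all hypothetical names made up for illustration, and the parent check only looks at the direct parent. It only models "magic first, then extension if it names a child type", plus an optional switch that refuses the extension when the candidate type defines a magic that was not matched.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DetectionSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    /** Toy type definition (hypothetical); a null magic means "no registered magic". */
    static final class TypeDef {
        final String name;
        final String parent;
        final byte[] magic;

        TypeDef(String name, String parent, String magic) {
            this.name = name;
            this.parent = parent;
            this.magic = magic == null ? null : magic.getBytes(StandardCharsets.US_ASCII);
        }
    }

    // Toy registry keyed by extension; ".xyz" is an invented type without magic.
    static final Map<String, TypeDef> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".fdf", new TypeDef("application/vnd.fdf", OCTET_STREAM, "%FDF-"));
        BY_EXTENSION.put(".xyz", new TypeDef("application/x-hypothetical", OCTET_STREAM, null));
    }

    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Step 1: magic detection; falls back to the root type when nothing matches. */
    static String detectByMagic(byte[] data) {
        for (TypeDef t : BY_EXTENSION.values()) {
            if (t.magic != null && startsWith(data, t.magic)) {
                return t.name;
            }
        }
        return OCTET_STREAM;
    }

    static TypeDef byExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? null : BY_EXTENSION.get(fileName.substring(dot).toLowerCase(Locale.ROOT));
    }

    /**
     * Step 2: use the extension only if it names a (direct) child of the magic result.
     * With requireMagicIfDefined set, a type that defines a magic is never chosen
     * by extension alone, which is the rule floated in the issue.
     */
    static String detect(byte[] data, String fileName, boolean requireMagicIfDefined) {
        String byMagic = detectByMagic(data);
        TypeDef ext = byExtension(fileName);
        if (ext == null || !ext.parent.equals(byMagic)) {
            return byMagic;
        }
        if (requireMagicIfDefined && ext.magic != null) {
            return byMagic; // magic is defined for this type but was not found
        }
        return ext.name;
    }

    public static void main(String[] args) {
        byte[] garbage = {0x00, 0x01, 0x02};
        System.out.println("current:  " + detect(garbage, "form.fdf", false));
        System.out.println("proposed: " + detect(garbage, "form.fdf", true));
        System.out.println("no magic: " + detect(garbage, "data.xyz", true));
    }
}
```

In this toy model the garbage ".fdf" file is reported as application/vnd.fdf without the flag and stays application/octet-stream with it, while the ".xyz" type, which has no registered magic, is still resolved by extension in both modes. A file that really starts with %FDF- is unaffected because the magic step already identifies it.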


