Posted to dev@tika.apache.org by "Nick Burch (Jira)" <ji...@apache.org> on 2020/06/04 05:04:00 UTC

[jira] [Commented] (TIKA-3106) Tika Fails to detect some EML files if extension is not .eml

    [ https://issues.apache.org/jira/browse/TIKA-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125545#comment-17125545 ] 

Nick Burch commented on TIKA-3106:
----------------------------------

This email starts with a series of long {{ARC-}} headers, which means that the "normal" email headers appear much later in the file than is typical.

I've added a match for this in {{1e02f0181}}, which allows an ARC signature header to be matched as the first header, as we already did for a DKIM header. With that commit in place, your file is detected from its contents alone.

Can you please try a nightly build, or override the Tika mime types file, with your files and see whether any other first email headers are still being missed by detection?
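
As a rough illustration of what matching on the first header means, here is a minimal standalone sketch in plain Java. It is not Tika's implementation and does not use the Tika API: the class name {{EmailHeaderSniffer}}, the method {{looksLikeEmail}} and the list of header names are all assumptions made for this example. The real patterns live in Tika's mime types file.

```java
import java.util.Locale;

public class EmailHeaderSniffer {

    // Hypothetical list of header names that may open a message.
    // This is an assumption for illustration, not Tika's actual magic list.
    private static final String[] FIRST_HEADERS = {
        "received:", "return-path:", "message-id:", "from:", "mime-version:",
        "dkim-signature:",
        // ARC header names as defined in RFC 8617
        "arc-seal:", "arc-message-signature:", "arc-authentication-results:"
    };

    // Returns true if the very first line of the content starts with
    // one of the known header names (compared case-insensitively).
    static boolean looksLikeEmail(String content) {
        int end = content.indexOf('\n');
        String firstLine = (end >= 0 ? content.substring(0, end) : content)
                .toLowerCase(Locale.ROOT);
        for (String header : FIRST_HEADERS) {
            if (firstLine.startsWith(header)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String arcFirst = "ARC-Seal: i=1; a=rsa-sha256; cv=none\r\nFrom: someone@example.com\r\n";
        String plain = "Hello, this is just a text file.\n";
        System.out.println(looksLikeEmail(arcFirst));
        System.out.println(looksLikeEmail(plain));
    }
}
```

If the list contained only the "normal" headers and the DKIM one, a file that opens with an ARC header would fall through to text/plain, which matches the behaviour reported here.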

> Tika Fails to detect some EML files if extension is not .eml
> ------------------------------------------------------------
>
>                 Key: TIKA-3106
>                 URL: https://issues.apache.org/jira/browse/TIKA-3106
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, mime
>    Affects Versions: 1.24
>            Reporter: Xiaohong Yang
>            Priority: Critical
>         Attachments: EmlFile.txt
>
>
> I have an eml file that is detected as message/rfc822 only if the file extension is .eml; otherwise it is detected as text/plain. The following is the code that I use to detect the file type and extension.
>        TikaConfig config = TikaConfigFactory.getTikaConfig();
>        Detector detector = config.getDetector();
>        Metadata metadata = new Metadata();
>        TikaInputStream stream = TikaInputStream.get(fis = new FileInputStream(filePath));
>        metadata.add(Metadata.RESOURCE_NAME_KEY, filePath);
>        MediaType mediaType = detector.detect(stream, metadata);
>        MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
>        String tikaExtension = mimeType.getExtension();
>  
> When the sample file has the .eml extension, mimeType is message/rfc822 and tikaExtension is .eml. When I change the extension to .txt, mimeType is text/plain and tikaExtension is .txt.
>  
> The same mimeType and tikaExtension should be detected regardless of the file extension.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)