Posted to dev@tika.apache.org by "Ross Johnson (Jira)" <ji...@apache.org> on 2022/04/26 00:59:00 UTC

[jira] [Commented] (TIKA-3732) Word doc MediaType detected as RTF

    [ https://issues.apache.org/jira/browse/TIKA-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527822#comment-17527822 ] 

Ross Johnson commented on TIKA-3732:
------------------------------------

I took a quick look at the attached file in a hex editor and can confirm that it is an RTF file, despite the .DOC file extension. Tika appears to be detecting the type correctly.
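
For anyone who wants to repeat the hex-editor check programmatically, here is a minimal sketch. It relies on two well-known signatures: RTF files begin with the ASCII characters {\rtf, while binary Word files are OLE2 containers that begin with the bytes D0 CF 11 E0 A1 B1 1A E1. The MagicSniffer class and its method names are hypothetical and not part of Tika, and this is not how Tika's own detector is implemented; it only inspects the first few bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not a Tika class): classifies content by its leading bytes only.
class MagicSniffer {
    // RTF documents start with the ASCII characters {\rtf
    static final byte[] RTF_MAGIC = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound documents (binary .doc, .xls, .ppt) start with this 8-byte signature
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if data begins with every byte of prefix, in order.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns "rtf", "ole2", or "unknown" based on the leading bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, RTF_MAGIC)) return "rtf";
        if (startsWith(head, OLE2_MAGIC)) return "ole2";
        return "unknown";
    }

    // Reads just enough of the file to compare against the longest signature.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MagicSniffer <file>");
            return;
        }
        System.out.println(sniff(Paths.get(args[0])));
    }
}
```

Running this against a file such as the attached example.DOC shows which signature the content carries, regardless of the extension. An "ole2" result only says the file is an OLE2 container; telling Word apart from other OLE2 formats needs a look inside the container.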

> Word doc MediaType detected as RTF
> ----------------------------------
>
>                 Key: TIKA-3732
>                 URL: https://issues.apache.org/jira/browse/TIKA-3732
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.2.1
>            Reporter: Caleb Postlethwait
>            Priority: Major
>         Attachments: example.DOC
>
>
> When executing Detector.detect(InputStream input, Metadata metadata) on a particular Word document, we're getting back a MediaType of RTF which has some downstream effects for us.
> Here's the relevant bit of code:
> TikaConfig config = TikaConfigFactory.getTikaConfig();
> Detector detector = config.getDetector();
> Metadata metadata = new Metadata();
> stream = TikaInputStream.get(fis = new FileInputStream(paths));
> metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, paths);
> *MediaType mediaType = detector.detect(stream, metadata);*
> Attaching the file that we came across this issue on.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)