Posted to dev@tika.apache.org by "Nick Burch (Jira)" <ji...@apache.org> on 2020/06/03 13:05:00 UTC

[jira] [Commented] (TIKA-3105) OFT format detection based on file name (extension) instead of file content

    [ https://issues.apache.org/jira/browse/TIKA-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124934#comment-17124934 ] 

Nick Burch commented on TIKA-3105:
----------------------------------

At a quick glance, those first 4 bytes aren't unique enough. There look to be a few common variations, all of which could be mistaken for text or another type. However, some sort of mask on the next few bytes might be enough, especially if we can find some patterns in the entries of the table records that follow.

Maybe something based on finding one of the listed 4-byte values at the start (0x00010000, 0x4F54544F = 'OTTO', 'true', or 'typ1'), then a couple of the required table names plus plausible-looking offsets in the next few hundred bytes?
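
As a rough illustration of that idea (not Tika's actual detector, and not a proposed patch), a check along these lines might work. The class name SfntSniffer, its method names, and the minRequired threshold are all made up for this sketch. The layout it assumes comes from the OpenType spec: a 12-byte header with numTables at byte 4, followed by 16-byte table records of tag, checksum, offset and length. The rule that an offset must not point back into the table directory is my own reading of "plausible".

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Tika.
final class SfntSniffer {
    // Table tags the OpenType spec lists as required for every font.
    private static final Set<String> REQUIRED = new HashSet<>(Arrays.asList(
            "cmap", "head", "hhea", "hmtx", "maxp", "name", "OS/2", "post"));

    // True if the first 4 bytes are 0x00010000, 'OTTO', 'true' or 'typ1'.
    static boolean hasKnownVersion(byte[] data) {
        if (data.length < 4) {
            return false;
        }
        int v = ByteBuffer.wrap(data).getInt(0); // ByteBuffer is big-endian by default
        return v == 0x00010000 || v == 0x4F54544F   // 'OTTO'
                || v == 0x74727565                  // 'true'
                || v == 0x74797031;                 // 'typ1'
    }

    // Heuristic: known version, then at least minRequired required table tags
    // among the table records visible in the prefix, each with an offset that
    // does not point back into the table directory (an assumption of this sketch).
    static boolean looksLikeSfnt(byte[] data, int minRequired) {
        if (data.length < 12 || !hasKnownVersion(data)) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(data);
        int numTables = buf.getShort(4) & 0xFFFF;
        if (numTables == 0) {
            return false;
        }
        int directoryEnd = 12 + 16 * numTables;
        int found = 0;
        for (int i = 0; i < numTables; i++) {
            int rec = 12 + 16 * i;
            if (rec + 16 > data.length) {
                break; // only inspect records fully contained in the prefix we were given
            }
            String tag = new String(data, rec, 4, StandardCharsets.ISO_8859_1);
            long offset = buf.getInt(rec + 8) & 0xFFFFFFFFL;
            if (offset < directoryEnd) {
                return false; // implausible: table data would overlap the directory
            }
            if (REQUIRED.contains(tag)) {
                found++;
            }
        }
        return found >= minRequired;
    }
}
```

Calling SfntSniffer.looksLikeSfnt(prefix, 2) on the first few hundred bytes of a file would then reject plain text that merely starts with "OTTO" or "true", because the bytes after it are unlikely to form table records with required tags. Whether two tags is the right threshold is a guess and would need checking against real fonts.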



> OFT format detection based on file name (extension) instead of file content
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-3105
>                 URL: https://issues.apache.org/jira/browse/TIKA-3105
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24
>            Reporter: Ondřej Duchoň
>            Priority: Major
>
> There is a problem with detecting the OTF file format. Unlike the TTF format, which is detected from the content of the file (so the file name does not need the .ttf extension), the OTF format is not. As soon as the .otf extension is removed from the file name, detection fails and the content type is reported as "application/octet-stream".
>  
> I found a simple way to detect OTF files here:
> [https://docs.microsoft.com/en-us/typography/opentype/spec/otff#organization-of-an-opentype-font]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)