Posted to dev@tika.apache.org by "Josh McCullough (Jira)" <ji...@apache.org> on 2023/10/23 20:41:00 UTC

[jira] (TIKA-3992) Add common missing mimes based on Common Crawl data

    [ https://issues.apache.org/jira/browse/TIKA-3992 ]


    Josh McCullough deleted comment on TIKA-3992:
    ---------------------------------------

was (Author: joshm):
File-type detection for `las` files returns `application/octet-stream`, while `laz` files are detected correctly. Using Tika `2.9.1` with `tika-parsers-standard-package`.

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: mimes.zip
>
>
> In the latest Common Crawl crawl, Tika detects ~600k files as 'octet-stream'. It would be useful to extract those files (even if truncated) and run 'file' and 'siegfried' against the ones whose types are unknown to Tika. We can then prioritize the most common file formats, as identified by 'file' and 'siegfried', for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for `application/zip`: there are likely zip-based file types that we could do a better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.
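
The prioritization step described in the issue could be sketched roughly as below. This is only an illustration, not code from Tika or from this ticket: the function name `rank_unknown_formats` and the record layout (one pair per file holding the Tika-detected type and the label reported by an external tool such as 'file' or 'siegfried') are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record layout: (tika_mime, external_label), one pair per file.
# external_label stands in for whatever 'file' or 'siegfried' reported.
Record = Tuple[str, str]


def rank_unknown_formats(records: Iterable[Record], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count external labels for files Tika could only call octet-stream.

    Returns the top_n (label, count) pairs, most frequent first.
    """
    counts = Counter(
        label
        for tika_mime, label in records
        # Keep only files Tika could not identify, and skip empty labels
        # (cases where the external tool had no answer either).
        if tika_mime == "application/octet-stream" and label
    )
    return counts.most_common(top_n)
```

The ranked list would then suggest which formats to consider first for mime-types.xml; how the records are collected from the crawl is left open here.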



--
This message was sent by Atlassian Jira
(v8.20.10#820010)