Posted to dev@tika.apache.org by "martin k. (Jira)" <ji...@apache.org> on 2023/11/22 20:05:00 UTC

[jira] [Created] (TIKA-4172) Apple binary file incorrectly identified as text/x-sql due to filename

martin k. created TIKA-4172:
-------------------------------

             Summary: Apple binary file incorrectly identified as text/x-sql due to filename
                 Key: TIKA-4172
                 URL: https://issues.apache.org/jira/browse/TIKA-4172
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 2.9.1
            Reporter: martin k.


This is related to [https://github.com/eikek/docspell/issues/2376] and [https://github.com/eikek/docspell/issues/2403].

Take the following Base64 encoding of a binary, Apple-generated file (I have no idea what the file does). You can recreate the file by piping the text below into e.g. {{base64 -d > something.sql}}:
{code:java}
ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbUJJTgAA AAAAAAAAAAAAAAAAAACCgf+/AAA=
{code}
If this file is named {{something.sql}}, Tika classifies it as {{text/x-sql}}, which it is not. It seems that the filename (extension) is given more weight than the fact that the content is binary.
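
To back up the claim that the content is binary, here is a minimal sketch that uses only the JDK, not Tika. It decodes the payload above and checks for NUL bytes, which plain SQL text would not normally contain. The class name {{BinaryPayloadCheck}} and its methods are made up for illustration, and a NUL-byte check is only a rough heuristic of my own, not a description of how Tika detects types.

```java
import java.util.Base64;

/** Hypothetical helper for illustration only; not part of Tika. */
final class BinaryPayloadCheck {

    // Base64 payload copied from this report. The embedded spaces are kept
    // as-is; the MIME decoder skips characters outside the Base64 alphabet.
    static final String PAYLOAD =
        "ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
        + " AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbUJJTgAA"
        + " AAAAAAAAAAAAAAAAAACCgf+/AAA=";

    /** Decodes the payload into the raw bytes of the attached file. */
    static byte[] decodePayload() {
        return Base64.getMimeDecoder().decode(PAYLOAD);
    }

    /** Rough heuristic (an assumption): any NUL byte means "not plain text". */
    static boolean containsNul(byte[] data) {
        for (byte b : data) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = decodePayload();
        System.out.println("decoded bytes: " + data.length);
        System.out.println("contains NUL bytes: " + containsNul(data));
    }
}
```

The very first decoded byte is already NUL, so a content-based check like this would flag the file as non-text regardless of its {{.sql}} name.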



--
This message was sent by Atlassian Jira
(v8.20.10#820010)