Posted to dev@tika.apache.org by "Eduardas Kazakas (Jira)" <ji...@apache.org> on 2022/08/08 17:58:00 UTC

[jira] [Created] (TIKA-3833) bzip2 MIME type is detected as bzip instead when using tika-core

Eduardas Kazakas created TIKA-3833:
--------------------------------------

             Summary: bzip2 MIME type is detected as bzip instead when using tika-core
                 Key: TIKA-3833
                 URL: https://issues.apache.org/jira/browse/TIKA-3833
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 2.4.1
            Reporter: Eduardas Kazakas


Hello, I'm having a bit of a problem when using the tika-core module (v2.4.1).
I am trying to detect the MIME type of a bzip2 file and, instead of
application/x-bzip2, I am getting application/x-bzip. I believe it has
something to do with the mime-type definitions in the
tika-mimetypes.xml file.

<mime-type type="application/x-bzip">
  <magic priority="40">
    <match value="BZh" type="string" offset="0"/>
  </magic>
  <glob pattern="*.bz"/>
  <glob pattern="*.tbz"/>
</mime-type>

<mime-type type="application/x-bzip2">
  <sub-class-of type="application/x-bzip"/>
  <_comment>Bzip 2 UNIX Compressed File</_comment>
  <magic priority="40">
    <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
  </magic>
  <glob pattern="*.bz2"/>
  <glob pattern="*.tbz2"/>
  <glob pattern="*.boz"/>
</mime-type>

Both entries have priority 40. I believe the priority of
application/x-bzip2 should be higher, because the string value "BZh" is a
prefix of the hex value: "\x42\x5a\x68" = "BZh", so both magics match a
bzip2 file.
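
The overlap can be checked without Tika. The following is only a minimal sketch: the object name MagicOverlap, the helper matchesAtStart and the sample header bytes are hypothetical and made up for illustration, and the code does not reproduce how Tika itself evaluates magic entries. It only shows that a stream starting with the application/x-bzip2 magic also satisfies the application/x-bzip magic.

```scala
import java.nio.charset.StandardCharsets

object MagicOverlap {
  // Magic values copied from the two tika-mimetypes.xml entries quoted in this report.
  val bzipMagic: Array[Byte] = "BZh".getBytes(StandardCharsets.US_ASCII)
  val bzip2Magic: Array[Byte] = Array(0x42, 0x5a, 0x68, 0x39, 0x31).map(_.toByte)

  // True when `data` begins with every byte of `magic` (i.e. a match at offset 0).
  def matchesAtStart(data: Array[Byte], magic: Array[Byte]): Boolean =
    data.length >= magic.length && data.take(magic.length).sameElements(magic)

  def main(args: Array[String]): Unit = {
    // Hypothetical header: the bzip2 magic followed by two arbitrary payload bytes.
    val header = bzip2Magic ++ Array[Byte](0x41, 0x59)

    println(new String(bzip2Magic, StandardCharsets.US_ASCII)) // prints "BZh91"
    println(matchesAtStart(header, bzipMagic))                 // prints "true"
    println(matchesAtStart(header, bzip2Magic))                // prints "true"
  }
}
```

Since both patterns match the same bytes and both carry priority 40, the priority alone gives no basis for preferring the more specific type.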

Am I missing something here? Does this look like a bug, or does it
work as intended? Is there some hint I can provide to the default
detector?

A small example in Scala:
{code:java}
import org.apache.tika.config.TikaConfig
import org.apache.tika.detect.DefaultProbDetector
import org.apache.tika.metadata.Metadata

import java.io.{BufferedInputStream, File, FileInputStream}

object AAA {
  def main(args: Array[String]): Unit = {
    val config = TikaConfig.getDefaultConfig

    val file = new File("/home/ekazakas/test.csv.bz2")
    val detector = new DefaultProbDetector()
    // Detection uses the stream content only: the Metadata is empty, so no file name is passed.
    val stream = new BufferedInputStream(new FileInputStream(file))
    val mediaType =
      try detector.detect(stream, new Metadata)
      finally stream.close() // always release the file handle
    val mimeType = config.getMimeRepository.forName(mediaType.toString)
    println(mimeType)
  }
}
{code}
This prints `application/x-bzip` instead of `application/x-bzip2`.

 
--
This message was sent by Atlassian Jira
(v8.20.10#820010)