Posted to dev@tika.apache.org by Robert Burrell Donkin <ro...@gmail.com> on 2009/05/21 19:48:17 UTC

Mime Detection

the documentation could do with an explanation of mime typing best
practice. i'll create a patch once i'm sure i understand it...

please jump in with corrections

- robert

---

A. from the basic user perspective, the quick start way to mime type is to

1. Use MimeTypesFactory#createMimeTypes() to create a MimeTypes with
the default tika configuration
2. if you want just name based heuristics call getMimeType passing a
file, url or name
3. if you want full typing heuristics including magic call getMimeType
passing an input stream
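
To make the difference between steps 2 and 3 concrete, here is a minimal,
self-contained sketch of the two kinds of heuristics. It does not use Tika at
all: the class TypingSketch, its methods and its tiny type tables are
hypothetical stand-ins, and Tika's real configuration and matching rules are
much richer than this.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in sketch: these are NOT Tika classes, only an illustration of
// name based versus magic based typing.
class TypingSketch {

    static final String UNKNOWN = "application/octet-stream";

    // Hypothetical, tiny extension table (an assumption for this example).
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("png", "image/png");
        BY_EXTENSION.put("txt", "text/plain");
    }

    // Name based heuristic: looks only at the extension, never at the content.
    static String typeFromName(String name) {
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return UNKNOWN;
        }
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, UNKNOWN);
    }

    // Magic based heuristic: compares the leading bytes with known signatures.
    static String typeFromMagic(InputStream in) {
        byte[] head = new byte[8];
        int n = 0;
        try {
            // Read up to 8 bytes; a single read() call may return fewer.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (startsWith(head, n, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, n, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        return UNKNOWN;
    }

    // True if the first 'length' bytes of 'data' begin with 'prefix'.
    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch a PDF that has been renamed to notes.txt is typed as
text/plain by typeFromName but as application/pdf by typeFromMagic, which is
the practical reason to prefer the stream based call when the content is
available.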

B. from an advanced user perspective, the heuristics can be customised by

1. passing a different configuration file to
MimeTypesFactory#createMimeTypes(XYZ)
2 & 3 as above

C. developers of new detectors should take a look at the detector
interface and then customise as above

if B or C then the tika team would be very interested in contributions

---

Re: Mime Detection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sat, May 23, 2009 at 8:41 AM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> ok - i'll make a start at writing up some documentation

Cool, thanks!

> should i add it to the bottom of
> http://lucene.apache.org/tika/documentation.html or would a separate
> document be better?

Sooner or later we're going to have to split documentation.html, but
for now I guess it's OK to add new stuff there.

BR,

Jukka Zitting

Re: Mime Detection

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Fri, May 22, 2009 at 10:45 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Thu, May 21, 2009 at 7:48 PM, Robert Burrell Donkin
> <ro...@gmail.com> wrote:
>> A. from the basic user perspective, the quick start way to mime type is to
>>
>> 1. Use MimeTypesFactory#createMimeTypes() to create a MimeTypes with
>> the default tika configuration
>> 2. if you want just name based heuristics call getMimeType passing a
>> file, url or name
>> 3. if you want full typing heuristics including magic call getMimeType
>> passing an input stream
>
> Yeah. That's the original mechanism we've had in place since Tika 0.1.
> It works, but I'm not entirely happy with the current MimeTypes
> mechanism (see TIKA-87 and TIKA-89). Most notably the MimeTypes class
> is hard to configure or extend. I'm hoping to refactor things before
> we reach Tika 1.0.
>
> The current best practice for type detection would be to use the
> Detector interface and the MimeTypes class as a Detector
> implementation. The MimeTypes.detect() method currently contains the
> best detection heuristics we have. That's also what the
> AutoDetectParser is using for automatic type detection.
>
>> B. from an advanced user perspective, the heuristics can be customised by
>>
>> 1. passing a different configuration file to
>> MimeTypesFactory#createMimeTypes(XYZ)
>> 2 & 3 as above
>
> Yep. The type configuration included in Tika is already quite good,
> but there are still lots of details missing. Contributions are
> welcome...
>
> For per-application customizations the current best practice is to
> take a copy of the existing type configuration file from Tika and
> modify it. Note that you'll need to update this copy with each Tika
> upgrade to get the latest improvements. TIKA-87 should solve this
> problem.
>
>> C. developers of new detectors should take a look at the detector
>> interface and then customise as above
>
> We don't yet have a configuration mechanism for Detector
> implementations, but I would still recommend any custom detection
> algorithms to be implemented using the Detector interface. The
> CompositeDetector class makes it easy to combine custom detectors with
> the default functionality in Tika:
>
>    Detector composite = new CompositeDetector(
>        Arrays.asList(new MyCustomDetector(), MimeTypesFactory.create(...)));
>
> The composite detector will use each of the given component detectors
> in sequence and will return the most specific detected media type.

ok - i'll make a start at writing up some documentation

should i add it to the bottom of
http://lucene.apache.org/tika/documentation.html or would a separate
document be better?

- robert

Re: Mime Detection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, May 21, 2009 at 7:48 PM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> A. from the basic user perspective, the quick start way to mime type is to
>
> 1. Use MimeTypesFactory#createMimeTypes() to create a MimeTypes with
> the default tika configuration
> 2. if you want just name based heuristics call getMimeType passing a
> file, url or name
> 3. if you want full typing heuristics including magic call getMimeType
> passing an input stream

Yeah. That's the original mechanism we've had in place since Tika 0.1.
It works, but I'm not entirely happy with the current MimeTypes
mechanism (see TIKA-87 and TIKA-89). Most notably the MimeTypes class
is hard to configure or extend. I'm hoping to refactor things before
we reach Tika 1.0.

The current best practice for type detection would be to use the
Detector interface and the MimeTypes class as a Detector
implementation. The MimeTypes.detect() method currently contains the
best detection heuristics we have. That's also what the
AutoDetectParser is using for automatic type detection.

> B. from an advanced user perspective, the heuristics can be customised by
>
> 1. passing a different configuration file to
> MimeTypesFactory#createMimeTypes(XYZ)
> 2 & 3 as above

Yep. The type configuration included in Tika is already quite good,
but there are still lots of details missing. Contributions are
welcome...

For per-application customizations the current best practice is to
take a copy of the existing type configuration file from Tika and
modify it. Note that you'll need to update this copy with each Tika
upgrade to get the latest improvements. TIKA-87 should solve this
problem.

> C. developers of new detectors should take a look at the detector
> interface and then customise as above

We don't yet have a configuration mechanism for Detector
implementations, but I would still recommend any custom detection
algorithms to be implemented using the Detector interface. The
CompositeDetector class makes it easy to combine custom detectors with
the default functionality in Tika:

    Detector composite = new CompositeDetector(
        Arrays.asList(new MyCustomDetector(), MimeTypesFactory.create(...)));

The composite detector will use each of the given component detectors
in sequence and will return the most specific detected media type.
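
The "in sequence, most specific wins" behaviour can be sketched without any
Tika dependency as follows. This is only an illustration under stated
assumptions: CompositeSketch, SimpleDetector and the PARENT table are
hypothetical stand-ins, media types are plain strings, and the specificity
rule shown here is a simplification rather than the actual CompositeDetector
code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch: simplified types, NOT the Tika Detector or
// CompositeDetector API.
class CompositeSketch {

    static final String OCTET_STREAM = "application/octet-stream";

    // Hypothetical detector shape: sees the leading bytes and an optional name.
    interface SimpleDetector {
        String detect(byte[] head, String name);
    }

    // Assumed "is a more specific kind of" table, child type -> parent type.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("text/plain", OCTET_STREAM);
        PARENT.put("application/xml", "text/plain");
        PARENT.put("application/x-custom+xml", "application/xml");
    }

    // True if 'type' is a strict specialisation of 'base'.
    static boolean isMoreSpecific(String type, String base) {
        // Anything concrete is more specific than the generic fallback.
        if (base.equals(OCTET_STREAM) && !type.equals(OCTET_STREAM)) {
            return true;
        }
        // Otherwise walk up the parent chain looking for 'base'.
        String t = PARENT.get(type);
        while (t != null) {
            if (t.equals(base)) {
                return true;
            }
            t = PARENT.get(t);
        }
        return false;
    }

    // Runs every detector in order, keeping the most specific answer so far.
    static SimpleDetector composite(List<SimpleDetector> detectors) {
        return (head, name) -> {
            String best = OCTET_STREAM;
            for (SimpleDetector d : detectors) {
                String detected = d.detect(head, name);
                if (isMoreSpecific(detected, best)) {
                    best = detected;
                }
            }
            return best;
        };
    }
}
```

In this sketch a custom detector only needs to return the generic
application/octet-stream type when it has no opinion; the composite then falls
back to whatever the other detectors report, so the order of the list does not
change the result as long as the answers lie on the same parent chain.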

BR,

Jukka Zitting