Posted to dev@tika.apache.org by Antoni Mylka <an...@aduna-software.com> on 2010/08/17 17:57:41 UTC

Working with multiple mime type definition files

Hi,

The Tika mime type detection code has improved greatly since I last 
looked at it a while ago. The root-XML-based detection and 
ContainerAwareDetector are things we (Aperture) have wanted to do 
ourselves since at least 2007 but never got round to it :)

Unfortunately there are many subtle differences between the mime 
definition files which would break existing Aperture applications. 
Therefore I'd like to implement a temporary solution that would work in 
the interim and allow for gradual migration.

first create a normal MimeTypes
mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");

then delete some definitions with
mimeTypes.deleteMimeType("application/vnd.ms-outlook")
// in tika this is an msg file
// in aperture this is a pst file - clearly wrong, but...

and then read our definitions file
new MimeTypesReader(mimeTypes).read(inputStreamFromOurFile);

Questions:
0. Does this make sense? Am I missing something?
1. there is no deleteMimeType method. Is it possible to delete a mime 
type definition from a MimeTypes instance? I just wanted to ask before 
trying to implement it myself.
2. the MimeTypesReader class is not public. Is there any particular 
reason for that? The code seems to augment, not replace the definitions 
so it seems suitable for our use case, but the reader is not public.
3. It seems that there is a rule that all minor types either begin with 
x- or are IANA-approved. Please confirm.
4. It also seems that your mime definition file is not related to the 
one at freedesktop.org, I mean, there are no policies like "First submit 
to freedesktop, wait until they approve and commit and then update the 
tika definitions". Please confirm.

Antoni Myłka
antoni.mylka@aduna-software.com




Re: Working with multiple mime type definition files

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Antoni,

> 
> The Tika mime type detection code has improved greatly since I last
> looked at it a while ago. The root-XML-based detection and
> ContainerAwareDetector are things we (Aperture) have wanted to do
> ourselves since at least 2007 but never got round to it :)

Thanks! 

> 
> Unfortunately there are many subtle differences between the mime
> definition files which would break existing Aperture applications.
> Therefore I'd like to implement a temporary solution that would work in
> the interim and allow for gradual migration.
> 
> first create a normal MimeTypes
> mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
> 
> then delete some definitions with
> mimeTypes.deleteMimeType("application/vnd.ms-outlook")
> // in tika this is an msg file
> // in aperture this is a pst file - clearly wrong, but...
> 
> and then read our definitions file
> new MimeTypesReader(mimeTypes).read(inputStreamFromOurFile);
> 
> Questions:
> 0. Does this make sense? Am I missing something?

It makes sense if you want to programmatically manage the media types rather
than curate them in XML outside of the Tika application. Another option
would be to provide easy means for refreshing the Detector interface for a
Parser, or just in general (it's possible to do this now but involves lower
level APIs that should probably be better insulated).

> 1. there is no deleteMimeType method. Is it possible to delete a mime
> type definition from a MimeTypes instance? I just wanted to ask before
> trying to implement it myself.

Yeah, there isn't a deleteMimeType or an editMimeType. We never really
provided CRUD-type operations, just what was needed from a reader
perspective. Maybe it makes sense to implement this now, but it would be
great not to clutter the existing reader-focused APIs with these methods.
Instead, we could create something like a MimeTypesWriter or
MimeTypesEditor interface and put those methods there.
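
To make the idea concrete, here is a minimal sketch of how editing
operations might be kept apart from the lookup side. It does not use any
Tika classes: MimeTypeLookup, MimeTypesEditor, InMemoryRegistry and their
methods are hypothetical names made up for illustration, and a plain map
stands in for the real registry. The usage in main mirrors the
delete-then-override steps from your mail.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class MimeTypesEditorSketch {

    /** Hypothetical read-only view, standing in for the reader-focused API. */
    interface MimeTypeLookup {
        Optional<String> describe(String typeName);
    }

    /** Hypothetical editing API, kept separate from the lookup interface. */
    interface MimeTypesEditor {
        void addMimeType(String typeName, String description);

        /** Returns true if a definition with that name existed and was removed. */
        boolean deleteMimeType(String typeName);
    }

    /** Toy registry backed by a map; not how Tika stores its definitions. */
    static final class InMemoryRegistry implements MimeTypeLookup, MimeTypesEditor {
        private final Map<String, String> types = new LinkedHashMap<>();

        @Override
        public Optional<String> describe(String typeName) {
            return Optional.ofNullable(types.get(typeName));
        }

        @Override
        public void addMimeType(String typeName, String description) {
            types.put(typeName, description);
        }

        @Override
        public boolean deleteMimeType(String typeName) {
            return types.remove(typeName) != null;
        }
    }

    public static void main(String[] args) {
        InMemoryRegistry registry = new InMemoryRegistry();
        MimeTypesEditor editor = registry;   // code that edits sees only this view
        MimeTypeLookup lookup = registry;    // code that reads sees only this view

        // Base definition, as in the Tika file (an msg file).
        editor.addMimeType("application/vnd.ms-outlook", "Outlook message (.msg)");

        // Drop the conflicting definition, then apply the overriding one.
        editor.deleteMimeType("application/vnd.ms-outlook");
        editor.addMimeType("application/vnd.ms-outlook", "Outlook personal folders (.pst)");

        System.out.println(lookup.describe("application/vnd.ms-outlook").orElse("unknown"));
    }
}
```

Callers that only detect types would depend on the lookup interface alone,
so the editing methods would not leak into the existing reader-side code.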

> 2. the MimeTypesReader class is not public. Is there any particular
> reason for that? The code seems to augment, not replace the definitions
> so it seems suitable for our use case, but the reader is not public.

Yeah, same rationale as for #1 on this.

> 3. It seems that there is a rule that all minor types either begin with
> x- or are IANA-approved. Please confirm.

That's the way we've curated so far. But just because that's Tika's
approach doesn't mean that it's the only (or most correct) one.

> 4. It also seems that your mime definition file is not related to the
> one at freedesktop.org, I mean, there are no policies like "First submit
> to freedesktop, wait until they approve and commit and then update the
> tika definitions". Please confirm.

It's related in that they are formatted similarly. However, the process for
curating media types is insulated from outside entities, which I actually
see as a very good thing. That way Tika can serve to bring together existing
curation efforts, but not let those bog down the ability to move forward
with code, and with writing applications that take advantage of these
features.

HTH,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++