Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2009/09/28 23:40:02 UTC

Super-types for text mime types

Hi all,

I was trying out my mbox parser via the TikaCLI command line tool.

My parser wasn't getting called; the generic text parser was used
instead.

The problem is that in tika-mimetypes.xml, the application/mbox entry  
didn't specify that it was a subtype of text/plain.

So even though the name detector correctly generated application/mbox
as the type hint from the .mbox suffix, the hint was ignored because
application/mbox wasn't a subtype of text/plain, the content-based type
that had been derived earlier.
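
A minimal sketch of the selection behavior described above, assuming a
simple child-to-parent map in place of the real type registry. The class
and method names (HintSelectionSketch, declareParent, choose) are
hypothetical and are not Tika's API; the point is only that the
name-based hint wins when it is a specialization of the content-based
type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; these are not Tika's classes or methods.
class HintSelectionSketch {
    // child type -> declared parent type (stands in for <sub-class-of/> entries)
    private final Map<String, String> parents = new HashMap<>();

    void declareParent(String child, String parent) {
        parents.put(child, parent);
    }

    // True if 'type' equals 'ancestor' or reaches it by following declared parents.
    boolean isSpecializationOf(String type, String ancestor) {
        String current = type;
        // Bounded walk so an accidental cycle in the map cannot loop forever.
        for (int i = 0; current != null && i <= parents.size(); i++) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    // Use the name-based hint only when it refines the content-based type.
    String choose(String contentType, String nameHint) {
        if (nameHint != null && isSpecializationOf(nameHint, contentType)) {
            return nameHint;
        }
        return contentType;
    }

    public static void main(String[] args) {
        HintSelectionSketch registry = new HintSelectionSketch();
        // Without a declared parent the hint is dropped.
        System.out.println(registry.choose("text/plain", "application/mbox"));
        // Once application/mbox is declared a subtype of text/plain, the hint is used.
        registry.declareParent("application/mbox", "text/plain");
        System.out.println(registry.choose("text/plain", "application/mbox"));
    }
}
```

In this sketch, declaring the parent is the only change needed for the
.mbox hint to take effect, which mirrors the fix to the
application/mbox entry.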

Easy enough to fix, but in looking through the tika-mimetypes.xml file  
I wonder how many other types need similar treatment. For example:

   <mime-type type="application/xspf+xml">
     <glob pattern="*.xspf"/>
   </mime-type>

If I use the TikaCLI with a test.xspf file, the mime-type it derives  
is application/xml, not application/xspf+xml as expected.

One partial fix here would be to extend the MimeTypes.forName
method to check for "+xml" at the end, similar to how it checks for  
"text/" at the beginning, and auto-set the parent to application/xml.

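The suffix check proposed above could look roughly like the following
sketch. It is a standalone helper, not a patch to MimeTypes.forName; the
class and method names (ImplicitParentSketch, implicitParent) are
hypothetical, and it assumes type names are plain strings without
parameters.

```java
// Hypothetical illustration; this is not the actual MimeTypes.forName code.
class ImplicitParentSketch {
    // Returns the implied parent type for a name, or null if none is implied.
    static String implicitParent(String name) {
        // Checked first, so a name such as text/foo+xml maps to application/xml.
        if (name.endsWith("+xml")) {
            return "application/xml";
        }
        // text/plain itself must not become its own parent.
        if (name.startsWith("text/") && !name.equals("text/plain")) {
            return "text/plain";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(implicitParent("application/xspf+xml"));
        System.out.println(implicitParent("text/html"));
        System.out.println(implicitParent("application/mbox"));
    }
}
```

Note that application/mbox gets no implied parent here, which is why
this is only a partial fix: that entry would still need an explicit
parent in tika-mimetypes.xml.
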
-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: Super-types for text mime types

Posted by Ken Krugler <kk...@transpac.com>.
I created TIKA-308 to capture this issue.

Leaving on vacation soon, but should be able to submit a patch in  
early November.

-- Ken

On Sep 30, 2009, at 2:38am, Jukka Zitting wrote:

> Hi,
>
> On Mon, Sep 28, 2009 at 11:40 PM, Ken Krugler
> <kk...@transpac.com> wrote:
>> Easy enough to fix, but in looking through the tika-mimetypes.xml file I
>> wonder how many other types need similar treatment.
>
> As mentioned in my previous email, the type registry could do with
> some implicit supertype settings like the following (in order):
>
> * a type with an explicit <sub-class-of/> setting is a specialization
> of the specified type
> * a type with parameters is a specialization of the same type  
> without parameters
> * all */*+xml types are specializations of application/xml
> * all text/* types are specializations of text/plain
> * everything is a specialization of application/octet-stream
>
> BR,
>
> Jukka Zitting



Re: Super-types for text mime types

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Sep 28, 2009 at 11:40 PM, Ken Krugler
<kk...@transpac.com> wrote:
> Easy enough to fix, but in looking through the tika-mimetypes.xml file I
> wonder how many other types need similar treatment.

As mentioned in my previous email, the type registry could do with
some implicit supertype settings like the following (in order):

* a type with an explicit <sub-class-of/> setting is a specialization
of the specified type
* a type with parameters is a specialization of the same type without parameters
* all */*+xml types are specializations of application/xml
* all text/* types are specializations of text/plain
* everything is a specialization of application/octet-stream
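
The ordered rules above can be sketched as a single lookup that tries
each rule in turn. This is an illustration only, assuming types are
handled as plain strings and explicit parents live in a map; the names
(SupertypeRulesSketch, subClassOf, supertypeOf, ancestry) are
hypothetical and do not describe the actual registry code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the ordered supertype rules; not Tika's API.
class SupertypeRulesSketch {
    static final String XML = "application/xml";
    static final String TEXT = "text/plain";
    static final String ROOT = "application/octet-stream";

    // Stands in for explicit <sub-class-of/> settings.
    private final Map<String, String> explicit = new HashMap<>();

    void subClassOf(String type, String parent) {
        explicit.put(type, parent);
    }

    // Returns the immediate supertype, or null for the root type.
    String supertypeOf(String type) {
        String declared = explicit.get(type);
        if (declared != null) {
            return declared;                          // 1. explicit setting
        }
        int semi = type.indexOf(';');
        if (semi >= 0) {
            return type.substring(0, semi).trim();    // 2. same type without parameters
        }
        if (type.endsWith("+xml")) {
            return XML;                               // 3. */*+xml
        }
        if (type.startsWith("text/") && !type.equals(TEXT)) {
            return TEXT;                              // 4. text/*
        }
        return type.equals(ROOT) ? null : ROOT;       // 5. everything else
    }

    // Lists the type followed by its supertypes; stops if a type repeats.
    List<String> ancestry(String type) {
        List<String> chain = new ArrayList<>();
        for (String t = type; t != null && !chain.contains(t); t = supertypeOf(t)) {
            chain.add(t);
        }
        return chain;
    }

    public static void main(String[] args) {
        SupertypeRulesSketch rules = new SupertypeRulesSketch();
        rules.subClassOf("application/mbox", "text/plain");
        System.out.println(rules.ancestry("application/mbox"));
        System.out.println(rules.ancestry("application/xspf+xml; charset=UTF-8"));
    }
}
```

Because the rules are applied in order, an explicit setting always takes
precedence over the implicit ones, and every chain ends at
application/octet-stream.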

BR,

Jukka Zitting