Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2009/09/28 23:40:02 UTC
Super-types for text mime types
Hi all,
I was trying out my mbox parser via the TikaCLI command line tool.
My parser wasn't getting called - rather, the generic text parser was
used.
The problem is that in tika-mimetypes.xml, the application/mbox entry
doesn't specify that it is a subtype of text/plain.
So even though the name detector code correctly generated
application/mbox as the type hint from the .mbox suffix, the hint was
ignored because it wasn't a subtype of text/plain, the type that had
already been derived from the content.
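
To make the behavior concrete, here is a minimal sketch of the rule as described above: a name-based hint is kept only when it is a specialization of the content-based type. The class and method names (HintSketch, declareSubclass, resolve) are hypothetical and are not Tika's actual detector code; the parent map stands in for sub-class-of entries in tika-mimetypes.xml.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Tika's detector implementation.
final class HintSketch {
    // Child type -> declared parent type (stand-in for sub-class-of entries).
    private final Map<String, String> parents = new HashMap<>();

    void declareSubclass(String child, String parent) {
        parents.put(child, parent);
    }

    // True if 'type' equals 'ancestor' or reaches it through declared parents.
    // Assumes the declared parents contain no cycles.
    boolean isSpecializationOf(String type, String ancestor) {
        for (String t = type; t != null; t = parents.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Keep the name-based hint only when it refines the content-based type.
    String resolve(String contentType, String nameHint) {
        return isSpecializationOf(nameHint, contentType) ? nameHint : contentType;
    }

    public static void main(String[] args) {
        HintSketch registry = new HintSketch();
        // Without a declared parent, the .mbox hint is dropped.
        System.out.println(registry.resolve("text/plain", "application/mbox")); // text/plain
        // Once application/mbox is declared a subtype of text/plain, the hint wins.
        registry.declareSubclass("application/mbox", "text/plain");
        System.out.println(registry.resolve("text/plain", "application/mbox")); // application/mbox
    }
}
```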
Easy enough to fix, but in looking through the tika-mimetypes.xml file
I wonder how many other types need similar treatment. For example:
<mime-type type="application/xspf+xml">
  <glob pattern="*.xspf"/>
</mime-type>
If I use the TikaCLI with a test.xspf file, the mime-type it derives
is application/xml, not application/xspf+xml as expected.
One partial fix here would be to extend the MimeTypes.forName
method to check for "+xml" at the end, similar to how it checks for
"text/" at the beginning, and auto-set the parent to application/xml.
-- Ken
--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
Re: Super-types for text mime types
Posted by Ken Krugler <kk...@transpac.com>.
I created TIKA-308 to capture this issue.
Leaving on vacation soon, but should be able to submit a patch in
early November.
-- Ken
On Sep 30, 2009, at 2:38am, Jukka Zitting wrote:
> Hi,
>
> On Mon, Sep 28, 2009 at 11:40 PM, Ken Krugler
> <kk...@transpac.com> wrote:
>> Easy enough to fix, but in looking through the tika-mimetypes.xml file I
>> wonder how many other types need similar treatment.
>
> As mentioned in my previous email, the type registry could do with
> some implicit supertype settings like the following (in order):
>
> * a type with an explicit <sub-class-of/> setting is a specialization
> of the specified type
> * a type with parameters is a specialization of the same type
> without parameters
> * all */*+xml types are specializations of application/xml
> * all text/* types are specializations of text/plain
> * everything is a specialization of application/octet-stream
>
> BR,
>
> Jukka Zitting
Re: Super-types for text mime types
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Mon, Sep 28, 2009 at 11:40 PM, Ken Krugler
<kk...@transpac.com> wrote:
> Easy enough to fix, but in looking through the tika-mimetypes.xml file I
> wonder how many other types need similar treatment.
As mentioned in my previous email, the type registry could do with
some implicit supertype settings like the following (in order):
* a type with an explicit <sub-class-of/> setting is a specialization
of the specified type
* a type with parameters is a specialization of the same type
without parameters
* all */*+xml types are specializations of application/xml
* all text/* types are specializations of text/plain
* everything is a specialization of application/octet-stream
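
The ordered rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: SupertypeSketch, declareSubclass, supertypeOf and chain are hypothetical names, not existing Tika API, and type names are lower-cased as a simplification (parameter values can be case-sensitive in practice).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the implicit supertype rules; not Tika's registry code.
final class SupertypeSketch {
    // Explicit child -> parent declarations (stand-in for <sub-class-of/>).
    private final Map<String, String> explicitParents = new HashMap<>();

    static String normalize(String type) {
        return type.trim().toLowerCase(Locale.ROOT);
    }

    void declareSubclass(String child, String parent) {
        explicitParents.put(normalize(child), normalize(parent));
    }

    // Returns the direct supertype, or null for application/octet-stream.
    String supertypeOf(String rawType) {
        String type = normalize(rawType);
        // 1. An explicit declaration takes precedence.
        String explicit = explicitParents.get(type);
        if (explicit != null) {
            return explicit;
        }
        // 2. A type with parameters specializes the same type without them.
        int semicolon = type.indexOf(';');
        if (semicolon >= 0) {
            return type.substring(0, semicolon).trim();
        }
        // 3. Every */*+xml type specializes application/xml.
        if (type.endsWith("+xml")) {
            return "application/xml";
        }
        // 4. Every text/* type specializes text/plain.
        if (type.startsWith("text/") && !type.equals("text/plain")) {
            return "text/plain";
        }
        // 5. Everything else specializes application/octet-stream.
        return type.equals("application/octet-stream") ? null : "application/octet-stream";
    }

    // Walks from the given type up to the root; stops if a cycle is detected.
    List<String> chain(String rawType) {
        Set<String> seen = new LinkedHashSet<>();
        String current = normalize(rawType);
        while (current != null && seen.add(current)) {
            current = supertypeOf(current);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        SupertypeSketch registry = new SupertypeSketch();
        System.out.println(registry.chain("application/xspf+xml"));
        System.out.println(registry.chain("text/html; charset=UTF-8"));
    }
}
```

With rules like these, application/xspf+xml would be treated as a specialization of application/xml without any explicit entry, while application/mbox would still need an explicit declaration to sit under text/plain.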
BR,
Jukka Zitting