Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2009/10/11 17:50:12 UTC

Re: Fall-back parser in AutoDetectParser

I filed TIKA-298 to capture this issue.

Unfortunately, the patch will need to wait until I get back from
vacation (early Nov, I think).

BTW, is there any info on the "ongoing redesign of the mime type  
registry"? The only Jira issue I see is TIKA-89 (minor renaming).

Thanks,

-- Ken

On Sep 30, 2009, at 2:31am, Jukka Zitting wrote:

> Hi,
>
> On Tue, Sep 29, 2009 at 12:08 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> Just for grins, I set up for types with names ending in +xml to
>> automatically get application/xml as the parent mimetype.
>>
>> But when I used TikaCLI to process a test.xspf file, no content was
>> generated.
>>
>> The issue is that CompositeParser.getParser() doesn't use supertypes when
>> falling back - if it can't get a parser for the exact mimetype, then it goes
>> straight to the fallback parser.
>>
>> It seems like it should try to use the mimetype hierarchy. If so, I can file
>> an issue and a patch.
>
> Correct, that would be great.
>
> Note that the MimeType.getSuperType() method already does some
> of this and we have related supertype settings stored in the
> tika-mimetypes.xml configuration. The type registry could also be told
> about the +xml convention and related implicit supertype settings like
> the ones encoded in the MediaType.isSpecializationOf() method.
>
> (Note that we currently have both MimeType and MediaType classes for
> similar purposes. This is due to an ongoing redesign of the mime type
> registry. For now it's probably best to work on the MimeType class
> until the redesign is more complete.)
>
> BR,
>
> Jukka Zitting

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
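
For readers following the thread, the behaviour discussed above (walk up
the supertype chain before giving up and using the fallback parser, with
+xml names implicitly getting application/xml as parent) can be sketched
roughly as below. This is only an illustration under stated assumptions:
the class SupertypeFallbackSketch and its methods are hypothetical, plain
strings stand in for parser objects, and none of this is Tika's actual
CompositeParser or MimeType code, nor the TIKA-298 patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names are Tika's real API.
final class SupertypeFallbackSketch {

    // Explicit child -> parent settings (stand-in for registry configuration).
    private final Map<String, String> explicitSupertypes = new HashMap<>();
    // Media type -> parser name (a String stands in for a real parser object).
    private final Map<String, String> parsers = new HashMap<>();
    private final String fallbackParser;

    SupertypeFallbackSketch(String fallbackParser) {
        this.fallbackParser = fallbackParser;
    }

    void addSupertype(String type, String parent) {
        explicitSupertypes.put(type, parent);
    }

    void addParser(String type, String parser) {
        parsers.put(type, parser);
    }

    // Explicit settings win; otherwise apply the implicit +xml rule.
    Optional<String> supertypeOf(String type) {
        String explicit = explicitSupertypes.get(type);
        if (explicit != null) {
            return Optional.of(explicit);
        }
        if (type.endsWith("+xml")) {
            return Optional.of("application/xml");
        }
        return Optional.empty();
    }

    // Try the exact type first, then each supertype, then the fallback.
    String parserFor(String type) {
        String current = type;
        // Bound the walk so a cyclic supertype setting cannot loop forever.
        for (int depth = 0; current != null && depth < 16; depth++) {
            String parser = parsers.get(current);
            if (parser != null) {
                return parser;
            }
            current = supertypeOf(current).orElse(null);
        }
        return fallbackParser;
    }

    public static void main(String[] args) {
        SupertypeFallbackSketch registry = new SupertypeFallbackSketch("fallback");
        registry.addParser("application/xml", "xml");
        // No parser is registered for the exact type, so the +xml rule applies.
        System.out.println(registry.parserFor("application/xspf+xml"));
        // No parser and no supertype, so the fallback is used.
        System.out.println(registry.parserFor("application/octet-stream"));
    }
}
```

With a lookup along these lines, a type such as application/xspf+xml
would reach the XML parser rather than going straight to the fallback.
The depth limit is an arbitrary guard chosen for the sketch.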


Re: Fall-back parser in AutoDetectParser

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Oct 11, 2009 at 5:50 PM, Ken Krugler
<kk...@transpac.com> wrote:
> I filed TIKA-298 to capture this issue.

Thanks!

> BTW, is there any info on the "ongoing redesign of the mime type registry"?
> The only Jira issue I see is TIKA-89 (minor renaming).

See also TIKA-87 (MimeTypes should allow modification of MIME types)
and a pretty old thread about this at
http://markmail.org/message/27u6gifef2gwobwm.

What I'm hoping to achieve is to refactor our current, fairly
monolithic MIME type registry code into something that would be easier
to reuse and extend both within Tika and in other applications.

BR,

Jukka Zitting