Posted to dev@tika.apache.org by Grant Ingersoll <gs...@apache.org> on 2015/01/02 17:27:01 UTC

Multiple parsers for the same MIME type

Hi,

I'm prototyping a new parser (to be donated) for a file type that already
has a parser.  This parser will only be applicable for certain sub types of
that file type.  How is this best handled in an auto-detection scenario?
Are there hints we can give the MIME detector?  For instance, I think for
this particular file type, it will work best on smaller files or those
containing certain kinds of content.  The upside is it will do a lot more
than the current parser, which only extracts metadata.

Sorry for being vague about what the file type is, but I'm not ready to say
what it is until I get a bit further.

Thanks,
Grant

---
Grant Ingersoll
http://www.lucidworks.com

Re: Multiple parsers for the same MIME type

Posted by Grant Ingersoll <gs...@apache.org>.

Thanks, will check it out.

On Fri, Jan 2, 2015 at 5:07 PM, Jukka Zitting <ju...@gmail.com>
wrote:

> Hi,
>
> 2015-01-02 16:37 GMT-05:00 Grant Ingersoll <gs...@apache.org>:
> > I think the problem is that the file types in question are not discernible
> > by anything other than the actual content, with the big problem being this
> > is an expensive operation.
>
> Right, then approach 2 might work better, or Tyler's suggestion to
> just modify the existing parser.
>
> > I'll poke around here a bit and see if anything stands out.
>
> A related point is the way the POI container detector uses the
> TikaInputStream.get/setOpenContainer() mechanism [1] to pass the
> results of any early heavy lifting from type detection to the parsing
> phase [2].
>
> [1]
> https://tika.apache.org/1.6/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer()
> [2]
> https://github.com/apache/tika/blob/1.6/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java#L385
>
> BR,
>
> Jukka Zitting
>

Re: Multiple parsers for the same MIME type

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

2015-01-02 16:37 GMT-05:00 Grant Ingersoll <gs...@apache.org>:
> I think the problem is that the file types in question are not discernible
> by anything other than the actual content, with the big problem being this
> is an expensive operation.

Right, then approach 2 might work better, or Tyler's suggestion to
just modify the existing parser.

> I'll poke around here a bit and see if anything stands out.

A related point is the way the POI container detector uses the
TikaInputStream.get/setOpenContainer() mechanism [1] to pass the
results of any early heavy lifting from type detection to the parsing
phase [2].

[1] https://tika.apache.org/1.6/api/org/apache/tika/io/TikaInputStream.html#getOpenContainer()
[2] https://github.com/apache/tika/blob/1.6/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java#L385
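
To make the idea concrete, here is a minimal plain-Java sketch of handing the result of an expensive detection step over to the parser. It does not use Tika's classes: StreamSketch, Analysis, DetectThenParseSketch, the "more than half the bytes are 'B'" rule and the application/x-special and application/x-generic type names are all hypothetical stand-ins. Only the get/setOpenContainer() naming is borrowed from TikaInputStream.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a stream that can carry an already-computed object,
// in the spirit of TikaInputStream.getOpenContainer()/setOpenContainer().
class StreamSketch {
    final byte[] data;
    private Object openContainer;

    StreamSketch(byte[] data) { this.data = data; }

    Object getOpenContainer() { return openContainer; }
    void setOpenContainer(Object container) { this.openContainer = container; }
}

// Hypothetical result of the expensive content inspection.
class Analysis {
    final boolean special;
    Analysis(boolean special) { this.special = special; }
}

class DetectThenParseSketch {
    // Counts how often the expensive step actually runs.
    static final AtomicInteger analysisRuns = new AtomicInteger();

    // Placeholder for the costly work: "special" means more than half the bytes are 'B'.
    static Analysis analyse(byte[] data) {
        analysisRuns.incrementAndGet();
        int hits = 0;
        for (byte b : data) {
            if (b == 'B') hits++;
        }
        return new Analysis(hits * 2 > data.length);
    }

    // Detection does the heavy lifting once and leaves the result on the stream.
    static String detect(StreamSketch stream) {
        Analysis analysis = analyse(stream.data);
        stream.setOpenContainer(analysis);
        return analysis.special ? "application/x-special" : "application/x-generic";
    }

    // Parsing reuses the cached result and only recomputes if detection was skipped.
    static String parse(StreamSketch stream) {
        Object cached = stream.getOpenContainer();
        Analysis analysis = (cached instanceof Analysis)
                ? (Analysis) cached
                : analyse(stream.data);
        return analysis.special ? "rich extraction" : "metadata only";
    }

    public static void main(String[] args) {
        StreamSketch stream = new StreamSketch("BBxB".getBytes(StandardCharsets.US_ASCII));
        System.out.println(detect(stream));
        System.out.println(parse(stream));
        System.out.println("expensive runs: " + analysisRuns.get());
    }
}
```

The point of the design is that the costly inspection runs once, during detection, and the parser only repeats it when it is called without a prior detection pass.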

BR,

Jukka Zitting

Re: Multiple parsers for the same MIME type

Posted by Grant Ingersoll <gs...@apache.org>.

On Fri, Jan 2, 2015 at 11:50 AM, Jukka Zitting <ju...@gmail.com>
wrote:

> Hi,
>
> 2015-01-02 11:27 GMT-05:00 Grant Ingersoll <gs...@apache.org>:
> > I'm prototyping a new parser (to be donated) for a file type that already
> > has a parser.  This parser will only be applicable for certain sub types of
> > that file type.  How is this best handled in an auto-detection scenario?
> > Are there hints we can give the MIME detector?
>
> I see two ways to handle this:
>
> 1. The "do the right thing" approach: Tika knows how to handle media
> type hierarchies and optional type parameters when matching the
> detected media type to the appropriate parser. So you could either
> define an extra media type and mark it as a subtype of the more
> generic type (like application/java-archive is to application/zip) or
> add extra type parameters to add more detailed type information (like
> text/plain;charset=utf-8 is to text/plain). You can then define your
> new parser to only accept files of that specific subtype or parameter.
> Once type detection can correctly detect such files, your parser will
> automatically be used to parse them.
>
>
I think the problem is that the file types in question are not discernible
by anything other than the actual content, with the big problem being this
is an expensive operation.

For example purposes, let's say I had a parser that handled JPEGs that are
more than 50% blue with better accuracy than a general-purpose parser.  All
other indicators point to such a file being just another JPEG.  So, ideally,
when I see more than 50% blue, I would choose the smarter parser.  The
challenge, of course, is that you have to do most of the work just to
determine whether the file meets the criteria.  That being said, there may be
heuristics one could employ to determine whether the input is of the desired
form.

I'll poke around here a bit and see if anything stands out.


> 2. The "worse is better" option: The above option requires you to
> define a new subtype or a parameter and to modify the type detection
> mechanism to correctly detect such files. To avoid the extra work, you
> could simply mark your new parser as being able to handle all files of
> the more generic type, and then in your parser include a fallback
> option to call the original Tika parser when encountering a file the
> new parser can't handle.
>
> BR,
>
> Jukka Zitting
>

Re: Multiple parsers for the same MIME type

Posted by Tyler Palsulich <tp...@gmail.com>.

Hi,

Both of Jukka's options look good to me. Another option is to modify the
existing Parser -- extract the extra information when possible, stick with
current behavior if not.

We've run into this problem with images by trying to run OCR and extract
metadata at the same time. Please see TIKA-1445 (
https://issues.apache.org/jira/browse/TIKA-1445) for a more involved
solution.

Best wishes,
Tyler

On Fri, Jan 2, 2015 at 11:50 AM, Jukka Zitting <ju...@gmail.com>
wrote:

> Hi,
>
> 2015-01-02 11:27 GMT-05:00 Grant Ingersoll <gs...@apache.org>:
> > I'm prototyping a new parser (to be donated) for a file type that already
> > has a parser.  This parser will only be applicable for certain sub types of
> > that file type.  How is this best handled in an auto-detection scenario?
> > Are there hints we can give the MIME detector?
>
> I see two ways to handle this:
>
> 1. The "do the right thing" approach: Tika knows how to handle media
> type hierarchies and optional type parameters when matching the
> detected media type to the appropriate parser. So you could either
> define an extra media type and mark it as a subtype of the more
> generic type (like application/java-archive is to application/zip) or
> add extra type parameters to add more detailed type information (like
> text/plain;charset=utf-8 is to text/plain). You can then define your
> new parser to only accept files of that specific subtype or parameter.
> Once type detection can correctly detect such files, your parser will
> automatically be used to parse them.
>
> 2. The "worse is better" option: The above option requires you to
> define a new subtype or a parameter and to modify the type detection
> mechanism to correctly detect such files. To avoid the extra work, you
> could simply mark your new parser as being able to handle all files of
> the more generic type, and then in your parser include a fallback
> option to call the original Tika parser when encountering a file the
> new parser can't handle.
>
> BR,
>
> Jukka Zitting
>

Re: Multiple parsers for the same MIME type

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

2015-01-02 11:27 GMT-05:00 Grant Ingersoll <gs...@apache.org>:
> I'm prototyping a new parser (to be donated) for a file type that already
> has a parser.  This parser will only be applicable for certain sub types of
> that file type.  How is this best handled in an auto-detection scenario?
> Are there hints we can give the MIME detector?

I see two ways to handle this:

1. The "do the right thing" approach: Tika knows how to handle media
type hierarchies and optional type parameters when matching the
detected media type to the appropriate parser. So you could either
define an extra media type and mark it as a subtype of the more
generic type (like application/java-archive is to application/zip) or
add extra type parameters to add more detailed type information (like
text/plain;charset=utf-8 is to text/plain). You can then define your
new parser to only accept files of that specific subtype or parameter.
Once type detection can correctly detect such files, your parser will
automatically be used to parse them.

2. The "worse is better" option: The above option requires you to
define a new subtype or a parameter and to modify the type detection
mechanism to correctly detect such files. To avoid the extra work, you
could simply mark your new parser as being able to handle all files of
the more generic type, and then in your parser include a fallback
option to call the original Tika parser when encountering a file the
new parser can't handle.
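
As a rough illustration of both options, here is a plain-Java sketch that deliberately avoids Tika's own classes. SketchParser, TypeLookupSketch, FallbackParserSketch and MultipleParsersDemo are hypothetical names, and using UnsupportedOperationException as the "I can't handle this file" signal is an assumption made for brevity. Tika's real registry and parser interfaces are richer than this.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical parser abstraction: returns extracted text, or throws
// UnsupportedOperationException when the content is not something it can handle.
interface SketchParser {
    String parse(byte[] content);
}

// Option 1: resolve a parser by walking from the detected type up to its supertypes.
class TypeLookupSketch {
    private final Map<String, String> supertypeOf = new HashMap<>();
    private final Map<String, SketchParser> parsers = new HashMap<>();

    // Ignore whitespace and case so "text/plain; charset=UTF-8" matches its registration.
    static String normalize(String type) {
        return type.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    void declareSubtype(String subtype, String supertype) {
        supertypeOf.put(normalize(subtype), normalize(supertype));
    }

    void register(String type, SketchParser parser) {
        parsers.put(normalize(type), parser);
    }

    // Assumes the declared hierarchy contains no cycles.
    SketchParser parserFor(String detectedType) {
        String type = normalize(detectedType);
        SketchParser exact = parsers.get(type);      // match including parameters
        if (exact != null) return exact;
        int semi = type.indexOf(';');
        if (semi >= 0) type = type.substring(0, semi); // drop parameters
        while (type != null) {
            SketchParser parser = parsers.get(type);
            if (parser != null) return parser;
            type = supertypeOf.get(type);             // climb to the more generic type
        }
        return null;
    }
}

// Option 2: claim the generic type and delegate to the original parser on failure.
class FallbackParserSketch implements SketchParser {
    private final SketchParser specialised;
    private final SketchParser original;

    FallbackParserSketch(SketchParser specialised, SketchParser original) {
        this.specialised = specialised;
        this.original = original;
    }

    @Override
    public String parse(byte[] content) {
        try {
            return specialised.parse(content);
        } catch (UnsupportedOperationException e) {
            return original.parse(content);
        }
    }
}

class MultipleParsersDemo {
    public static void main(String[] args) {
        SketchParser zip = content -> "zip metadata";
        TypeLookupSketch lookup = new TypeLookupSketch();
        lookup.declareSubtype("application/java-archive", "application/zip");
        lookup.register("application/zip", zip);
        // No parser is registered for the subtype, so the supertype's parser is used.
        System.out.println(lookup.parserFor("application/java-archive").parse(new byte[0]));

        SketchParser picky = content -> {
            if (content.length == 0) throw new UnsupportedOperationException("empty input");
            return "rich text";
        };
        SketchParser combined = new FallbackParserSketch(picky, zip);
        System.out.println(combined.parse(new byte[] {1}));
        System.out.println(combined.parse(new byte[0]));
    }
}
```

Passing the content as a byte array sidesteps a practical wrinkle with option 2: with a real input stream, the fallback parser needs to be able to re-read whatever the first parser already consumed.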

BR,

Jukka Zitting