Posted to dev@tika.apache.org by Antoni Mylka <an...@gmail.com> on 2010/12/03 01:07:03 UTC

TikaMimeTypeIdentifier in Aperture

Hello Aperture

(cc tika-dev, may be interesting for you too)

As you know, Tika has made certain advances in the field of mime type
identification which we (Aperture) have wanted to take advantage of
for a long time. This is feature request 3043080, but it also relates
to bug 3025427 and to feature requests 2210328 (ZipContainerDetector),
1838840 and 1650532 (root-XML-based detection). The oldest of these is
almost 4 years old.

That's why I decided to explore the idea of an implementation of the
Aperture MimeTypeIdentifier interface which would delegate the actual
identification to Tika's ContainerAwareDetector, backed by the Tika
MimeTypes class. I started the work in aperture-addons and have now
moved it to aperture-core, to be included in the next release.
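
For the curious, the delegation boils down to something like the
sketch below. This is only an illustration, not the committed code:
the identify() signature follows Aperture's MimeTypeIdentifier
interface as I understand it, and the Detector can be any Tika
detector you plug in (e.g. a container-aware one backed by MimeTypes).

  import java.io.ByteArrayInputStream;
  import java.io.IOException;
  import java.net.URI;
  import org.apache.tika.detect.Detector;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.mime.MediaType;

  public class TikaDelegatingIdentifierSketch {
      private final Detector detector;

      public TikaDelegatingIdentifierSketch(Detector detector) {
          this.detector = detector;
      }

      // Mirrors Aperture's identify(byte[] firstBytes, String fileName, URI uri)
      public String identify(byte[] firstBytes, String fileName, URI uri) {
          Metadata metadata = new Metadata();
          if (fileName != null) {
              // lets Tika's glob (file name) rules take part in the decision
              metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);
          }
          try {
              MediaType type = detector.detect(
                      new ByteArrayInputStream(firstBytes), metadata);
              return type == null ? null : type.toString();
          } catch (IOException e) {
              // cannot really happen with an in-memory stream
              return null;
          }
      }
  }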

This turned out to be (much) more complex than I thought. There were
certain files which Tika recognized better, and certain files which
Aperture recognized better. I submitted 7 issues to the Tika JIRA and
prepared a little hack that allowed me to augment tika-mimetypes.xml
with the knowledge from our mimetypes.xml file. As of now, the only
things that the MagicMimeTypeIdentifier does better than the
TikaMimeTypeIdentifier are:

- support for string patterns in UTF-16 documents, e.g. Tika can't
recognize XML or HTML in a fully UTF-16-encoded file
- support for allowsWhiteSpace before a pattern, e.g. Tika had
problems recognizing the <html> tag if there is some whitespace in
front of it (it now works around that limitation in a good enough way,
though, so it's not actually a problem)
- support for multiple parent types:
   - Quattro Pro 6 used a WordPerfect magic, while later versions used
the Office magics,
   - older Corel Presentations used the WordPerfect magic, newer ones
use the Office one,
   - Works spreadsheets 3.0 used a WordPerfect magic, 4.0 used their
own format, 7.0 uses the Office one.
   The problem with Tika is that it treats all those cases correctly
when only the name is provided, but when both name and bytes are
provided, the byte-based mime type trumps the name-based one, because
the name-based type is not a specialization of the byte-based one (one
type can only have a single parent, so if we say that Office is the
parent of Works, we will only recognize Works 7.0, not 3.0 and 4.0).
- getExtensionsFor(String mimeType), useful in many apps; in Tika the
mime knowledge base is hidden in private fields and
package-protected classes

Yet apart from these minor inconveniences, all of which will probably
disappear in the near future, Tika brings benefits:
- more mime type descriptions,
- "correct" names: either IANA-approved, "proper" vendor-made ones
starting with "vnd.", or "invented" ones starting with "x-",
- detection based on the root XML element (at last we can correctly
detect XHTML docs with an <?xml version="1.0" encoding="utf-8"?>
header),
- better detection of OOXML and OLE docs without a name (thanks to
ZipContainerDetector and PoiContainerDetector), though only slightly:
the ContainerAwareDetector works best with a full file, but we give it
only the first 8KB,
- better plaintext detection, and a couple of other improvements.

I made TikaMimeTypeIdentifier the default choice in ApertureRuntime
and in Aperture's Example Application. Existing apps which use the
MagicMimeTypeIdentifier will not see any difference, though their
authors are advised to take a look at the new implementation. The new
MimeTypeIdentifier uses different names for many mime types. In most
cases these different names are "better", yet they are different and
might require a modification of the client code.

Fixing the four limitations outlined above will require additional
patches to Tika. I wanted to "release" the code now to allow for
testing before the next Aperture release. In the long term, I think
that maintaining two separate mime type identifiers is a bad idea.

So, play with the ApertureRuntime or the CLI apps, try to substitute
"new MagicMimeTypeIdentifier()" with "new TikaMimeTypeIdentifier()",
and see what happens.

Links:

The file with the mime type info which was present in Aperture's
mimetypes.xml but not in tika-mimetypes.xml:
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/main/resources/org/semanticdesktop/aperture/tika/diff-mimetypes.xml

A diff between these two files shows the differences in mime type
identification.
Aperture identification (by name, by data, and by name and data):
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/test/java/org/semanticdesktop/aperture/mime/identifier/magic/ApertureDocumentsIdentificationTest.java
Tika-based identification (only 8KB of each file is taken into
account, tika-mimetypes.xml is enhanced via MimeTypesEnhancer with the
content of diff-mimetypes.xml)
https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/trunk/core/src/test/java/org/semanticdesktop/aperture/tika/TikaMimeTypeIdentifierTest.java

--
Antoni Myłka
antoni.mylka@gmail.com

Re: TikaMimeTypeIdentifier in Aperture

Posted by Maxim Valyanskiy <ma...@jet.msk.su>.
Hello!

On 03.12.2010 03:07, Antoni Mylka wrote:
>   - getExtensionsFor(String mimeType), useful in many apps; in Tika the
> mime knowledge base is hidden in private fields and
> package-protected classes

I think that a getExtension()-for-mime-type method is a good idea. It is useful for 
creating file names for embedded documents in formats that do not store the original 
filename (e.g. OLE-based containers in some cases).

I am not sure about the right place for this method. Maybe add the extensions to 
tika-mimetypes.xml and the method to org.apache.tika.mime.MimeTypes?
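
To illustrate the kind of method I mean, here is a made-up stop-gap
class an application could use today. This is not a Tika API - the
real thing would live in MimeTypes and be driven by tika-mimetypes.xml:

  import java.util.Arrays;
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Hypothetical application-side registry, hand-filled for now
  public class ExtensionRegistry {
      private final Map<String, List<String>> extensions =
              new HashMap<String, List<String>>();

      public void register(String mimeType, String... exts) {
          extensions.put(mimeType, Arrays.asList(exts));
      }

      // Returns the known extensions for a type, or an empty list
      public List<String> getExtensionsFor(String mimeType) {
          List<String> result = extensions.get(mimeType);
          return result != null ? result : Collections.<String>emptyList();
      }
  }

Something like registry.register("application/msword", "doc") would
then give embedded OLE documents a sensible file name.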

best wishes, Max

Re: TikaMimeTypeIdentifier in Aperture

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Antoni,

Awesome! Thanks for the FYI here...

Cheers,
Chris

On Jun 10, 2011, at 5:22 PM, Antoni Mylka wrote:

> On 2010-12-03 01:07, Antoni Mylka wrote:
>> Hello Aperture
>> 
>> (cc tika-dev, may be interesting for you too)
> 
> Brought our TikaMimeTypeIdentifier up to date with the latest Tika 
> trunk. I increased the number of bytes passed to the identifier to 
> 512KB. It's a lot, but these days CPU is cheap. This large buffer will 
> allow most small files to fit in their entirety.
> 
> With this, the container-aware detection really started to work. OOXML, 
> ODT and MS-Office-based formats can now be recognized without the file 
> name, as long as the file is small. In my use cases, most of the unnamed .doc 
> files I encounter come in email attachments or embedded in other docs. They 
> tend to be small and will fit in 512KB, so that's where the most bang 
> for the buck is to be found.
> 
>> - support for string patterns in UTF-16 documents. E.g. Tika can't
>> recognize XML, or HTML in a full UTF-16 file
> 
> Still doesn't work. Do such documents occur in the wild at all? Anyone 
> with more experience in CJK? Do CJK websites use UTF-16? I myself 
> haven't seen one.
> 
>> - support for allowsWhiteSpace before a pattern, e.g. Tika had
>> problems recognizing the <html> tag if there is some whitespace in
>> front of it (now it works around that limitation in a good enough way
>> though, so it's actually not a problem)
> 
> The workaround works.
> 
>> - support for multiple parent types.
>>    - quattro pro 6 used a wordperfect magic, while later ones used
>> office magics,
>>    - older Corel Presentations used wordperfect magic, newer use office,
>>    - works spreadsheets 3.0 used a wordperfect magic, 4.0 used their
>> own format, 7.0 uses office
>>    The problem with Tika is that it treats all those cases correctly
>> when only the name is provided, but when both name and bytes are
>> provided, the byte-based mime type trumps the name-based mime type,
>> because name-based is not a specialization of byte-based (because one
>> type can only have a single parent, so if we say that office is the
>> parent of works, we won't recognize works 3.0 and 4.0 but only 7.0).
> 
> Still doesn't work; I'll see what I can do about that. It's about 
> historical formats though, which rarely occur in practice. We don't have 
> Extractors for them anyway, so it's not much of a real problem.
> 
>>  - getExtensionsFor(String mimeType), useful in many apps; in Tika the
>> mime knowledge base is hidden in private fields and
>> package-protected classes
> 
> This was actually implemented last month. I wrote a 
> getExtensionsFor method in TikaMimeTypeIdentifier.
> 
> With this, I would like to "officially" deprecate the 
> MagicMimeTypeIdentifier in favour of the TikaMimeTypeIdentifier. If 
> nobody objects, I'll add the @deprecated javadoc tag in the near future.
> 
> See http://bit.ly/iTqCs0 for details of what we can do now.
> 
> Tika, you rule.
> 
> Antoni Myłka
> antoni.mylka@gmail.com


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 14 Jun 2011, Antoni Mylka wrote:
> The way I see it, overhead appears if I pass a normal ZIP file, which I 
> can process in a streaming way (limitations notwithstanding). Then when 
> I pass the file to the container detector, it has to be buffered 
> regardless of whether the buffering is necessary.

The detection requires checking potentially all the entries in the zip 
file. I don't think there's any way around that - if the container 
detector is to do any better than "yup, the outer package is a zip file", 
we have to either check for certain entries (random access) or check each 
one in turn (streaming). Only when that's finished can the parser have a 
go, and that'll want to start at the beginning of the file too. So 
buffering is always necessary.
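
For concreteness, the streaming variant of that check looks roughly
like this with commons-compress (just a sketch, not the real
ZipContainerDetector code). Note that it consumes the stream, which is
exactly why the parser then needs a buffered copy to rewind:

  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
  import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;

  public class ZipEntryScan {
      // Walks the zip entries in stream order, looking for a marker entry,
      // e.g. "[Content_Types].xml" for OOXML or "mimetype" for ODF
      public static boolean containsEntry(InputStream zipData, String marker)
              throws IOException {
          ZipArchiveInputStream zip = new ZipArchiveInputStream(zipData);
          ZipArchiveEntry entry;
          while ((entry = zip.getNextZipEntry()) != null) {
              if (marker.equals(entry.getName())) {
                  return true;
              }
          }
          return false;
      }
  }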

> So overhead appears when I detect the data type without text extraction 
> AND when I detect data type on plain, non-ooxml zips which could be 
> streamed.

If you know for certain that a zip file isn't one of the "zip using" 
formats (such as ooxml, odf, iworks etc), then don't pass it to the 
container detector!

The container detector is there for the cases when you don't know that for 
certain, and want to be able to do better than "the first 4 bytes look 
like a zip file".

Nick

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Antoni Mylka <an...@gmail.com>.
On 2011-06-14 16:13, Maxim Valyanskiy wrote:
> Tika detects the data type and extracts text in one pass through the supplied
> input stream. The OOXML parser requires random access to ZIP archive files,
> so there are only two alternatives - to buffer the data in memory or to store
> it on disk. Overhead appears only when you just need to detect the data type
> without text extraction.

The way I see it, overhead appears if I pass a normal ZIP file, which I 
can process in a streaming way (limitations notwithstanding). Then when 
I pass the file to the container detector, it has to be buffered 
regardless of whether the buffering is necessary.

So overhead appears when I detect the data type without text extraction 
AND when I detect data type on plain, non-ooxml zips which could be 
streamed.

Am I right?

Antoni Myłka
antoni.mylka@gmail.com

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Maxim Valyanskiy <ma...@jet.msk.su>.
Hello!

On 14.06.2011 17:53, Antoni Mylka wrote:
> Doesn't the "we'll need to buffer the whole file for zip anyway" boil down to 
> the question of using the commons-compress ZipFile vs. ZipArchiveInputStream? I 
> know that in a general case the zip file format isn't well suited for streaming 
> processing, which makes ZipArchiveInputStream less reliable. The stream can 
> contain entries which aren't supposed to appear in the zip or multiple entries 
> with the same name. Yet if I agree to that, I can crawl 50M zips in email 
> attachments without copying them.
>
> You are already committed to using ZipFile in zip-processing code so using 
> TikaInputStream.getFile() in ZipContainerDetector is not a problem. We stay with 
> ZipArchiveInputStream (for the time being) and would therefore be interested in 
> a stream-based ZipContainerDetector consuming just a few kilobytes, knowing that 
> in certain cases the accuracy may drop, because the entries in a zip are in 
> general unordered.
>
> It's a reliability vs. performance tradeoff. Or am I missing something?
>

Tika detects the data type and extracts text in one pass through the supplied input 
stream. The OOXML parser requires random access to ZIP archive files, so there are 
only two alternatives - to buffer the data in memory or to store it on disk. Overhead 
appears only when you just need to detect the data type without text extraction.
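
In Tika terms, that buffering is what TikaInputStream can do for you -
a minimal sketch, assuming the usual TikaInputStream API:

  import java.io.File;
  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.tika.io.TikaInputStream;

  public class SpoolToFile {
      // getFile() spools the stream to a temporary file if needed, so that
      // random-access consumers (like the OOXML parser) can work on it
      public static File toFile(InputStream in) throws IOException {
          TikaInputStream tis = TikaInputStream.get(in);
          return tis.getFile();
      }
  }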

best wishes, Max

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 14 Jun 2011, Antoni Mylka wrote:
> But the ZipContainerDetector works by looking for the existence of 
> certain entries with known names and in certain cases it reads the 
> content of those entries (e.g. "[Content_Types].xml"). If those entries 
> happen to live at the beginning of the file then you can detect the type 
> by giving the detector a fixed header (of say 512KB).

They often won't be at the start of the file, so I would advise against 
this as a strategy...

> The only drawback is the necessity to create the temp file from this 
> small header. If you're already doing lots of filesystem activity in your 
> app (crawling the files and generating an index), especially on a 
> desktop machine with a single non-SSD disk, then every fs operation you 
> don't make allows the fs cache and disk head to do more stuff in other 
> places. Yet the profits will likely be small, and the cost of rewriting 
> the ZipContainerDetector seems large.

I've a feeling that changing the code to work with either a file or a 
stream isn't too much work with commons compress. Can you open a jira 
enhancement for this? And if you want to work on it, great! Otherwise I'll 
take a look when I can.

Nick

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Antoni Mylka <an...@gmail.com>.
On 2011-06-14 16:11, Nick Burch wrote:
> On Tue, 14 Jun 2011, Antoni Mylka wrote:
>>> We'll need to buffer the whole file for zip either way. The current way
>>> will create a temp file if you start with an input stream (not if you
>>> have
>>> a file already), will scan through the file looking for entries that'll
>>> identify the file. The parser needs the whole file, so if we did a
>>> streaming parse of the file for detection we'd need to have buffered
>>> so we
>>> can rewind for the parser
>>
>> Why?
>>
>> Doesn't the "we'll need to buffer the whole file for zip anyway" boil
>> down to the question of using the commons-compress ZipFile vs.
>> ZipArchiveInputStream?
>
> No, it's because we need two different things to process the zip file.
> Firstly there's the detector, and only once that has finished can the
> parser have a go. Even if we did have streaming parsing, we need to
> buffer the whole contents of the zip so we could rewind and let the
> parser then look at the same zip. In that situation you might as well
> do the buffering with the file, and then get the advantages that zip
> parsing from a file offers.

But the ZipContainerDetector works by looking for the existence of 
certain entries with known names, and in certain cases it reads the 
content of those entries (e.g. "[Content_Types].xml"). If those entries 
happen to live at the beginning of the file, then you can detect the type 
by giving the detector a fixed header (of, say, 512KB). ZipFile seems to 
work well enough with truncated zips, so if an interesting entry fits in 
the part of the file you have, you're done. That's what I'm doing right 
now. With a fixed-size header, I know how much needs to be buffered and 
re-wound before extraction is applied. I accept the fact that there 
exist zip files which will not be recognized correctly this way.

The only drawback is the necessity to create the temp file from this 
small header. If you're already doing lots of filesystem activity in your 
app (crawling the files and generating an index), especially on a 
desktop machine with a single non-SSD disk, then every fs operation you 
don't make allows the fs cache and disk head to do more stuff in other 
places. Yet the profits will likely be small, and the cost of rewriting 
the ZipContainerDetector seems large.

So, in short, what I want to say is:
  1. I want to do detection based on a fixed-size prefix of a zip file, 
with graceful degradation on weird files where the interesting entries 
aren't at the beginning.
  2. Tika can do that already (as far as I understand).
  3. It will create a temp file, but this is a non-issue at the moment.

Antoni Myłka
antoni.mylka@gmail.com

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 14 Jun 2011, Antoni Mylka wrote:
>> We'll need to buffer the whole file for zip either way. The current way
>> will create a temp file if you start with an input stream (not if you have
>> a file already), will scan through the file looking for entries that'll
>> identify the file. The parser needs the whole file, so if we did a
>> streaming parse of the file for detection we'd need to have buffered so we
>> can rewind for the parser
>
> Why?
>
> Doesn't the "we'll need to buffer the whole file for zip anyway" boil 
> down to the question of using the commons-compress ZipFile vs. 
> ZipArchiveInputStream?

No, it's because we need two different things to process the zip file. 
Firstly there's the detector, and only once that has finished can the 
parser have a go. Even if we did have streaming parsing, we'd need to 
buffer the whole contents of the zip so we could rewind and let the parser 
then look at the same zip. In that situation you might as well do the 
buffering with the file, and then get the advantages that zip parsing from 
a file offers.

>> Pass in a TikaInputStream. That supports attaching the opened (and
>> processed) container to the stream, so the parser can re-use it.
>
> I know. I was referring to making the Aperture extractors aware of the fact 
> that they can reuse the NPOIFSFileSystem, which is something I want to 
> implement before we fully migrate. For us, the problem is that we give only 
> the first few KB of a file to the mime type identifier, therefore for larger 
> files the PoiContainerDetector can NOT build a proper poi filesystem for 
> extractors to reuse.

Correct. As there's no requirement for the properties table to live at the 
front of the file, you can't even do nasty shortcut hacks. If you want to 
know what's in the ole2 file, you do have to load the whole thing. At 
least NPOIFSFileSystem makes this quicker and easier than it used to be!
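
For reference, "loading the whole thing" with NPOIFSFileSystem amounts
to roughly this (a sketch):

  import java.io.File;
  import java.io.IOException;
  import org.apache.poi.poifs.filesystem.Entry;
  import org.apache.poi.poifs.filesystem.NPOIFSFileSystem;

  public class OleEntryListing {
      // Opens an OLE2 file and prints the names of its top-level entries,
      // e.g. "WordDocument" or "Workbook"
      public static void listEntries(File ole2File) throws IOException {
          NPOIFSFileSystem fs = new NPOIFSFileSystem(ole2File);
          try {
              for (Entry entry : fs.getRoot()) {
                  System.out.println(entry.getName());
              }
          } finally {
              fs.close();
          }
      }
  }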

Nick

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Antoni Mylka <an...@gmail.com>.
On 2011-06-14 15:02, Nick Burch wrote:
> On Tue, 14 Jun 2011, Antoni Mylka wrote:
>> You are right. There is still room for improvement. ZipContainerDetector
>> creates a temp file, which I'd rather avoid
>
> We'll need to buffer the whole file for zip either way. The current way
> will create a temp file if you start with an input stream (not if you have
> a file already), will scan through the file looking for entries that'll
> identify the file. The parser needs the whole file, so if we did a
> streaming parse of the file for detection we'd need to have buffered so we
> can rewind for the parser

Why?

Doesn't the "we'll need to buffer the whole file for zip anyway" boil 
down to the question of using the commons-compress ZipFile vs. 
ZipArchiveInputStream? I know that in the general case the zip file format 
isn't well suited for streaming processing, which makes 
ZipArchiveInputStream less reliable. The stream can contain entries 
which aren't supposed to appear in the zip, or multiple entries with the 
same name. Yet if I accept that, I can crawl 50M zips in email 
attachments without copying them.

You are already committed to using ZipFile in the zip-processing code, so 
using TikaInputStream.getFile() in ZipContainerDetector is not a 
problem. We stay with ZipArchiveInputStream (for the time being) and 
would therefore be interested in a stream-based ZipContainerDetector 
consuming just a few kilobytes, knowing that in certain cases the 
accuracy may drop, because the entries in a zip are in general unordered.

It's a reliability vs. performance tradeoff. Or am I missing something?

>> and with POI detector, the entire stream is parsed once in detector, and
>> for the second time in the extractor/parser, which is bad for
>> performance
>
> Pass in a TikaInputStream. That supports attaching the opened (and
> processed) container to the stream, so the parser can re-use it.

I know. I was referring to making the Aperture extractors aware of the 
fact that they can reuse the NPOIFSFileSystem, which is something I want 
to implement before we fully migrate. For us, the problem is that we 
give only the first few KB of a file to the mime type identifier; 
therefore, for larger files the PoiContainerDetector can NOT build a 
proper POI filesystem for the extractors to reuse. That's why I'm building 
a generic POI extractor which will get the entire stream, build a proper 
filesystem, and perform the detection and extraction directly from it.
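
Roughly what I have in mind for that generic extractor (a sketch; the
entry names and mime strings below are just the obvious examples, and
in the real thing the same filesystem would then be handed to the
concrete extractor):

  import java.util.HashSet;
  import java.util.Set;
  import org.apache.poi.poifs.filesystem.Entry;
  import org.apache.poi.poifs.filesystem.NPOIFSFileSystem;

  public class GenericPoiSketch {
      // Decide the type from the entry names of an already-built filesystem;
      // the caller keeps the filesystem and reuses it for extraction
      public static String detect(NPOIFSFileSystem fs) {
          Set<String> names = new HashSet<String>();
          for (Entry entry : fs.getRoot()) {
              names.add(entry.getName());
          }
          if (names.contains("WordDocument")) {
              return "application/msword";
          }
          if (names.contains("Workbook")) {
              return "application/vnd.ms-excel";
          }
          if (names.contains("PowerPoint Document")) {
              return "application/vnd.ms-powerpoint";
          }
          // generic OLE2 container as a fallback
          return "application/x-tika-msoffice";
      }
  }

The filesystem itself would come from new NPOIFSFileSystem(stream),
built once from the entire stream.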

Antoni Myłka
antoni.mylka@gmail.com

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 14 Jun 2011, Antoni Mylka wrote:
> You are right. There is still room for improvement. ZipContainerDetector 
> creates a temp file, which I'd rather avoid

We'll need to buffer the whole file for zip either way. The current way 
will create a temp file if you start with an input stream (not if you have 
a file already), and will scan through the file looking for entries that'll 
identify the file. The parser needs the whole file, so if we did a 
streaming parse of the file for detection, we'd need to have buffered it so 
we can rewind for the parser.

> and with POI detector, the entire stream is parsed once in detector, and 
> for the second time in the extractor/parser, which is bad for 
> performance

Pass in a TikaInputStream. That supports attaching the opened (and 
processed) container to the stream, so the parser can re-use it.
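
In code, the pattern looks roughly like this (a sketch, assuming the
getOpenContainer()/setOpenContainer() pair on TikaInputStream):

  import java.io.IOException;
  import java.io.InputStream;
  import org.apache.tika.detect.Detector;
  import org.apache.tika.io.TikaInputStream;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.mime.MediaType;

  public class ContainerReuse {
      // Detection may open the container; the parser can pick it up again
      // from the very same TikaInputStream instead of re-reading the bytes
      public static MediaType detect(Detector detector, InputStream in,
              Metadata metadata) throws IOException {
          TikaInputStream tis = TikaInputStream.get(in);
          MediaType type = detector.detect(tis, metadata);
          Object container = tis.getOpenContainer(); // e.g. an opened POIFS
          if (container != null) {
              System.out.println("detector left us: " + container.getClass());
          }
          // pass 'tis' (with the attached container) on to the parser
          return type;
      }
  }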

Nick

Re: [Aperture-devel] TikaMimeTypeIdentifier in Aperture

Posted by Antoni Mylka <an...@gmail.com>.
On 2011-06-14 11:50, Arjohn Kampman wrote:
> On 11/06/2011 02:22, Antoni Mylka wrote:
>> Brought our TikaMimeTypeIdentifier up to date with the latest Tika
>> trunk. I increased the number of bytes passed to the identifier to
>> 512KB. It's a lot, but these days CPU is cheap. This large buffer will
>> allow for most small files to fit in their entirety.
>
> Hi Antoni,
>
> I don't understand this file size limitation. The way I would imagine
> how the extraction of an MS Word document would/could work is something
> like this:
>
> - Aperture picks up a file.
>
> - The mime type identifier investigates the file's header, detects an
>    Ole header and sets the item's mime type to "application/x-oleobject".
>
> - Aperture passes the ole-tagged data to an Ole processor. This
>    processor detects a Word document, changes the item's mime type to
>    "application/msword" and forwards the data to a MSWord processor.
>
> I suspect that the current mime type identifier also assumes the task of
> interpreting the container format. Is that correct? If so, would it make
> sense to modify this to the above behaviour?

This was a private mail, but I'm replying to both mailing lists - why 
not? Mime type identification is at the core of what we all do.

You are right, there is still room for improvement. The ZipContainerDetector 
creates a temp file, which I'd rather avoid, and with the POI detector the 
entire stream is parsed once in the detector and a second time in the 
extractor/parser, which is bad for performance. The POI detector should 
also be allowed to consume the ENTIRE stream, not just the first few 
kilobytes as it does now.

Moreover, I have seen cases where the PoiFSContainerDetector didn't know 
some ancient Quattro Pro format, but MimeTypes could recognize it 
correctly on the basis of the name. That's why combining the magic-based, 
zip-based and POI-based detection should follow a slightly different 
logic.
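
One possible shape of that "slightly different logic", sketched
against the plain Detector interface (the set of "generic" types below
is only an illustration, not how Tika actually combines detectors):

  import java.io.IOException;
  import java.io.InputStream;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.tika.detect.Detector;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.mime.MediaType;

  public class PreferSpecificDetector implements Detector {
      private static final Set<MediaType> GENERIC = new HashSet<MediaType>(
              Arrays.asList(MediaType.OCTET_STREAM,
                      MediaType.application("zip"),
                      MediaType.application("x-tika-msoffice")));

      private final Detector container;     // zip/poi based detection
      private final Detector nameAndMagic;  // e.g. MimeTypes

      public PreferSpecificDetector(Detector container, Detector nameAndMagic) {
          this.container = container;
          this.nameAndMagic = nameAndMagic;
      }

      public MediaType detect(InputStream input, Metadata metadata)
              throws IOException {
          MediaType fromContainer = container.detect(input, metadata);
          if (fromContainer != null && !GENERIC.contains(fromContainer)) {
              // the container detector found something specific - trust it
              return fromContainer;
          }
          // otherwise fall back to name- and magic-based detection
          return nameAndMagic.detect(input, metadata);
      }
  }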

So as it is now, in Aperture the TikaMimeTypeIdentifier is worse in 
terms of performance (CPU for POI, and disk activity for zip) than the 
MagicMimeTypeIdentifier. The POI detection is also limited to files 
whose size is below a certain arbitrary threshold. Yet the detection 
accuracy is still better.

I'll see what I can do about the reservations outlined above.

Antoni Myłka
antoni.mylka@gmail.com

Re: TikaMimeTypeIdentifier in Aperture

Posted by Antoni Mylka <an...@gmail.com>.
On 2010-12-03 01:07, Antoni Mylka wrote:
> Hello Aperture
>
> (cc tika-dev, may be interesting for you too)

Brought our TikaMimeTypeIdentifier up to date with the latest Tika 
trunk. I increased the number of bytes passed to the identifier to 
512KB. It's a lot, but these days CPU is cheap. This large buffer will 
allow most small files to fit in their entirety.

With this, the container-aware detection really started to work. OOXML, 
ODT and MS-Office-based formats can now be recognized without the file 
name, as long as the file is small. In my use cases, most of the unnamed .doc 
files I encounter come in email attachments or embedded in other docs. They 
tend to be small and will fit in 512KB, so that's where the most bang 
for the buck is to be found.
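
The buffering itself is nothing fancy - essentially this (a sketch of
the idea, not the actual Aperture code):

  import java.io.ByteArrayOutputStream;
  import java.io.IOException;
  import java.io.InputStream;

  public class PrefixReader {
      // Reads at most maxBytes (e.g. 512 * 1024) into a byte array;
      // small files fit completely, large ones are simply cut off
      public static byte[] readPrefix(InputStream in, int maxBytes)
              throws IOException {
          ByteArrayOutputStream buffer = new ByteArrayOutputStream();
          byte[] chunk = new byte[8192];
          int remaining = maxBytes;
          while (remaining > 0) {
              int read = in.read(chunk, 0, Math.min(chunk.length, remaining));
              if (read == -1) {
                  break; // end of a small file
              }
              buffer.write(chunk, 0, read);
              remaining -= read;
          }
          return buffer.toByteArray();
      }
  }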

> - support for string patterns in UTF-16 documents. E.g. Tika can't
> recognize XML, or HTML in a full UTF-16 file

Still doesn't work. Do such documents occur in the wild at all? Anyone 
with more experience in CJK? Do CJK websites use UTF-16? I myself 
haven't seen one.

> - support for allowsWhiteSpace before a pattern, e.g. Tika had
> problems recognizing the <html> tag if there is some whitespace in
> front of it (now it works around that limitation in a good enough way
> though, so it's actually not a problem)

The workaround works.

> - support for multiple parent types.
>     - quattro pro 6 used a wordperfect magic, while later ones used
> office magics,
>     - older Corel Presentations used wordperfect magic, newer use office,
>     - works spreadsheets 3.0 used a wordperfect magic, 4.0 used their
> own format, 7.0 uses office
>     The problem with Tika is that it treats all those cases correctly
> when only the name is provided, but when both name and bytes are
> provided, the byte-based mime type trumps the name-based mime type,
> because name-based is not a specialization of byte-based (because one
> type can only have a single parent, so if we say that office is the
> parent of works, we won't recognize works 3.0 and 4.0 but only 7.0).

Still doesn't work; I'll see what I can do about that. It's about 
historical formats though, which rarely occur in practice. We don't have 
Extractors for them anyway, so it's not much of a real problem.

>   - getExtensionsFor(String mimeType), useful in many apps; in Tika the
> mime knowledge base is hidden in private fields and
> package-protected classes

This was actually implemented last month. I wrote a 
getExtensionsFor method in TikaMimeTypeIdentifier.

With this, I would like to "officially" deprecate the 
MagicMimeTypeIdentifier in favour of the TikaMimeTypeIdentifier. If 
nobody objects, I'll add the @deprecated javadoc tag in the near future.

See http://bit.ly/iTqCs0 for details of what we can do now.

Tika, you rule.

Antoni Myłka
antoni.mylka@gmail.com