Posted to dev@tika.apache.org by Nick Burch <ni...@alfresco.com> on 2010/06/15 19:25:13 UTC

Detecting container formats

Hi All

I've been thinking about TIKA-391 (intermittent incorrect mime type 
detection of office formats), and I think we might need to do something 
different for container formats.

At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), 
and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, 
.ots, .sxw etc), I don't think the current method works well. AFAICT, we
detect the container, then have sub-class matches that try to look for
the appropriate children by hoping we can guess where the definition might
hide within the container. However, I think this is too unreliable - for
example, with a .doc file, the entry for the Word stream can come anywhere
in the list of top level entries, so it is very hard to find reliably
without properly parsing the OLE2 structure.

So, I'd like to suggest a slightly different approach, one of loading the 
container format to decide the mime type. This will, of course, make the 
detection step slower and more memory hungry for detecting these (but only 
these) kinds of documents. However, provided that we keep the open 
container around and pass it to the parser in a later step, it's work we 
would've done anyway.

I'd then see the mime process be something like:
* Loop over all magic rules
   * If the magic fits and the file extension fits, pick this one
   * Otherwise if the magic fits and it's a container:
     * Load the container
     * Check the top level entries against our list for that container
     * If we get a hit, pick that
     * If nothing hits, assume it's just the container

eg we have a file with the zip magic, but no / unreliable filename.
  We open the zip file and look at the top level directory entries.
  If we spot [Content_Types].xml and /xl/ we know it's an OOXML Excel file
  If we spot meta.xml and mimetype then read mimetype and go from there
  ...
  Else decide it's just a zipfile of files, and handle appropriately
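
The zip case above could be sketched roughly like this. It is Python rather than Tika's Java, purely to keep it short, and it is not Tika code: the function name detect_zip_type, the word/ and ppt/ entries (added by analogy with /xl/) and the application/zip fallback are all assumptions made for illustration. Error handling for corrupt zips is left out.

```python
import io
import zipfile

# Top-level directory -> mime type, checked only once [Content_Types].xml is seen.
OOXML_DIRS = {
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_zip_type(data: bytes) -> str:
    """Guess a mime type from the entry names of a zip held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        # OOXML: [Content_Types].xml plus a format-specific directory.
        if "[Content_Types].xml" in names:
            for prefix, mime in OOXML_DIRS.items():
                if any(name.startswith(prefix) for name in names):
                    return mime
        # meta.xml and mimetype present: the mimetype entry holds the type as text.
        if "mimetype" in names and "meta.xml" in names:
            return zf.read("mimetype").decode("ascii").strip()
    # Nothing recognised: treat it as a plain zip of files.
    return "application/zip"
```

Note that entry names inside a zip carry no leading slash, so the check is for "xl/" rather than "/xl/".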

What does everyone else think? Is the extra work in the mime detection 
step (but only for container formats with no reliable filename) worth it 
for the improved detection?

note - the issue of reliably picking the right mime type when given a
  filename with a useful extension still needs to be solved, but largely
  wouldn't be affected by this

Nick

Re: Detecting container formats

Posted by Nick Burch <ni...@alfresco.com>.

On Thu, 17 Jun 2010, Max Valjanski wrote:
> I tried to do that, but I found that this does not fit into Tika 
> architecture. It is required to read whole file to parse OLE-container.

Yup, I've found much the same thing. My idea was to have a new detector 
that you can layer in between the others, which will parse the containers 
and keep them around if needed. If you don't want it, skip it from the 
chain.
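
As a rough picture of that layering (a sketch only, in Python rather than Java, and not the patch attached to TIKA-447): each detector is a function that returns a type or nothing, and the container-aware one is just an optional extra link. The names magic_detector, zip_container_detector and detect are made up for this example, and the sketch does not try to keep the opened container around for the parser.

```python
import io
import zipfile
from typing import Callable, Optional, Sequence

Detector = Callable[[bytes], Optional[str]]


def magic_detector(data: bytes) -> Optional[str]:
    """Cheap first pass: look only at the leading bytes."""
    if data.startswith(b"PK\x03\x04"):
        return "application/zip"
    return None


def zip_container_detector(data: bytes) -> Optional[str]:
    """Expensive optional pass: open the zip and inspect its entries."""
    if not data.startswith(b"PK\x03\x04"):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return None
    if "[Content_Types].xml" in names and any(n.startswith("word/") for n in names):
        return "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    return None


def detect(data: bytes, chain: Sequence[Detector]) -> str:
    """Run each detector in order; a later, more specific answer wins."""
    best = "application/octet-stream"
    for detector in chain:
        answer = detector(data)
        if answer is not None:
            best = answer
    return best
```

Leaving zip_container_detector out of the chain gives the cheap behaviour: the same file then comes back as plain application/zip.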

I'm not sure if what I've done makes sense, but I've attached a patch that
demos the idea to TIKA-447. Do people think the idea is worth pursuing
further, or should we try something different?

Nick

Re: Detecting container formats

Posted by Max Valjanski <ma...@jet.msk.su>.

Hello!

Nick Burch wrote:
> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd 
> etc), and for ZIP based files (.zip, but also .xlsx, .pptx, .docx, 
> .odf, .odt, .ots, .sxw etc), I don't think the current method works 
> well. AFAICT,
> we detect the container, then have sub-class matches that try to look 
> for the appropriate children by hoping we can guess where the 
> definition might hide within the container. However, I think this is 
> too unreliable - for example, with a .doc file, the entry for the Word 
> stream can come anywhere in the list of top level entries, so is very 
> hard to reliably find without properly parsing the OLE2 structure
>
I tried to do that, but I found that it does not fit into the Tika
architecture. Parsing an OLE container requires reading the whole file,
but Tika works with streams, so we can either

1) remove streaming support and work only with files (or save the stream
into a temporary file before processing), or
2) parse the OLE container during mime-type detection and pass it on to
the text extractor (parser)

I do not like the first solution, but the second requires architecture
changes in Tika.

Anyway, I wrote type detection code for OLE in TIKA-437.

best wishes, Max

Re: Detecting container formats

Posted by Nick Burch <ni...@alfresco.com>.

On Tue, 15 Jun 2010, Ken Krugler wrote:
> I think this is a reasonable approach, as long as (per Alex's suggestion) 
> it's configurable in various ways.
>
> E.g. if you know you don't want to parse OLE2-based files, so you've 
> removed jars for those parser, then it would be great to have an easy 
> way of disabling the (more expensive) mime-type detection, and 
> potentially avoid the dependency on these same jars.

Avoiding the expensive detection shouldn't be too hard, as long as we can
figure out what to return for the mime type when we don't do the detailed
parsing.

Avoiding the jars might be a bit more tricky, but with a little bit of
wrapping and some catching of ClassNotFoundException we should probably be
able to manage it.

Anyone know how we could best pass the open zip / poifs objects back
from the detector so the parsers can re-use them?

Nick

Re: Detecting container formats

Posted by Alex Ott <al...@gmail.com>.

Hello

Ken Krugler  at "Tue, 15 Jun 2010 11:56:51 -0700" wrote:
 KK> I think this is a reasonable approach, as long as (per Alex's suggestion) it's
 KK> configurable in various ways.

 KK> E.g. if you know you don't want to parse OLE2-based files, so you've removed jars for
 KK> those parser, then it would be great to have an easy  way of disabling the (more
 KK> expensive) mime-type detection, and  potentially avoid the dependency on these same jars.

 KK> Separately, I think this issue might also trigger improvements to the existing "magic
 KK> bytes" detection code in Tika. IIRC, we wound up  adding full regex with some additional
 KK> matching rules in Krugle, to  extend the (from Nutch, same as Tika) mime-type detection
 KK> code to  better handle things like source code files. I imagine something  similar might
 KK> be needed to reliably handle container matching.

I'm not sure Tika needs full regex support. In my experience in this area,
most mime type detection tasks only need a search function plus a dynamic
addressing function (for example, find the Zip signature somewhere, and
then use a mix of getByte(offset) calls to check other values)
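
A small sketch of that search-then-address idea, in Python and with made-up names (first_zip_entry_name is not a Tika function): find the Zip local file header signature anywhere in the buffer, then read fields at fixed offsets relative to the match. The offsets used (name length at +26, name at +30) follow the Zip local file header layout.

```python
import struct
from typing import Optional

ZIP_LOCAL_HEADER = b"PK\x03\x04"


def first_zip_entry_name(data: bytes) -> Optional[str]:
    """Find the first zip local file header and return its entry name."""
    # Search step: the signature may sit anywhere in the buffer.
    start = data.find(ZIP_LOCAL_HEADER)
    if start < 0 or start + 30 > len(data):
        return None
    # Addressing step: fields live at fixed offsets from the match.
    (name_len,) = struct.unpack_from("<H", data, start + 26)
    name = data[start + 30 : start + 30 + name_len]
    return name.decode("utf-8", errors="replace")
```

This only sees the first entry, so it helps for formats that put a telling entry first and not for the general container case.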

For source code it's better to use something like naive bayes - it works
well (as I remember from tests that we ran 6 years ago)...

-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott

Re: Detecting container formats

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Ken, and all,

FWIW, Tika can handle full regex on glob patterns now via the isregex attribute that I added way back when in TIKA-194 [1].

[1] https://issues.apache.org/jira/browse/TIKA-194

Cheers,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Detecting container formats

Posted by Ken Krugler <kk...@transpac.com>.

I think this is a reasonable approach, as long as (per Alex's  
suggestion) it's configurable in various ways.

E.g. if you know you don't want to parse OLE2-based files, so you've  
removed jars for those parser, then it would be great to have an easy  
way of disabling the (more expensive) mime-type detection, and  
potentially avoid the dependency on these same jars.

Separately, I think this issue might also trigger improvements to the  
existing "magic bytes" detection code in Tika. IIRC, we wound up  
adding full regex with some additional matching rules in Krugle, to  
extend the (from Nutch, same as Tika) mime-type detection code to  
better handle things like source code files. I imagine something  
similar might be needed to reliably handle container matching.

-- Ken


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Detecting container formats

Posted by Alex Ott <al...@gmail.com>.

Re

Nick Burch  at "Wed, 16 Jun 2010 12:01:48 +0100 (BST)" wrote:
 NB> On Tue, 15 Jun 2010, Alex Ott wrote:
 >> Hmmm, WordDocument stream in .doc could be only under / directory entry, but yes - it
 >> could anywhere in list of OLE2 entries...

 NB> And the list of ole2 entries can come anywhere in the file - the header block contains a
 NB> pointer to the block holding the entries, which is normally near the start but isn't
 NB> required to be...

 NB> Detecting OLE2 or Zip with magic seems easy enough, but as mentioned it's whats inside
 NB> them that I don't think magic + a few regexps on the first few kbs will cut it :/

Yep, for OLE2 we need to get the whole file and generate the list of entries
in it.  For Zip, we also need to get the whole file, but it could be enough
to read the list of entries, although sometimes we need to read some files
from the archive to get the correct mime type (odf, {doc,ppt,xls}x, ...)

I'm not sure how best to implement this in Tika; I need to look into the
sources.  One possibility is to create a hierarchy of container processors,
each of which sets the corresponding subtype of the container, and this
value is then used in the mime-type description. Something like

if (string at 0 = "PK\x03\x04" and subtype == 10)
then mimetype = "application/java-archive"
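
One way the rule above might look when spelled out, as a Python sketch rather than anything from the Tika sources. The subtype labels, the RULES table and the choice of META-INF/MANIFEST.MF as the jar marker are assumptions for illustration; a string label stands in for the numeric subtype in the pseudo-rule.

```python
import io
import zipfile
from typing import Optional

ZIP_MAGIC = b"PK\x03\x04"


def zip_subtype(data: bytes) -> Optional[str]:
    """Container processor: label a zip by its contents (labels are made up)."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return None
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return None


# (magic at offset 0, required subtype or None for "any", resulting mime type)
RULES = [
    (ZIP_MAGIC, "jar", "application/java-archive"),
    (ZIP_MAGIC, None, "application/zip"),
]


def match(data: bytes) -> Optional[str]:
    """Return the first rule whose magic and subtype both fit."""
    subtype = zip_subtype(data) if data.startswith(ZIP_MAGIC) else None
    for magic, wanted, mime in RULES:
        if data.startswith(magic) and (wanted is None or wanted == subtype):
            return mime
    return None
```

Rule order matters here: the specific jar rule has to come before the generic zip rule.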


Re: Detecting container formats

Posted by Nick Burch <ni...@alfresco.com>.

On Tue, 15 Jun 2010, Alex Ott wrote:
> Hmmm, WordDocument stream in .doc could be only under / directory entry, 
> but yes - it could anywhere in list of OLE2 entries...

And the list of ole2 entries can come anywhere in the file - the header 
block contains a pointer to the block holding the entries, which is 
normally near the start but isn't required to be...
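
To make the pointer concrete, here is a hedged Python sketch of reading it from the OLE2 header. The function name directory_offset is made up, and the field positions (sector shift at byte 30, first directory sector at byte 48) are taken from my reading of the compound file header layout, so check them against the format spec before relying on this.

```python
import struct
from typing import Optional

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def directory_offset(header: bytes) -> Optional[int]:
    """Return the byte offset of the first directory sector, or None if not OLE2."""
    if len(header) < 52 or not header.startswith(OLE2_MAGIC):
        return None
    # Sector size is stored as a power of two (9 means 512-byte sectors).
    (sector_shift,) = struct.unpack_from("<H", header, 30)
    # Sector number of the first block of directory entries.
    (first_dir_sector,) = struct.unpack_from("<I", header, 48)
    sector_size = 1 << sector_shift
    # Sector 0 starts right after the header, which fills the first sector.
    return (first_dir_sector + 1) * sector_size
```

Since the sector number can be anything, the returned offset can land well past the first few kilobytes of the file.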

Detecting OLE2 or Zip with magic seems easy enough, but as mentioned, when
it comes to what's inside them I don't think magic + a few regexps on the
first few kbs will cut it :/

> Maybe it would useful to make this configurable? Sometimes it's useful 
> to force media type detection by magic only, not by extension (for 
> example, file could be renamed)...

IIRC, if you don't set the filename in the Metadata object that you pass 
into the detector, then it can't use the file extension!

Not sure how you could best turn it off though, short of a config that 
would disable the loading of ole2 and zip files (and maybe other 
containers in the future), but then what (if any) would we return for the 
mimetype? Maybe just a generic one?

Nick

Re: Detecting container formats

Posted by Alex Ott <al...@gmail.com>.

Hello

Nick Burch  at "Tue, 15 Jun 2010 18:25:13 +0100 (BST)" wrote:
 NB> Hi All

 NB> I've been thinking about TIKA-391 (intermittent incorrect mime type detection of office
 NB> formats), and I think we might need to do something different for container formats.

 NB> At the moment, for OLE2 based files (.xls, .ppt, .doc, .msg, .vsd etc), and for ZIP based
 NB> files (.zip, but also .xlsx, .pptx, .docx, .odf, .odt, .ots, .sxw etc), I don't think the
 NB> current method works well. AFAICT,
 NB> we detect the container, then have sub-class matches that try to look for the appropriate
 NB> children by hoping we can guess where the definition might hide within the
 NB> container. However, I think this is too unreliable - for example, with a .doc file, the
 NB> entry for the Word stream can come anywhere in the list of top level entries, so is very
 NB> hard to reliably find without properly parsing the OLE2 structure

Hmmm, the WordDocument stream in a .doc can only be under the / directory
entry, but yes - it could be anywhere in the list of OLE2 entries...

 NB> So, I'd like to suggest a slightly different approach, one of loading the container format
 NB> to decide the mime type. This will, of course, make the detection step slower and more
 NB> memory hungry for detecting these (but only these) kinds of documents. However, provided
 NB> that we keep the open container around and pass it to the parser in a later step, it's
 NB> work we would've done anyway.

 NB> I'd then see the mime process be something like:
 NB> * Loop over all magic rules
 NB>   * If the magic fits and the file extension fits, pick this one
 NB>   * Otherwise if the magic fits and it's a container:
 NB>     * Load the container
 NB>     * Check the top level entries against our list for that container
 NB>     * If we get a hit, pick that
 NB>     * If nothing hits, assume it's just the container

Maybe it would be useful to make this configurable? Sometimes it's useful
to force media type detection by magic only, not by extension (for example,
the file could have been renamed)...
