Posted to dev@tika.apache.org by "Keith R. Bennett" <kb...@bbsinc.biz> on 2007/10/22 23:00:00 UTC

MIME Type Detection from Byte Header Failing

All -

We're still having the problem that MIME type detection from byte headers is
failing.  I'll try to look into it, but if anyone else could also take a
look, that would be great.

I'm attaching a patch that:

* adds an alternate method for determining the MIME type that calls the new
MimeUtils.getMimeType() method
* calls the regular method and the alternate method and asserts that they
return the same result
* enables only one type of test so that the output is more manageable
* makes the alternate method the reverse of the original one; this seems
simpler to me because if a type is found via the byte header, the other
methods are not attempted; let me know what you think.
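
To illustrate the "byte header first" ordering described in the last bullet,
here is a minimal, self-contained sketch. It is not the code in diag.patch and
does not use Tika's API: the class, the method names, the two signatures, and
the file-name fallback are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch: names are illustrative and are not taken from Tika
// or from diag.patch.
class MagicFirstDetection {

    // Well-known file signatures: "%PDF-" and the 8-byte PNG signature.
    static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    static final byte[] PNG_MAGIC = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns a type if the leading bytes match a known signature, else null.
    static String detectByHeader(byte[] header) {
        if (startsWith(header, PDF_MAGIC)) return "application/pdf";
        if (startsWith(header, PNG_MAGIC)) return "image/png";
        return null;
    }

    // Returns a type based on the file extension, else null.
    static String detectByName(String name) {
        if (name == null) return null;
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) return "application/pdf";
        if (lower.endsWith(".txt")) return "text/plain";
        return null;
    }

    // Byte header first; the other hints are consulted only if it finds nothing.
    static String detect(byte[] header, String name) {
        String type = detectByHeader(header);
        if (type != null) {
            return type;
        }
        type = detectByName(name);
        return type != null ? type : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] unknown = {0x00, 0x01, 0x02};
        System.out.println(detect(pdf, "report.txt"));    // header match wins
        System.out.println(detect(unknown, "notes.txt")); // falls back to the name
        System.out.println(detect(unknown, null));        // generic default
    }
}
```

With this ordering a misleading file name cannot override a header match, which
is the simplification the bullet refers to.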

This patch is not intended to ever be committed, but is just for review.

Thanks,
- Keith



http://www.nabble.com/file/p13352486/diag.patch diag.patch 


-- 
View this message in context: http://www.nabble.com/MIME-Type-Detection-from-Byte-Header-Failing-tf4673629.html#a13352486
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


Re: MIME Type Detection from Byte Header Failing

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On 10/23/07, Keith R. Bennett <kb...@bbsinc.biz> wrote:
> We're still having the problem that MIME type detection from byte headers is
> failing.  I'll try to look into it, but if anyone else could also take a
> look, that would be great.

The main problem is that we simply don't have many MIME magic rules in
the default configuration. Until we have more and better rules, magic
detection won't work, regardless of how we organize things in
AutoDetectParser or MimeTypes/MimeUtils.
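
As a rough illustration of that point, the sketch below uses a tiny rule table
that is entirely hypothetical (it is not Tika's MimeTypes class or its
configuration format). Detection returns nothing for a format until a rule for
it is registered, no matter how the calling code is arranged.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical rule table: an illustration only, not Tika's implementation.
class MagicRuleTable {

    // MIME type -> bytes that must appear at the start of the content.
    private final Map<String, byte[]> rules = new LinkedHashMap<>();

    void addRule(String mimeType, byte[] prefix) {
        rules.put(mimeType, prefix.clone());
    }

    // Returns the first type whose prefix matches, or null if no rule matches.
    String detect(byte[] header) {
        for (Map.Entry<String, byte[]> rule : rules.entrySet()) {
            if (startsWith(header, rule.getValue())) {
                return rule.getKey();
            }
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MagicRuleTable table = new MagicRuleTable();
        table.addRule("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));

        // The 8-byte PNG signature.
        byte[] png = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

        System.out.println(table.detect(png)); // no PNG rule registered yet
        table.addRule("image/png", png);
        System.out.println(table.detect(png)); // now matches the new rule
    }
}
```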

BR,

Jukka Zitting

Re: MIME Type Detection from Byte Header Failing

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.
Chris -

Thanks for offering to look into this.  

Recently I added unit tests to the AutoDetectParserTest class that generate
these failures.  I commented out the tests that didn't work, but left them
there so they could be used later, and hopefully reenabled when they succeed
(please see assertAutoDetect(String resource, String type, String content)). 
If you uncomment them, then the errors will cause the test to fail, and you
will see the behavior I'm describing.

The patch I provided in the previous message calls the MimeUtils and the
AutoDetectParser methods that determine MIME type to illustrate that it is
not just an AutoDetectParser problem.

Here is one way to approach this:

1) Apply the patch in my previous message to a fresh copy of Tika.

2) Remove the "//" from the println calls in getMimeType2() if you'd like to
get debug output.

3) Run the unit test; in IntelliJ IDEA, I just right-click inside the test
method (AutoDetectParserTest.testWord()) and select "Run".  You can also
just run "mvn test" on the command line.

Feel free to get in touch anytime if there's anything else I can do to
clarify or help.

Regards,
Keith


Chris Mattmann wrote:
> 
> Keith:
> 
> Where exactly is it failing? In the unit tests? Or in your code? Could you
> be more specific so that I (or someone else) can track it down?
> 
> Thanks,
>  Chris
> 
> 



Re: MIME Type Detection from Byte Header Failing

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Keith:

Where exactly is it failing? In the unit tests? Or in your code? Could you
be more specific so that I (or someone else) can track it down?

Thanks,
 Chris




______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.