Posted to user@tika.apache.org by Public Network Services <pu...@gmail.com> on 2012/01/15 14:04:07 UTC

File type detection

Hi...

I am using Tika 0.9 to detect various types of files and formats, but not
getting the expected behavior. More specifically:

   - For XML files, sometimes the returned type is "text/xml" and at
   other times it is "application/xml". The second case happens intermittently
   and has occurred rarely, so it is not reproducible. Perhaps a class loading
   issue?


   - For various application files (e.g., images or MS-Office files) the
   detected type is the generic "application/octet-stream", as opposed to the
   specific MIME type for the application.

The detection is made via a simple call to


new Tika().detect(inputStream);


where "inputStream" is the Java InputStream object used for reading from
the corresponding data file.


Is there any additional configuration (or other usage pattern) needed to
achieve the desired behavior?

Thanks!

Re: File type detection

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 15 Jan 2012, Public Network Services wrote:
> I am using Tika 0.9 to detect various types of files and formats, but not
> getting the expected behavior.

I'd suggest you try a recent nightly build and see if that helps - we've
done quite a bit of detection work since 0.9.

> - For various application files (e.g., images or MS-Office files) the
>  detected type  is the generic "application/octet-stream", as opposed to the
>  specific MIME type for the application.

For office file formats to be properly detected, you'll also need to have
the tika-parsers jar (+ dependencies) on your classpath, so that the extra
detectors are present.
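
As a rough way to confirm whether the parsers jar is actually visible to
your application, something along these lines could be run with the same
classpath. This is only a sketch: the class name TikaClasspathCheck and the
helper isOnClasspath are made up for illustration, and the detector class
name below is an assumption based on where the OLE2 detector lives in
tika-parsers 1.x; other Tika versions may package it differently.

```java
public class TikaClasspathCheck {
    // Assumption: location of the OLE2 (MS Office) container detector in
    // tika-parsers 1.x. Adjust if your Tika version uses another package.
    static final String OFFICE_DETECTOR =
            "org.apache.tika.parser.microsoft.POIFSContainerDetector";

    // Returns true if the named class can be loaded by this class's loader.
    // The class is located but not initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class itself or one of its dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tika-core present:    "
                + isOnClasspath("org.apache.tika.Tika"));
        System.out.println("tika-parsers present: "
                + isOnClasspath(OFFICE_DETECTOR));
    }
}
```

If the second line reports false, detection can only fall back on what
tika-core provides, which would be consistent with getting the generic
"application/octet-stream" for office files.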

> The detection is made via a simple call to
>
> new Tika().detect(inputStream);

It's worth double checking with the tika-app jar and the --detect flag;
that'll let you verify whether a detection problem is really a Tika one or
a problem with your setup (e.g. missing jars).

Nick