Posted to user@tika.apache.org by Public Network Services <pu...@gmail.com> on 2012/01/15 14:04:07 UTC
File type detection
Hi...
I am using Tika 0.9 to detect various types of files and formats, but not
getting the expected behavior. More specifically:
- For XML files, sometimes the returned type is "text/xml" and some
other times it is "application/xml". The second case happens intermittently
and has occurred rarely, so it is not reproducible. Perhaps a class loading
issue?
- For various application files (e.g., images or MS-Office files) the
detected type is the generic "application/octet-stream", as opposed to the
specific MIME type for the application.
The detection is made via a simple call to
new Tika().detect(inputStream);
where "inputStream" is the Java InputStream object used for reading from
the corresponding data file.
Is there any additional configuration (or other usage pattern) needed to
achieve the desired behavior?
Thanks!
Re: File type detection
Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 15 Jan 2012, Public Network Services wrote:
> I am using Tika 0.9 to detect various types of files and formats, but not
> getting the expected behavior.
I'd suggest you try a recent nightly build, and see if that helps - we've
done quite a bit of detection work since 0.9
> - For various application files (e.g., images or MS-Office files) the
> detected type is the generic "application/octet-stream", as opposed to the
> specific MIME type for the application.
For office file formats to be properly detected, you'll need to also have
the tika parsers jar (+ dependencies) in your classpath, so that the extra
detectors are present
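A quick way to see whether the parsers jar is actually visible to your
application is to ask the class loader for a class that lives in it. The
sketch below uses only the JDK. The two class names are assumptions on my
part (the second is the OLE2 container detector found in tika-parsers in
later releases), so adjust them to match the Tika version you are running.

```java
public class TikaClasspathCheck {

    // Assumed class names, not taken from the thread: the first lives in
    // tika-core, the second is a detector shipped in the tika-parsers jar
    // in later releases. Change both to suit the Tika version in use.
    static final String CORE_CLASS = "org.apache.tika.Tika";
    static final String PARSERS_CLASS =
            "org.apache.tika.parser.microsoft.POIFSContainerDetector";

    /** Returns true if the named class can be loaded (without initializing it). */
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class is not visible at all.
            // LinkageError (e.g. NoClassDefFoundError): the class file is
            // there, but something it needs in order to load is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tika-core present:    " + isOnClasspath(CORE_CLASS));
        System.out.println("tika-parsers present: " + isOnClasspath(PARSERS_CLASS));
    }
}
```

Run it with the same classpath as the application. If the second line
reports false, a missing parsers jar (or one of its dependencies) is a
likely explanation for the "application/octet-stream" results.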
> The detection is made via a simple call to
>
> new Tika().detect(inputStream);
It's worth double-checking with the tika-app jar and its --detect flag;
that'll let you verify whether a detection problem is really a Tika one, or
a problem with your setup (e.g. missing jars)
Nick