Posted to dev@tika.apache.org by Litrik De Roy <li...@gmail.com> on 2008/02/01 15:22:32 UTC

AutoDetectParser and MS Office formats

All,

I started working on the Eclipse plug-in that I have mentioned earlier
but I ran into a problem with the AutoDetectParser.

It does not seem to recognize any of the MS Office file formats. They
are all reported as "application/octet-stream", with no other
metadata. All other file formats work OK.

I tested this with the test files included in the
src\test\resources\test-documents directory.

My source looks like this:

----8<--------8<--------8<--------8<--------8<--------8<----
private AutoDetectParser parser = new AutoDetectParser();
private Metadata metadata = new Metadata();
...
parser.parse(stream, new DefaultHandler(), metadata);
----8<--------8<--------8<--------8<--------8<--------8<----

I'm running Java 1.6.0_03 on Windows.

Is there anything special that must be done to get POI to work?

-- 
Litrik De Roy
Norio ICT Consulting - http://www.norio.be/

Re: AutoDetectParser and MS Office formats

Posted by Litrik De Roy <li...@litrik.com>.
On Feb 1, 2008 3:34 PM, Jukka Zitting <ju...@gmail.com> wrote:
>
> On Feb 1, 2008 4:22 PM, Litrik De Roy <li...@gmail.com> wrote:
> > I started working on the Eclipse plug-in that I have mentioned earlier
> > but I ran into a problem with the AutoDetectParser.
> >
> > [...]
> > Is there anything special that must be done to get POI to work?
>
> We currently don't have any magic header matchers for Microsoft Office
> file formats, so the only thing AutoDetectParser can use to detect the
> file type is the file name suffix.
>
> Do you have the file name available to your plugin? You can feed the
> file name to AutoDetectParser like this:
>
>     AutoDetectParser parser = new AutoDetectParser();
>     InputStream stream = ...;
>     ContentHandler handler = ...;
>     Metadata metadata = new Metadata();
>     metadata.set(Metadata.RESOURCE_NAME_KEY, ...);
>     parser.parse(stream, handler, metadata);
>

That does the trick. Thanks.

-- 
Litrik De Roy
Norio ICT Consulting - http://www.norio.be/
litrik@norio.be - 0475 873235

Re: AutoDetectParser and MS Office formats

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Feb 1, 2008 4:22 PM, Litrik De Roy <li...@gmail.com> wrote:
> I started working on the Eclipse plug-in that I have mentioned earlier
> but I ran into a problem with the AutoDetectParser.
>
> It does not seem to recognize any of the MS Office file formats. They
> all return "application/octet-stream" as content type, but no
> metadata. All other file formats work OK.
> [...]
> Is there anything special that must be done to get POI to work?

We currently don't have any magic header matchers for Microsoft Office
file formats, so the only thing AutoDetectParser can use to detect the
file type is the file name suffix.

Do you have the file name available to your plugin? You can feed the
file name to AutoDetectParser like this:

    AutoDetectParser parser = new AutoDetectParser();
    InputStream stream = ...;
    ContentHandler handler = ...;
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, ...);
    parser.parse(stream, handler, metadata);

BR,

Jukka Zitting
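
To illustrate why the file name matters here, below is a minimal
standalone sketch of suffix-based detection with an
"application/octet-stream" fallback. It is not Tika's code: the class
SuffixDetectionSketch, its methods, and the three-entry suffix table are
hypothetical and exist only for illustration. It also includes a check
for the eight-byte signature that OLE2 compound documents start with,
which is the kind of magic header matcher the reply above says was
missing. That signature alone cannot tell Word, Excel and PowerPoint
files apart, so a real matcher would need to look further into the file.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch for illustration only; this is NOT Tika's detection code.
class SuffixDetectionSketch {

    // Fallback type reported when nothing more specific is known.
    static final String OCTET_STREAM = "application/octet-stream";

    // First eight bytes of an OLE2 compound document (.doc, .xls, .ppt).
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Illustrative suffix table (an assumption); a real detector would
    // load a much larger mapping.
    static final Map<String, String> BY_SUFFIX = Map.of(
        "doc", "application/msword",
        "xls", "application/vnd.ms-excel",
        "ppt", "application/vnd.ms-powerpoint");

    // Maps a resource name to a media type using only its suffix.
    // Without a name, or with an unknown suffix, only the fallback is left.
    static String detectByName(String resourceName) {
        if (resourceName == null) {
            return OCTET_STREAM;
        }
        int dot = resourceName.lastIndexOf('.');
        if (dot < 0 || dot == resourceName.length() - 1) {
            return OCTET_STREAM;
        }
        String suffix = resourceName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_SUFFIX.getOrDefault(suffix, OCTET_STREAM);
    }

    // Returns true if the given leading bytes carry the OLE2 signature.
    static boolean startsWithOle2Magic(byte[] header) {
        if (header == null || header.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (header[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, detectByName(null) yields the fallback type, which
mirrors the behaviour described in the original question, while
detectByName("testWORD.doc") (a made-up file name) yields
"application/msword".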