Posted to user@tika.apache.org by naskoo <ma...@gmail.com> on 2012/09/23 18:33:57 UTC

Problem detecting Microsoft Office formats from InputStream

Hi all,

I have tried detecting several file formats, and so far the Microsoft
Office ones are the only ones that are not detected accurately.

When Tika's detect(File file) method is used, MS Office files are detected
as follows, which I assume is the expected result:
doc - "application/msword"
docx - "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

But when Tika's detect(InputStream is) method is used, the picture is not
the same. The results are:
doc - "application/x-tika-msoffice"
docx - "application/x-tika-ooxml"

The test files were created with MS Office 2007.
I couldn't work out why I get different results for the same files.
Please let me know if I am doing something wrong, or if there is a good
reason for this behaviour.

Best Regards,
Nasko

Re: Problem detecting Microsoft Office formats from InputStream

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Sep 23, 2012 at 8:07 PM, naskoo <ma...@gmail.com> wrote:
> Thanks for the suggestion. That way the problem is solved at some point.
> I run some more tests, but this time I removed the ms file extensions.
> I get the same not consistent results as before, even if I use
> TikaInputStream as a wrapper.
> Probably TikaInputStream just adds some metadata to include the file
> extension in the detection.

It doesn't add extra metadata (unless explicitly requested). Instead
the TikaInputStream class allows Tika parsers and detectors to use
random access for reading the underlying file.

The MS Office detectors (and a few other features in Tika) rely on
that functionality, and thus won't give as accurate results when given
just a plain InputStream instance.

BR,

Jukka Zitting
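
As an illustration of the reply above: the first bytes of a file identify only its container. Legacy .doc files are OLE2 compound documents and .docx files are ZIP archives, so a detector that sees only the start of a plain stream can name the container but not the specific format inside it. The following is a minimal sketch in plain Java, not Tika's actual detector code; the class name ContainerSniffer, its method names, and the returned labels are all hypothetical.

```java
import java.util.Arrays;

public class ContainerSniffer {
    // OLE2 compound document signature, shared by legacy .doc, .xls and .ppt files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature, shared by .docx, .xlsx, .pptx, .jar and others.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Hypothetical labels: the header alone only reveals the container type.
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2-container";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip-container";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] zipLike = { 'P', 'K', 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipLike));
    }
}
```

A .doc and an .xls file would both come back as "ole2-container" here, which is roughly the situation behind the generic application/x-tika-msoffice and application/x-tika-ooxml results reported in this thread. Telling them apart requires looking at the entries inside the container.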

Re: Problem detecting Microsoft Office formats from InputStream

Posted by Nick Burch <ap...@gagravarr.org>.
On Sun, 23 Sep 2012, naskoo wrote:
> Probably TikaInputStream just adds some metadata to include the file 
> extension in the detection.

Nope, as I said:

>> Try wrapping your InputStream as a TikaInputStream - for full container 
>> detection Tika needs to be able to read the whole file, but still have 
>> it available for the parser

TikaInputStream provides this buffering, which allows a detector to read
the whole file to identify what it contains (something container formats
require), whilst still allowing a parser to get at the whole contents to
process it.

Nick
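
The idea described above, letting a detector read ahead while the parser still receives the content from the first byte, can be sketched with the standard mark/reset mechanism. This is only an illustration in plain Java, under the assumption that a bounded look-ahead is enough; it does not show how TikaInputStream is implemented, and PeekThenParse, peek and readAll are hypothetical names. With Tika itself, the wrapping step is TikaInputStream.get(stream), and the wrapped stream is then passed to detect.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PeekThenParse {
    // Hypothetical "detector" step: read up to limit bytes, then rewind the stream.
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        try {
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream reached before the limit
                }
                total += n;
            }
        } finally {
            in.reset(); // the next reader starts again at the marked position
        }
        return Arrays.copyOf(buf, total);
    }

    // Hypothetical "parser" step: consume everything that is left in the stream.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example document bytes".getBytes(StandardCharsets.US_ASCII);
        // BufferedInputStream supplies the buffering that makes the rewind possible.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] head = peek(in, 7);
        byte[] all = readAll(in);
        System.out.println("peeked " + head.length + " bytes, parser still saw " + all.length);
    }
}
```

The design point is that the look-ahead must not consume the stream: whatever the detector reads has to be held somewhere so that the parser can start again at the beginning.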

Re: Problem detecting Microsoft Office formats from InputStream

Posted by naskoo <ma...@gmail.com>.
Thanks for the suggestion. That solves the problem to some extent.
I ran some more tests, but this time I removed the MS file extensions.
I get the same inconsistent results as before, even when I use
TikaInputStream as a wrapper.
Probably TikaInputStream just adds some metadata so that the file
extension is included in the detection.

On Sun, Sep 23, 2012 at 8:37 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Sun, 23 Sep 2012, naskoo wrote:
>
>> But If Tika's detect(InputStream is) method is used the picture is not the
>> same.
>> The results are:
>> doc - "application/x-tika-msoffice"
>> docx - "application/x-tika-ooxml"
>>
>
> Try wrapping your InputStream as a TikaInputStream - for full container
> detection Tika needs to be able to read the whole file, but still have it
> available for the parser
>
> Nick
>

Re: Problem detecting Microsoft Office formats from InputStream

Posted by Nick Burch <ap...@gagravarr.org>.
On Sun, 23 Sep 2012, naskoo wrote:
> But If Tika's detect(InputStream is) method is used the picture is not the
> same.
> The results are:
> doc - "application/x-tika-msoffice"
> docx - "application/x-tika-ooxml"

Try wrapping your InputStream as a TikaInputStream - for full container 
detection Tika needs to be able to read the whole file, but still have it 
available for the parser

Nick