Posted to dev@tika.apache.org by Simon Tyler <st...@mimecast.net> on 2010/03/15 10:27:25 UTC

Detector results for Excel formats

Hi,

I am doing some testing of Tika 0.6 and noticed some odd results for the
testEXCEL.xls file included in the test suite.

Calling the following code 100 times:

            is = new BufferedInputStream(new FileInputStream(filename));

            Metadata metadata = new Metadata();
            metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

            String type = tika.detect(is, metadata);

returns either application/msword or application/vnd.ms-excel, seemingly at
random.

Is this expected? Is there a way to mitigate it?

Simon

Re: Detector results for Excel formats

Posted by Simon Tyler <st...@mimecast.net>.

I raised https://issues.apache.org/jira/browse/TIKA-391 and attached a fix
based on Tika 0.6. A full fix might involve more, as the issue can affect
any method that uses the results from getMimeType.

Simon


Re: Detector results for Excel formats

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hi Simon,

Can you prepare a patch, and post it to JIRA? I'll happily take a look.

Thanks,
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Detector results for Excel formats

Posted by Simon Tyler <st...@mimecast.net>.

I have had a further look at the nature of the failure to detect the type of
the particular file and still feel it is a bug.

This is an Excel (.xls) spreadsheet, and I give the detector the correct
filename and the correct content type for it. The detector still sometimes
fails to identify it correctly.

I had a look at the code and the reason is now clear to me and is easily
fixed.

The getMimeType method searches for a magic match and stops at the first
hit. The search is ordered (based on priority, size and clause). This
particular file matches two detectors (Word and Excel) which compare
identically, so their relative order in the SortedSet is undefined. That
is the cause of the problem.

A fix is for getMimeType to return the complete set of matches rather than a
single match, and then to check the filename and content-type hints against
each match, returning the first that agrees with either. I have modified the
code to do this and it solves the problem. The hint matching could be
improved further if necessary, so that it picks the best match from the set
based on both hints rather than just stopping at the first.
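
A minimal sketch of the approach described above, not the actual patch
attached to TIKA-391. It assumes magic matching has already returned every
tied candidate type, and then lets the content-type hint or the filename hint
choose among them. The class name HintResolver, the method pick, and the small
extension table are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: picks one type from a set of tied magic matches.
final class HintResolver {
    // Illustrative extension table (an assumption); a real detector would use its glob patterns.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
    }

    // Returns the first magic match that agrees with either hint, or the first match otherwise.
    static String pick(List<String> magicMatches, String fileName, String contentTypeHint) {
        if (magicMatches.isEmpty()) {
            return null;
        }
        String byName = null;
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0) {
                String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                byName = BY_EXTENSION.get(ext);
            }
        }
        for (String candidate : magicMatches) {
            if (candidate.equals(contentTypeHint) || candidate.equals(byName)) {
                return candidate;
            }
        }
        // Neither hint agreed with any match: fall back to the first one.
        return magicMatches.get(0);
    }

    public static void main(String[] args) {
        List<String> tied = Arrays.asList("application/msword", "application/vnd.ms-excel");
        System.out.println(pick(tied, "testEXCEL.xls", null)); // prints application/vnd.ms-excel
    }
}
```

With this shape the result no longer depends on which tied match happens to
come first, as long as at least one hint is supplied.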

Simon



Re: Detector results for Excel formats

Posted by Alex Ott <al...@gmail.com>.

Re

Ken Krugler  at "Thu, 18 Mar 2010 12:07:14 -0700" wrote:
 KK> Thanks, Alex - great input.

 KK> We'd run into similar problems at Krugle, with determining the correct mime-type for
 KK> source code. Sometimes you wind up needing to parse the  code to make the correct choice.

 KK> We had extended the Nutch mime-type detector to support both regex and post-processing to
 KK> handle this disambiguation.

 KK> But that was hard-coded for a handful of known edge cases.

 KK> One possible way for this to work with the current XML-based mime-type definitions is to
 KK> have a "here's the name of the class you'll have to  instantiate and run to make the final
 KK> call"

Yes - I have something like that in my own media type detector (for data leak
prevention): when a signature (either CFBF or Zip) is found, the
corresponding code is called and returns a constant that corresponds to some
type. (I need to implement the logic inside my own code, because the rules
are sometimes too complex to express as simpler rules.) In the end I have
something like:

if CFBF Signature then get type from CFBF and if type == NNN then mimetype = word/excel/...

But I have a special Lisp-like language to describe complex checks...
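
A rough Java sketch of the two-stage idea above, not the detector described in
this message. Stage one checks the leading bytes for the CFBF or Zip signature;
stage two hands the data to container-specific code supplied by the caller.
The names ContainerSniffer, sniff, detect and the refiner map are all
hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;

// Hypothetical two-stage detector: a cheap signature check selects a container
// family, and container-specific code then decides the final type.
final class ContainerSniffer {
    enum Container { CFBF, ZIP, UNKNOWN }

    // CFBF (OLE2) files start with these eight bytes.
    private static final byte[] CFBF_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Zip archives normally start with a local file header: "PK" 0x03 0x04.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Stage one: classify the container by its signature.
    static Container sniff(byte[] data) {
        if (startsWith(data, CFBF_MAGIC)) {
            return Container.CFBF;
        }
        if (startsWith(data, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    // Stage two: let the refiner registered for that container make the final call.
    static String detect(byte[] data,
                         Map<Container, Function<byte[], String>> refiners,
                         String fallback) {
        Function<byte[], String> refiner = refiners.get(sniff(data));
        return refiner == null ? fallback : refiner.apply(data);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller would register one refiner per container family, for example a CFBF
refiner that inspects the file's directory entries before naming a type.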


-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
http://alexott-ru.blogspot.com/

Re: Detector results for Excel formats

Posted by Ken Krugler <kk...@transpac.com>.

Thanks, Alex - great input.

We'd run into similar problems at Krugle, with determining the correct  
mime-type for source code. Sometimes you wind up needing to parse the  
code to make the correct choice.

We had extended the Nutch mime-type detector to support both regex and  
post-processing to handle this disambiguation.

But that was hard-coded for a handful of known edge cases.

One possible way for this to work with the current XML-based mime-type
definitions is to have a "here's the name of the class you'll have to
instantiate and run to make the final call" entry.
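
One way such an entry could be consumed, sketched in Java under stated
assumptions: the type definition carries a fully qualified class name, and
that class implements a small callback interface. Disambiguator,
ShebangDisambiguator and DisambiguatorLoader are hypothetical names; nothing
like this exists in the Tika or Nutch code discussed here.

```java
// Hypothetical plug-in contract: a type definition would name a class implementing this.
interface Disambiguator {
    // Returns the final type, given the raw data and the type the generic rules proposed.
    String resolve(byte[] data, String candidateType);
}

// Illustrative implementation only: treats data starting with "#!" as a shell script.
final class ShebangDisambiguator implements Disambiguator {
    @Override
    public String resolve(byte[] data, String candidateType) {
        boolean shebang = data.length >= 2 && data[0] == '#' && data[1] == '!';
        return shebang ? "application/x-sh" : candidateType;
    }
}

final class DisambiguatorLoader {
    // Instantiates the named class through its no-argument constructor.
    static Disambiguator load(String className) throws ReflectiveOperationException {
        Class<? extends Disambiguator> cls =
            Class.forName(className).asSubclass(Disambiguator.class);
        return cls.getDeclaredConstructor().newInstance();
    }
}
```

The detector would call load only for types whose definition names a class,
so the common case stays as cheap as it is today.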

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Detector results for Excel formats

Posted by Alex Ott <al...@gmail.com>.

I'm not sure whether this is relevant for Tika, but I looked into its mime
database and see a problem in the definitions: both types use the common OLE
(MS CFBF - Microsoft Compound File Binary Format) signature, which is also
used by dozens of other file formats.  To detect the mime type of a CFBF
file correctly, you need to analyze it (with POI?) and find out which objects
are located in the top-level directory (directly under the Root Directory
entry) of the OLE file.  For Word this is the WordDocument object, while for
Excel it is Workbook or Book.  A simple search for the corresponding names
will not help, because all these objects can be embedded into other
documents via OLE.

You can find other details in the official Microsoft documentation.
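
The top-level check described above could look roughly like this in Java. The
sketch only covers the decision step: it assumes the names of the entries
directly under the root directory have already been read (for instance with
POI's POIFSFileSystem, which is not shown here). OleTypeGuesser and guess are
hypothetical names, and the fallback type is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical classifier: maps the names of the top-level entries of an OLE2 (CFBF)
// file to a media type. Reading the entries is left to a library such as POI.
final class OleTypeGuesser {
    static String guess(Set<String> topLevelEntries) {
        if (topLevelEntries.contains("WordDocument")) {
            return "application/msword";
        }
        if (topLevelEntries.contains("Workbook") || topLevelEntries.contains("Book")) {
            return "application/vnd.ms-excel";
        }
        // Some other CFBF-based format; this fallback value is an assumption.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        Set<String> entries = new HashSet<>(Arrays.asList("Workbook", "\u0005SummaryInformation"));
        System.out.println(guess(entries)); // prints application/vnd.ms-excel
    }
}
```

Because only top-level names are passed in, a Workbook embedded inside a Word
document does not change the answer.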

-- 
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/        http://alexott.net/
http://alexott-ru.blogspot.com/

Re: Detector results for Excel formats

Posted by Simon Tyler <st...@mimecast.net>.

Hi,

I haven't seen any responses to this. Does anyone know why I should be
seeing such unpredictable behaviour?

Simon
