Posted to user@tika.apache.org by Vish Ramachandran <vi...@hp.com> on 2012/06/15 22:31:34 UTC

MSI file types being detected as application/x-tika-msoffice

Hi,

Download the following file, which is the MSI installer for 7-Zip, a zip utility.

http://downloads.sourceforge.net/sevenzip/7z920-x64.msi

The following code:

String detectedType = new Tika().detect(new File("7z920-x64.msi"));

results in mime: application/x-tika-msoffice

which is wrong.

Is this expected, or am I missing something else?

Thanks
Vish

Re: MSI file types being detected as application/x-tika-msoffice

Posted by Alex Ott <al...@gmail.com>.
If you ping me on Monday, I can try to find a small example. Although I
think one can be generated using the Windows Installer toolkit.

On Sat, Jun 16, 2012 at 5:37 PM, Nick Burch <ni...@alfresco.com> wrote:
> On 16/06/12 09:14, Alex Ott wrote:
>>
>> An MSI file is a Windows Installer package, but internally it uses the
>> MS-CFB file format to store data. To detect it correctly, the detector
>> should transform the object names (7z can do this, if I remember
>> correctly) into human-readable names, and then search for special
>> entries.
>
>
> We can certainly update the detectors to handle this new (to us!) kind of
> OLE2-based file. Does anyone know of a very small .msi file we can use in a
> unit test for this? (An example one might be a good bet)
>
> Nick



-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Re: MSI file types being detected as application/x-tika-msoffice

Posted by Nick Burch <ni...@alfresco.com>.
On 16/06/12 09:14, Alex Ott wrote:
> An MSI file is a Windows Installer package, but internally it uses the
> MS-CFB file format to store data. To detect it correctly, the detector
> should transform the object names (7z can do this, if I remember
> correctly) into human-readable names, and then search for special
> entries.

We can certainly update the detectors to handle this new (to us!) kind
of OLE2-based file. Does anyone know of a very small .msi file we can
use in a unit test for this? (An example one might be a good bet)

Nick
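
For context on why detection lands on a generic container type: MSI
installers and legacy Office documents open with the same 8-byte OLE2
(MS-CFB) signature, so a header check alone cannot tell them apart. A
minimal Java sketch of such a check follows. The class and method names
are hypothetical and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not a Tika class.
final class Ole2Magic {
    // The 8-byte signature that opens every OLE2 / MS-CFB compound file.
    private static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the given bytes start with the OLE2 signature. */
    static boolean isOle2(byte[] header) {
        return header.length >= OLE2_SIGNATURE.length
            && Arrays.equals(
                   Arrays.copyOf(header, OLE2_SIGNATURE.length),
                   OLE2_SIGNATURE);
    }

    /** Reads the first 8 bytes of a file and checks them. */
    static boolean isOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return isOle2(in.readNBytes(OLE2_SIGNATURE.length));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2Magic.java <file>");
            return;
        }
        System.out.println(isOle2(Path.of(args[0])));
    }
}
```

A check like this should answer true for both an .msi file and a .doc
file, which is why telling them apart requires looking at the entries
inside the container.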

Re: MSI file types being detected as application/x-tika-msoffice

Posted by Alex Ott <al...@gmail.com>.
An MSI file is a Windows Installer package, but internally it uses the
MS-CFB file format to store data. To detect it correctly, the detector
should transform the object names (7z can do this, if I remember
correctly) into human-readable names, and then search for special
entries.
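
The name transformation is not spelled out in this thread. The sketch
below shows one way it could look, assuming the packing scheme commonly
described for open-source MSI readers: characters in the range 0x3800
to 0x483F each carry one or two symbols from a 64-symbol alphabet. The
alphabet, the offsets, and the class name are all assumptions made for
illustration and have not been checked against the Windows Installer
implementation.

```java
// Hypothetical helper; the constants below are assumptions, see above.
final class MsiStreamNames {
    // Assumed 64-symbol alphabet used to pack stream names.
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz._";
    // Assumed start of the encoded character range.
    private static final int BASE = 0x3800;
    // 64 * 64 two-symbol codes followed by 64 one-symbol codes.
    private static final int RANGE = 64 * 65;

    /** Expands packed characters; anything outside the range is copied. */
    static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            int c = raw.charAt(i);
            if (c < BASE || c >= BASE + RANGE) {
                out.append((char) c); // not packed, keep as-is
                continue;
            }
            c -= BASE;
            int low = c & 0x3F;   // first symbol
            int high = c >> 6;    // second symbol, or 64 if there is none
            out.append(ALPHABET.charAt(low));
            if (high < 64) {
                out.append(ALPHABET.charAt(high));
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, a detector could decode each entry name in the
container and then look for names that only an installer database would
contain. Which names those are is not stated in the thread.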

On Fri, Jun 15, 2012 at 10:31 PM, Vish Ramachandran
<vi...@hp.com> wrote:
> Hi,
>
> Download the following file, which is the MSI installer for 7-Zip, a zip utility.
>
> http://downloads.sourceforge.net/sevenzip/7z920-x64.msi
>
> The following code:
>
> String detectedType = new Tika().detect(new File("7z920-x64.msi"));
>
> results in mime: application/x-tika-msoffice
>
> which is wrong.
>
> Is this expected, or am I missing something else?
>
> Thanks
> Vish



-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott