Posted to user@tika.apache.org by Jon Gorrono <jp...@ucdavis.edu> on 2012/03/14 02:34:41 UTC

'looking' inside an OOXML container

Greetings... I am new to Tika and I am trying to detect the
internal doc format of an ooxml container/file

When I call detect(InputStream, String) on a new Tika() instance, it
appears I can fool the detector(s) by changing the file extension of a
docx file to xlsx... the detection returns
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
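
For readers following along, here is a minimal sketch of what "looking inside" the container can mean, independent of Tika. An OOXML file is a zip archive whose [Content_Types].xml part names the content type of the main document part, so the real format can be read without consulting the file name. This is an illustration using only java.util.zip, not Tika's actual detector. The class name OoxmlPeek, the method mainContentType, and the regex are hypothetical, and stripping the ".main+xml" suffix is an assumption that fits plain docx/xlsx/pptx files but not necessarily macro-enabled or template variants.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads the main part's content type from an OOXML package.
class OoxmlPeek {
    // Matches the main part's content type, e.g.
    // application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
    // and captures everything before the ".main+xml" suffix.
    private static final Pattern MAIN_PART = Pattern.compile(
            "ContentType=\"(application/vnd\\.[^\"]+?)\\.main\\+xml\"");

    /** Returns the captured type from [Content_Types].xml, or null if none is found. */
    static String mainContentType(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"[Content_Types].xml".equals(entry.getName())) {
                    continue; // skip every other part of the package
                }
                // Read the whole entry into memory; it is normally a small XML file.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                String xml = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                Matcher m = MAIN_PART.matcher(xml);
                return m.find() ? m.group(1) : null;
            }
        }
        return null; // not a zip, or no [Content_Types].xml entry
    }
}
```

Because the file name is never consulted, calling OoxmlPeek.mainContentType(new FileInputStream(file)) on a docx renamed to .xlsx should still report the wordprocessingml type rather than the spreadsheetml one.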

Since the code comments use the word 'hint' to describe the use of
resource names during detection, I was hoping that the hint itself was
taken lightly, i.e. as advisory only.

Our application accepts a very limited set of file extensions, and we
have to expect that some users will solve any conundrums about file
formats by renaming their files to meet the requirements.

I think I have all the jars (including transitive deps) piled onto the
classpath so that the more rigorous detection can take place... I've
gone through the list of jars in the 1.0 gettingstarted.html doc twice to
make sure they are all listed in the Eclipse classpath... I just
don't know if what I am seeing is consistent with missing jars or not.

I've done some debugging and see a very long list of Magics, but, again,
I don't know if that is core or not... should I see a long list of
detectors as well?

Any help offered would be appreciated

-- 
Jon Gorrono
PGP Key: 0x5434509D -
http{pgp.mit.edu:11371/pks/lookup?search=0x5434509D&op=index}
GSWoT Introducer - {GSWoT:US75 5434509D Jon P. Gorrono <jpgorrono -
www.gswot.org>}
http{middleware.ucdavis.edu}

Re: 'looking' inside an OOXML container

Posted by Jon Gorrono <jp...@ucdavis.edu>.
Thanks for the response.... replies in-line below...

On Thu, Mar 15, 2012 at 9:58 AM, Nick Burch <ni...@alfresco.com> wrote:
> On Tue, 13 Mar 2012, Jon Gorrono wrote:
>>
>> The tika-app jar properly identifies the misnamed file so it's either a
>> classpath or an implementation issue
>
>
> You'll need to have the Tika Parsers jar (and associated dependencies) for
> it to work properly. We do have unit tests for this, and as long as the
> parser jar + dependencies are there, then the appropriate detector will
> fire. It may be worth making sure you use a recent nightly build, or waiting
> for Tika 1.1 (hopefully due soon) though, as I seem to recall we had to fix
> an ordering problem at some point
>
>
>> Also ContainerAwareDetector does not seem to exist in 1.0 ... this leads
>> me to think that that part was abstracted for ease of use and the docs are
>> now outdated(?)
>
>
> Which docs were you looking at? ContainerAwareDetector has gone, yes, it's
> now handled by the same service loading mechanism that parsers use

http://tika.apache.org/1.0/detection.html

And this

http://tika.apache.org/1.0/gettingstarted.html

... is where I am getting the list of jars to put on the classpath.
All the jars in that list are present with the same versions listed, and
there are no conflicts with other versions.


>
>
>> But should I then be wrapping the inputstream in a TikaInputStream?
>
>
> If you have a File, then I'd suggest you use a TikaInputStream



OK, I wrapped the file... right now I see the same (no) effect.




>
>
>> I also tried the detection after creating a Spring bean for the Tika class
>> in the hope that it might wake up a hidden 'inner-self' :)
>
>
> Not sure if it'll help or not, but you could look at Alfresco for an example
> (though a large one!) of using spring beans with Tika for detection

I just thought that using Spring might help some aspect of the
detection bootstrapping, like getting the OSGi environment squared
away... but if that is not the case, I'll be just as well off
without it.


>
> Nick


Re: 'looking' inside an OOXML container

Posted by Nick Burch <ni...@alfresco.com>.
On Tue, 13 Mar 2012, Jon Gorrono wrote:
> The tika-app jar properly identifies the misnamed file so it's either a
> classpath or an implementation issue

You'll need to have the Tika Parsers jar (and associated dependencies) for 
it to work properly. We do have unit tests for this, and as long as the 
parser jar + dependencies are there, then the appropriate detector will 
fire. It may be worth making sure you use a recent nightly build, or 
waiting for Tika 1.1 (hopefully due soon) though, as I seem to recall we 
had to fix an ordering problem at some point

> Also ContainerAwareDetector does not seem to exist in 1.0 ... this leads 
> me to think that that part was abstracted for ease of use and the docs 
> are now outdated(?)

Which docs were you looking at? ContainerAwareDetector has gone, yes, it's 
now handled by the same service loading mechanism that parsers use
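
One way to check whether the service-loaded detectors are actually visible is to ask the class loader which classpath locations carry a service registration file. A minimal sketch, assuming that detectors are registered under the standard META-INF/services/<interface name> convention with the interface name org.apache.tika.detect.Detector (an assumption, not something confirmed in this thread). ServiceFileCheck and registrations are hypothetical names, and only the JDK is used.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic: finds service registration files on the classpath.
class ServiceFileCheck {
    /** Lists every classpath location that registers providers for the given service name. */
    static List<URL> registrations(String serviceName) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        // Each URL points at one META-INF/services file, typically inside a jar.
        return Collections.list(loader.getResources("META-INF/services/" + serviceName));
    }
}
```

Printing ServiceFileCheck.registrations("org.apache.tika.detect.Detector") from inside the application shows which jars, if any, contribute detectors. An empty list would suggest the parsers jar is not visible to that class loader, which is a different problem from a detector ordering issue.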

> But should I then be wrapping the inputstream in a TikaInputStream?

If you have a File, then I'd suggest you use a TikaInputStream

> I also tried the detection after creating a Spring bean for the Tika class
> in the hope that it might wake up a hidden 'inner-self' :)

Not sure if it'll help or not, but you could look at Alfresco for an 
example (though a large one!) of using spring beans with Tika for 
detection

Nick

Re: 'looking' inside an OOXML container

Posted by Jon Gorrono <jp...@ucdavis.edu>.
The tika-app jar properly identifies the misnamed file, so it's either
a classpath or an implementation issue.

I've checked the classpath again and verified that all jars are present
and accounted for, with no duplicates or version conflicts.

Also ContainerAwareDetector does not seem to exist in 1.0 ... this
leads me to think that that part was abstracted for ease of use and
the docs are now outdated(?)

But should I then be wrapping the inputstream in a TikaInputStream?

I also tried the detection after creating a Spring bean for the Tika
class in the hope that it might wake up a hidden 'inner-self' :)



