Posted to dev@tika.apache.org by Nick Burch <ap...@gagravarr.org> on 2015/06/07 14:01:22 UTC

Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

On Sun, 7 Jun 2015, Mattmann, Chris A (3980) wrote:
> Also the lovely thing here too is that since cTAKESParser is a decorator 
> for AutoDetectParser there is magical infinite recursion if it’s enabled 
> via SPI.

Should it really be a wrapper for AutoDetectParser though? I haven't read 
through the wiki page or the code yet (need to do that after lunch...), 
but my general guess would've been that a wrapping parser should sit 
between AutoDetectParser and DefaultParser? (AutoDetectParser normally 
calls to DefaultParser via the Tika config).

If it worked that way, we could slip it in between the two in the tika 
config file.
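
To make the recursion concern concrete, here is a minimal, self-contained 
sketch. It does not use Tika's real classes: Parser, AutoDetect, PlainText, 
OutsideDecorator and REGISTRY are hypothetical stand-ins, and the REGISTRY 
list only imitates service-based discovery. The depth guard is also an 
assumption, added so the demo stops cleanly instead of overflowing the stack.

```java
import java.util.ArrayList;
import java.util.List;

class SpiRecursionSketch {
    /** Hypothetical stand-in for a parser interface; not Tika's real API. */
    interface Parser {
        String parse(String input, int depth);
    }

    /** Imitates service discovery: every registered parser is found automatically. */
    static final List<Parser> REGISTRY = new ArrayList<>();

    /** Guard so the demo reports the loop instead of overflowing the stack. */
    static final int MAX_DEPTH = 50;

    /** Auto-detecting front end: hands the input to every discovered parser. */
    static class AutoDetect implements Parser {
        public String parse(String input, int depth) {
            if (depth > MAX_DEPTH) {
                throw new IllegalStateException("recursion detected at depth " + depth);
            }
            StringBuilder out = new StringBuilder();
            for (Parser p : REGISTRY) {
                out.append(p.parse(input, depth + 1));
            }
            return out.toString();
        }
    }

    /** An ordinary leaf parser that never delegates. */
    static class PlainText implements Parser {
        public String parse(String input, int depth) {
            return "text(" + input + ")";
        }
    }

    /** Decorator that wraps the auto-detecting front end from the outside. */
    static class OutsideDecorator implements Parser {
        private final Parser wrapped = new AutoDetect();

        public String parse(String input, int depth) {
            return "annotated[" + wrapped.parse(input, depth + 1) + "]";
        }
    }

    /** Registers the decorator like any other parser and reports whether parsing loops. */
    static boolean recursesWhenRegistered() {
        REGISTRY.clear();
        REGISTRY.add(new PlainText());
        REGISTRY.add(new OutsideDecorator());
        try {
            new AutoDetect().parse("doc", 0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("loops when registered: " + recursesWhenRegistered());
    }
}
```

In this toy model the loop arises because the decorator is both discoverable 
by the front end and a caller of the front end. A decorator that sits inside 
the front end and wraps only the leaf parsers has no such cycle.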

Though if someone could quickly point out why it needs to wrap outside 
AutoDetectParser rather than inside, that'd save time!

Nick

Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Works great, thanks Nick. I’ll update the wiki once we release 1.10
since 1.9 will have the old way of doing it.

Thanks for this!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




-----Original Message-----
From: Nick Burch <ap...@gagravarr.org>
Reply-To: "dev@tika.apache.org" <de...@tika.apache.org>
Date: Monday, June 8, 2015 at 8:29 AM
To: "dev@tika.apache.org" <de...@tika.apache.org>
Subject: Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser



Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Posted by Nick Burch <ap...@gagravarr.org>.
On Sun, 7 Jun 2015, Mattmann, Chris A (3980) wrote:
> Great question, Nick. If you have a better idea on how to make it so that 
> any file can come into the cTAKES parser, get its text and metadata 
> parsed out, and then feed that into cTAKES, I’m all ears. We just thought 
> that decorating AutoDetect would serve that purpose for us. Since cTAKES 
> just puts metadata in the met object (as of now) and doesn’t do XHTML 
> content (future improvement), I suppose we could instantiate an 
> AutoDetectParser instead of decorating it, which may help. Dunno; anyway, 
> looking forward to hearing your thoughts :-)

I've had a go at this, and fixed a few Tika bugs on the way... You can now 
(as detailed in the javadoc) just do:
    AutoDetectParser parser = new AutoDetectParser(new CTAKESParser());
And you'll get auto-detection with cTAKES applied to the result.

Alternatively, if you want to turn on cTAKES support in config, e.g. for use 
with the Tika CLI or Tika Server, you just need a config file like:
   <properties>
     <parsers>
       <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
          <parser class="org.apache.tika.parser.DefaultParser"/>
       </parser>
     </parsers>
   </properties>
(Example config file in SVN!)
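
The nesting in that config (front end, then the cTAKES decorator, then the 
default parser) can be sketched as a toy model. None of this is Tika's real 
API: Parser, AutoDetectLike, AnnotatingDecorator and DefaultLike are 
hypothetical stand-ins, the "type" check is a placeholder for real detection, 
and the word count is a placeholder for whatever the annotator would add.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class InsideWrappingSketch {
    /** Hypothetical stand-in for a parser interface; not Tika's real API. */
    interface Parser {
        void parse(String input, Map<String, String> metadata);
    }

    /** Stand-in for the default parser: extracts the text. */
    static class DefaultLike implements Parser {
        public void parse(String input, Map<String, String> metadata) {
            metadata.put("text", input.trim());
        }
    }

    /** Stand-in for the annotating decorator: runs the wrapped parser, then adds metadata. */
    static class AnnotatingDecorator implements Parser {
        private final Parser wrapped;

        AnnotatingDecorator(Parser wrapped) {
            this.wrapped = wrapped;
        }

        public void parse(String input, Map<String, String> metadata) {
            wrapped.parse(input, metadata);
            String text = metadata.getOrDefault("text", "");
            // Placeholder annotation: count whitespace-separated words.
            int words = text.isEmpty() ? 0 : text.split("\\s+").length;
            metadata.put("annotation.wordCount", String.valueOf(words));
        }
    }

    /** Stand-in for the auto-detecting front end: records a type, then delegates. */
    static class AutoDetectLike implements Parser {
        private final Parser delegate;

        AutoDetectLike(Parser delegate) {
            this.delegate = delegate;
        }

        public void parse(String input, Map<String, String> metadata) {
            // Placeholder detection rule, for illustration only.
            metadata.put("type", input.startsWith("<") ? "markup" : "plain");
            delegate.parse(input, metadata);
        }
    }

    /** Builds the chain front end -> decorator -> default parser and runs it once. */
    static Map<String, String> run(String input) {
        Parser parser = new AutoDetectLike(new AnnotatingDecorator(new DefaultLike()));
        Map<String, String> metadata = new LinkedHashMap<>();
        parser.parse(input, metadata);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(run("  patient reports mild fever "));
    }
}
```

Because the decorator only ever calls the parser it wraps, and never the 
front end, each call in this sketch passes through every layer exactly once.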


Does this work for everyone?

Nick

Re: svn commit: r1683969 - /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Great question, Nick. If you have a better idea on how to make it
so that any file can come into the cTAKES parser, get its text and
metadata parsed out, and then feed that into cTAKES, I’m all ears.
We just thought that decorating AutoDetect would serve that purpose
for us. Since cTAKES just puts metadata in the met object (as of now)
and doesn’t do XHTML content (future improvement), I suppose we could
instantiate an AutoDetectParser instead of decorating it, which may
help. Dunno; anyway, looking forward to hearing your thoughts :-)




