Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2011/06/22 15:00:39 UTC

Tika Jax-RS and DIH

> Mattmann, Chris A (388J) <chris.a.mattmann <at> jpl.nasa.gov> writes:
>
>> > Hi Jo,
>> >
>> > You may consider checking out Tika trunk, where we recently have a Tika JAX-RS
>> > web service [1] committed as part of the tika-server module. You could probably
>> > wire DIH into it and accomplish the same thing.
>> >
>> > Cheers,
>> > Chris
>> >
>> > [1] https://issues.apache.org/jira/browse/TIKA-593


Chris - could you elaborate on using Tika Jax-RS and DIH?  How 
production ready is it?  Could you summarize the steps necessary to get 
it to work?  Any examples yet?

I'd be happy to work with you to get something out to the group.


Thanks - Tod

Re: Tika Jax-RS and DIH

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Tod,

On Jun 22, 2011, at 6:00 AM, Tod wrote:

>> Mattmann, Chris A (388J) <chris.a.mattmann <at> jpl.nasa.gov> writes:
>>
>>>> Hi Jo,
>>>>
>>>> You may consider checking out Tika trunk, where we recently have a Tika JAX-RS
>>>> web service [1] committed as part of the tika-server module. You could probably
>>>> wire DIH into it and accomplish the same thing.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> [1] https://issues.apache.org/jira/browse/TIKA-593
> 
> 
> Chris - could you elaborate on using Tika Jax-RS and DIH?  How 
> production ready is it?  

Sure. I know that Maxim Valyanskiy has done a bunch of work with the Tika JAX-RS layer. It simply exposes Tika's metadata extraction and unpacking capabilities via the JSR 311 spec. So you get REST services like:

/meta

HTTP PUT a document to the /meta service and you get back the metadata as "text/csv".

/tika

HTTP PUT a document to the /tika service and you get back the extracted text.
HTTP GET prints a greeting stating that the server is up.

/unpacker

HTTP PUT a document that contains embedded resources to the /unpacker service and you get back a zip of the extracted text for each resource filename in the original document.
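
As a rough illustration of the three services above, here is a minimal Python sketch of a client. The base URL http://localhost:8080/tika-server and the helper names build_put_request and put_document are assumptions made for illustration (the real host, port, and context path depend on how the server is deployed); only the /meta, /tika, and /unpacker paths and the use of HTTP PUT come from the description above.

```python
import urllib.request

# Assumed base URL; adjust to wherever the tika-server WAR is deployed.
BASE_URL = "http://localhost:8080/tika-server"

# Service paths as described in this thread.
SERVICES = ("meta", "tika", "unpacker")


def build_put_request(service, payload, base_url=BASE_URL):
    """Build (but do not send) an HTTP PUT request for one of the services."""
    if service not in SERVICES:
        raise ValueError("unknown service: %r" % service)
    url = "%s/%s" % (base_url.rstrip("/"), service)
    return urllib.request.Request(url, data=payload, method="PUT")


def put_document(service, path, base_url=BASE_URL):
    """PUT the file at `path` to a service and return the raw response bytes."""
    with open(path, "rb") as handle:
        request = build_put_request(service, handle.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With a server running, something like put_document("tika", "report.pdf") should return the extracted text, put_document("meta", "report.pdf") the CSV metadata, and put_document("unpacker", "bundle.docx") the bytes of a zip archive. The file names here are placeholders.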


> Could you summarize the steps necessary to get 
> it to work?  Any examples yet?

Basically you just build the tika-server WAR file, drop it onto a servlet container (Tomcat, Jetty, etc.), and then you've got a Tika JAX-RS server.
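
Once the WAR is deployed, one way to confirm the server is reachable is to issue the HTTP GET on /tika mentioned earlier, which prints a greeting. The following Python sketch does that; the function name server_is_up, the injectable opener parameter, and the example base URL are assumptions for illustration, and the context path will depend on the name the WAR is deployed under.

```python
import urllib.error
import urllib.request


def server_is_up(base_url, opener=urllib.request.urlopen, timeout=5):
    """Return True if a GET on <base_url>/tika answers with HTTP 200."""
    url = base_url.rstrip("/") + "/tika"
    try:
        # `opener` defaults to urlopen; it is a parameter so it can be stubbed.
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

For example, server_is_up("http://localhost:8080/tika-server") would return False until the container has finished deploying the application at that (assumed) address.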

> 
> I'd be happy to work with you to get something out to the group.

Awesome! I've created a Tika Wiki page here:

http://wiki.apache.org/tika/TikaJAXRS

Since this is really also Tika related, please feel free to join user@tika.apache.org or dev@tika.apache.org by sending emails to:

user-subscribe@tika.apache.org
dev-subscribe@tika.apache.org

Then you can move the Tika portions of the conversation there. For the Solr/DIH side, this is the right list.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++