Posted to user@tika.apache.org by Mr Havercamp <mr...@gmail.com> on 2012/07/20 17:17:33 UTC

Tika Server mode accessing with CURL

Have been playing around with integrating Tika into my PHP app.

I have had great success with Tika on the command line and also SolrCell.

However, I was wondering if there is some way of running Tika in server 
mode and extracting a document, say, via CURL.

I have had varying degrees of success with:

nc localhost 30000 < /opt/lampp/htdocs/joomla25/tmp/InformationRepository.pdf

but I'm wondering how I pass other params such as for extracting just 
metadata or content in html format.

Any help would be much appreciated.

Cheers


Hayden

Re: Tika Server mode accessing with CURL

Posted by Jason Judge <ja...@consil.co.uk>.
Hayden,

I am developing a small PHP library to drive the command line version of Tika to
perform a variety of functions. The library handles input and output files,
tidying them up when finished, and delivers data in files or open streams.

I'm doing this primarily for a project to analyse uploaded CVs, but also for
getting into PSR-0 so it can be used on a variety of projects. If you are
interested, I can send you what I have got so far.

The hope was that using the library could be agnostic to how it accesses Tika,
whether command line, server or even something like java-php-bridge, but what I
have found so far is that these access methods are inconsistent with one
another, i.e. they offer different features. There is stuff you can do from the
command line that you can't do from the server mode and vice versa (I've raised
a ticket on this). I think a Java/PHP bridge would be best, but I have
absolutely no experience in Java servers and setting up custom Java
applications, and it's a steep learning curve to get into.

But anyway, the ultimate aim is to get a portable PHP library that can use the
features of Tika in a consistent way, and perhaps use drivers so that whichever
method of accessing Tika is available can be used.
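
A rough illustration of the driver idea, written as shell functions rather
than PHP to match the command-line examples in this thread. This is only a
sketch under stated assumptions, not the library described above: the function
names, the TIKA_JAR path and the TIKA_URL value (including port 9998) are made
up for the example. It relies on the tika-app --text and --metadata options
and on the /tika and /meta resources of the Tika JAXRS server.

```shell
#!/bin/sh
# Hypothetical locations; both are assumptions for this sketch.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Driver 1: the tika-app command line (--text and --metadata output options).
tika_cli() {
    case "$1" in
        text) java -jar "$TIKA_JAR" --text "$2" ;;
        meta) java -jar "$TIKA_JAR" --metadata "$2" ;;
        *)    echo "cli driver: unsupported mode '$1'" >&2; return 2 ;;
    esac
}

# Driver 2: a Tika JAXRS server (HTTP PUT to /tika for text, /meta for metadata).
tika_server() {
    case "$1" in
        text) curl -sS -T "$2" "$TIKA_URL/tika" ;;
        meta) curl -sS -T "$2" "$TIKA_URL/meta" ;;
        *)    echo "server driver: unsupported mode '$1'" >&2; return 2 ;;
    esac
}

# Front end: the caller uses the same call whichever driver is configured.
# usage: tika_extract DRIVER MODE FILE
tika_extract() {
    case "$1" in
        cli)    tika_cli "$2" "$3" ;;
        server) tika_server "$2" "$3" ;;
        *)      echo "unknown driver '$1'" >&2; return 2 ;;
    esac
}
```

With this shape, "tika_extract cli meta file.pdf" and "tika_extract server meta
file.pdf" differ only in the driver name, and a mode that one driver lacks
fails loudly instead of silently behaving differently.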

-- Jason


jason.judge@consil.co.uk <ma...@consil.co.uk>
www.consil.co.uk <http://www.consil.co.uk/>


Re: Tika Server mode accessing with CURL

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Hayden,

On Jul 21, 2012, at 6:24 AM, Mr Havercamp wrote:

> Hi Chris
> 
> Thanks for your links, etc. I have successfully built and run Tika JAXRS and will look to incorporate it into my component so that users can configure and use it for Tika extraction (currently I have local Tika and SolrCell (Solr server)). I think it is important to provide users with different options depending on their requirements (e.g. performance, simplicity, cost-effectiveness, etc).

Awesome, +1!

> 
> Using Tika JAXRS I can very easily extract metadata which is great. I am also able to extract content as plain text but I cannot see a setting for returning content in xml/html. Is there a setting for this? Perhaps I'm missing something.

You are most likely correct. The JAXRS module is an evolving spec and, as Jason put it,
we would like to make it, the CLI and the server interface a bit more consistent and
standardized. If there is something that you find it doesn't do (e.g., xml/html output),
please file a feature request at: https://issues.apache.org/jira/browse/TIKA so that we can
keep it in mind going forward when folks are working on this. Also, contributions are welcome,
so if you think you could take a crack at adding it, awesome. If not, I'm sure
one of the devs working on Tika JAXRS will get around to it.

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Tika Server mode accessing with CURL

Posted by Mr Havercamp <mr...@gmail.com>.
Hi Chris

Thanks for your links, etc. I have successfully built and run Tika JAXRS
and will look to incorporate it into my component so that users can
configure and use it for Tika extraction (currently I have local Tika
and SolrCell (Solr server)). I think it is important to provide users
with different options depending on their requirements (e.g.
performance, simplicity, cost-effectiveness, etc).

Using Tika JAXRS I can very easily extract metadata, which is great. I am
also able to extract content as plain text, but I cannot see a setting
for returning content in xml/html. Is there a setting for this? Perhaps
I'm missing something.

Cheers


Hayden



Re: Tika Server mode accessing with CURL

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Hayden,

Thanks a ton! Yep I think TikaJAXRS will be a viable option for remote tika extraction.

Let me know how I can help.

Thanks much!

Cheers,
Chris



Re: Tika Server mode accessing with CURL

Posted by Mr Havercamp <mr...@gmail.com>.
Hi Chris

Thanks for the reply. I will check it out and let you know how I go.

I am developing an extension for Joomla which uses Solr and Tika to 
index content and attachments. I have three configuration options for 
users to select when specifying a method to extract content and metadata 
from files; a local install of the tika app, SolrCell, or a remote tika 
server. In your opinion, would TikaJAXRS be a viable option for remote 
tika extraction (for example, running on a separate server) especially 
in regards to performance and security?
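
Of the three options, the SolrCell one can be sketched roughly as below. This
is an illustration under assumptions, not the extension's actual code: it
assumes a single-core Solr at http://localhost:8983/solr with the extracting
request handler mounted at /update/extract, and the names SOLR_URL and
solrcell_extract are made up for the example. The extractOnly=true parameter
asks Solr to return the extracted content without indexing it, and
extractFormat selects "xml" (XHTML) or "text".

```shell
#!/bin/sh
# Base URL of the Solr instance (assumption: single-core Solr on its usual port).
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Ask SolrCell to extract a file without indexing it (extractOnly=true).
# FORMAT is "xml" (XHTML, used here as the default) or "text".
# usage: solrcell_extract FILE [FORMAT]
solrcell_extract() {
    curl -sS "$SOLR_URL/update/extract?extractOnly=true&extractFormat=${2:-xml}" \
        -F "myfile=@$1"
}

# Usage (needs a running Solr with the extraction handler enabled), e.g.:
#   solrcell_extract /path/to/document.pdf text
```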

Thanks again


Hayden



Re: Tika Server mode accessing with CURL

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hi Hayden,

Thanks for your email! Have you tried the Tika JAXRS server, documented here:

https://issues.apache.org/jira/browse/TIKA-593
http://wiki.apache.org/tika/TikaJAXRS

It first appeared in 1.2 and can also be run on a port (9998 by default)
to handle cURL interactions.
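
For readers landing on this thread, a minimal sketch of what such cURL
interactions might look like. It assumes a Tika JAXRS server is already
running locally and exposes the /tika (plain text) and /meta (metadata)
resources described on the wiki page above. TIKA_URL, tika_text and tika_meta
are names made up for this example, and port 9998 is an assumption, so
substitute whatever host and port your server actually listens on.

```shell
#!/bin/sh
# Base URL of the Tika JAXRS server (assumption: local server on port 9998).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Upload a document with HTTP PUT (curl -T) and print the extracted plain text.
tika_text() {
    curl -sS -T "$1" "$TIKA_URL/tika"
}

# Upload a document with HTTP PUT (curl -T) and print its metadata.
tika_meta() {
    curl -sS -T "$1" "$TIKA_URL/meta"
}

# Usage (needs a running server), e.g.:
#   tika_meta /opt/lampp/htdocs/joomla25/tmp/InformationRepository.pdf
#   tika_text /opt/lampp/htdocs/joomla25/tmp/InformationRepository.pdf
```

Unlike the raw nc approach, the kind of output is chosen per request by the
resource path, so one running server can serve both text and metadata.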

Cheers,
Chris
