Posted to user@tika.apache.org by Mr Havercamp <mr...@gmail.com> on 2013/10/10 01:50:17 UTC
Using TikaJAXRS with remote files
Hi
I've been working with TikaJAXRS and it is working great.
One thing I'm wondering: the standalone Tika app can extract remote
files when given a URL (in both GUI and command-line mode). Is the
same possible with TikaJAXRS, or with the Tika app launched in
server mode?
The reason is that I may run the indexing client on a separate server, so
it wouldn't necessarily have direct access to the file system where the
files to be indexed reside.
Cheers
Hayden
Re: Using TikaJAXRS with remote files
Posted by "Mattmann, Chris A (398J)" <ch...@jpl.nasa.gov>.
Awesome.
One thought would be to take the below and update our wiki with
the information on how you are integrating TikaJAXRS and cURL.
That seems very useful.
If you wouldn't mind updating the wiki, that would be a great
help to the community!
http://wiki.apache.org/tika/
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Mr Havercamp <mr...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Wednesday, October 9, 2013 9:06 PM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: Re: Using TikaJAXRS with remote files
>Thanks Chris, good to know I'm on the right track.
>
>I guess the caveat to below is that it does fetch the entire file so
>only grabbing the file's metadata on large files (say a video) can take
>a while.
>
>I did attempt passing on the file's headers to the tika server:
>
>curl -I "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>
>and it does make an attempt to fetch the metadata but it results in very
>little real metadata info:
>
>"Content-Encoding","windows-1252"
>"Content-Type","text/plain; charset=windows-1252"
>
>(understandable as Tika Server is expecting the entire file to do its
>magic).
>
>In the meantime I'm using CURL to obtain the file metadata:
>
>curl -I http://url/to/my.video
>
>HTTP/1.1 200 OK
>Date: Thu, 10 Oct 2013 04:01:15 GMT
>Last-Modified: Thu, 10 Oct 2013 04:01:15 GMT
>ETag: 1381377675619
>Expires: Thu, 10 Oct 2013 04:11:15 GMT
>Cache-Control: public
>Cache-Control: max-age=600
>Cache-Control: s-maxage=600
>x-entity-prefix: bitstreams
>x-entity-reference: /to/my.video
>x-entity-url: /to/myfile.html
>x-entity-format: html
>x-sdata-handler: org.dspace.rest.providers.BitstreamProvider
>x-sdata-url: /bitstreams/2416/download
>Content-Disposition: attachment; filename=my.video
>Content-Type: video/x-ms-wmv;charset=UTF-8
>Content-Length: 243062358
>
>then, if the Content-Type matches my preconfigured list of types I want
>to extract, I make another run through using my tika server:
>
>curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
Re: Using TikaJAXRS with remote files
Posted by Mr Havercamp <mr...@gmail.com>.
Thanks Chris, good to know I'm on the right track.
I guess the caveat to the approach below is that it fetches the entire
file, so grabbing just the metadata of a large file (say, a video) can
take a while.
I did try passing only the file's HTTP headers to the Tika server:
curl -I "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
It does attempt to extract metadata, but the result contains very
little real metadata:
"Content-Encoding","windows-1252"
"Content-Type","text/plain; charset=windows-1252"
(understandable, since what the Tika server receives here is only the
header text, and it needs the entire file to do its magic).
In the meantime I'm using curl to obtain the file metadata from the HTTP
response headers:
curl -I http://url/to/my.video
HTTP/1.1 200 OK
Date: Thu, 10 Oct 2013 04:01:15 GMT
Last-Modified: Thu, 10 Oct 2013 04:01:15 GMT
ETag: 1381377675619
Expires: Thu, 10 Oct 2013 04:11:15 GMT
Cache-Control: public
Cache-Control: max-age=600
Cache-Control: s-maxage=600
x-entity-prefix: bitstreams
x-entity-reference: /to/my.video
x-entity-url: /to/myfile.html
x-entity-format: html
x-sdata-handler: org.dspace.rest.providers.BitstreamProvider
x-sdata-url: /bitstreams/2416/download
Content-Disposition: attachment; filename=my.video
Content-Type: video/x-ms-wmv;charset=UTF-8
Content-Length: 243062358
Then, if the Content-Type matches my preconfigured list of types I want
to extract, I make a second pass through my Tika server:
curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
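For anyone who wants to script the two-step check above, here is a minimal sketch. It assumes a POSIX shell with curl and awk available; the variable names, the function names, and the example list of wanted types are placeholders made up for illustration, and the URLs are the same stand-ins used in this thread. It is not a tested production script, just one way the HEAD-then-PUT flow could be wired together.

```shell
#!/bin/sh
# Sketch only: these defaults are placeholders, not values from a real setup.
FILE_URL="${FILE_URL:-http://url/to/my.file}"
TIKA_META="${TIKA_META:-http://myserver/tika/meta}"
WANTED_TYPES="${WANTED_TYPES:-video/x-ms-wmv application/pdf}"

# Read HTTP response headers on stdin and print the bare media type,
# dropping any parameters such as ";charset=UTF-8".
content_type_of() {
    tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" { sub(/;.*/, "", $2); print $2; exit }'
}

# Return 0 if the given media type is in the space-separated WANTED_TYPES list.
is_wanted() {
    for t in $WANTED_TYPES; do
        [ "$t" = "$1" ] && return 0
    done
    return 1
}

# Step 1: HEAD request only (cheap). Step 2: fetch the whole file and
# stream it to the Tika server, but only for wanted types.
extract_if_wanted() {
    ctype=$(curl -sI "$FILE_URL" | content_type_of)
    if is_wanted "$ctype"; then
        curl -s "$FILE_URL" | curl -s -X PUT -T - "$TIKA_META"
    else
        echo "skipping $FILE_URL (Content-Type: ${ctype:-unknown})" >&2
    fi
}
```

Nothing runs until extract_if_wanted is called, so the functions can be sourced into an indexing script and invoked per URL. The full download still happens for every file that passes the filter; the sketch only avoids it for types that are not wanted.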
On 10/10/13 10:35, Chris Mattmann wrote:
> Looks good to me! Excellent work and not sure I have
> a better way atm..
>
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
Re: Using TikaJAXRS with remote files
Posted by Chris Mattmann <ch...@gmail.com>.
Looks good to me! Excellent work, and I'm not sure I have
a better way at the moment.
------------------------
Chris Mattmann
chris.mattmann@gmail.com
-----Original Message-----
From: Mr Havercamp <mr...@gmail.com>
Reply-To: <us...@tika.apache.org>
Date: Wednesday, October 9, 2013 7:27 PM
To: <us...@tika.apache.org>
Subject: Re: Using TikaJAXRS with remote files
>Success!
>
>For anybody else interested:
>
>curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>
>However would be interested if anybody else has a different/more
>efficient way of doing such an operation.
Re: Using TikaJAXRS with remote files
Posted by Mr Havercamp <mr...@gmail.com>.
Success!
For anybody else interested:
curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
However, I would be interested to hear if anybody has a different or
more efficient way of doing this.
On 10/10/13 10:11, Mr Havercamp wrote:
> Further to my previous post:
>
> I can send remote files using a combination of the tika app running in
> server mode, curl and nc:
>
> java -jar tika-app-1.3.jar --server 1234
>
> curl "http://url/to/my.file" | nc localhost 1234
>
> So I guess now the only missing piece is being able to send remote
> files to JAXRS for extraction.
Re: Using TikaJAXRS with remote files
Posted by Mr Havercamp <mr...@gmail.com>.
Further to my previous post:
I can send remote files using a combination of the Tika app running in
server mode, curl, and nc:
java -jar tika-app-1.3.jar --server 1234
curl "http://url/to/my.file" | nc localhost 1234
So I guess now the only missing piece is being able to send remote files
to JAXRS for extraction.
On 10/10/13 07:50, Mr Havercamp wrote:
> Hi
>
> Been working with tika jaxrs and it is working great.
>
> One thing I'm wondering; the standalone Tika app can extract remote
> files by providing a url (both in GUI and CMD mode); I'm wondering if
> the same is at all possible with TikaJAXRS or Tika app launched in
> server mode?
>
> The reason being I may run an indexing client on a separate server so
> it wouldn't necessarily have direct access to the file system where
> the files to be indexed reside.
>
> Cheers
>
>
> Hayden