Posted to user@tika.apache.org by Mr Havercamp <mr...@gmail.com> on 2013/10/10 01:50:17 UTC

Using TikaJAXRS with remote files

Hi

I've been working with TikaJAXRS and it is working great.

One thing I'm wondering: the standalone Tika app can extract remote 
files when given a URL (in both GUI and command-line mode). Is the 
same possible with TikaJAXRS, or with the Tika app launched in 
server mode?

The reason is that I may run the indexing client on a separate server, 
so it wouldn't necessarily have direct access to the file system where 
the files to be indexed reside.

Cheers


Hayden

Re: Using TikaJAXRS with remote files

Posted by "Mattmann, Chris A (398J)" <ch...@jpl.nasa.gov>.
Awesome.

One thought would be to take the below and update our wiki with
the information on how you are integrating TikaJAXRS and cURL.
That seems very useful.

If you wouldn't mind updating the wiki, that would be a great
help to the community!

http://wiki.apache.org/tika/

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Mr Havercamp <mr...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Wednesday, October 9, 2013 9:06 PM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: Re: Using TikaJAXRS with remote files

>Thanks Chris, good to know I'm on the right track.
>
>The caveat with this approach is that it fetches the entire file, so
>grabbing just the metadata of a large file (say a video) can take
>a while.
>
>I did try passing only the file's headers to the Tika server:
>
>curl -I "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>
>It does attempt to extract metadata, but the result contains very
>little real metadata:
>
>"Content-Encoding","windows-1252"
>"Content-Type","text/plain; charset=windows-1252"
>
>(understandable, as the Tika server receives only the header text and
>needs the entire file to do its magic).
>
>In the meantime I'm using curl to obtain the file's metadata from its
>HTTP headers:
>
>curl -I http://url/to/my.video
>
>HTTP/1.1 200 OK
>Date: Thu, 10 Oct 2013 04:01:15 GMT
>Last-Modified: Thu, 10 Oct 2013 04:01:15 GMT
>ETag: 1381377675619
>Expires: Thu, 10 Oct 2013 04:11:15 GMT
>Cache-Control: public
>Cache-Control: max-age=600
>Cache-Control: s-maxage=600
>x-entity-prefix: bitstreams
>x-entity-reference: /to/my.video
>x-entity-url: /to/myfile.html
>x-entity-format: html
>x-sdata-handler: org.dspace.rest.providers.BitstreamProvider
>x-sdata-url: /bitstreams/2416/download
>Content-Disposition: attachment; filename=my.video
>Content-Type: video/x-ms-wmv;charset=UTF-8
>Content-Length: 243062358
>
>Then, if the Content-Type matches my preconfigured list of types I want
>to extract, I make a second pass, this time through my Tika server:
>
>curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>
>
>On 10/10/13 10:35, Chris Mattmann wrote:
>> Looks good to me! Excellent work and not sure I have
>> a better way atm..
>>
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>>
>>
>>
>>
>> -----Original Message-----
>> From: Mr Havercamp <mr...@gmail.com>
>> Reply-To: <us...@tika.apache.org>
>> Date: Wednesday, October 9, 2013 7:27 PM
>> To: <us...@tika.apache.org>
>> Subject: Re: Using TikaJAXRS with remote files
>>
>>> Success!
>>>
>>> For anybody else interested:
>>>
>>> curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
>>>
>>> However, I would be interested to hear if anybody has a different or
>>> more efficient way of doing this.
>>>
>>> On 10/10/13 10:11, Mr Havercamp wrote:
>>>> Further to my previous post:
>>>>
>>>> I can send remote files using a combination of the Tika app running in
>>>> server mode, curl and nc:
>>>>
>>>> java -jar tika-app-1.3.jar --server 1234
>>>>
>>>> curl "http://url/to/my.file" | nc localhost 1234
>>>>
>>>> So I guess now the only missing piece is being able to send remote
>>>> files to JAXRS for extraction.
>>>>
>>>> On 10/10/13 07:50, Mr Havercamp wrote:
>>>>> Hi
>>>>>
>>>>> I've been working with TikaJAXRS and it is working great.
>>>>>
>>>>> One thing I'm wondering: the standalone Tika app can extract remote
>>>>> files when given a URL (in both GUI and command-line mode). Is the
>>>>> same possible with TikaJAXRS, or with the Tika app launched in
>>>>> server mode?
>>>>>
>>>>> The reason is that I may run the indexing client on a separate server,
>>>>> so it wouldn't necessarily have direct access to the file system where
>>>>> the files to be indexed reside.
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> Hayden
>>
>


Re: Using TikaJAXRS with remote files

Posted by Mr Havercamp <mr...@gmail.com>.
Thanks Chris, good to know I'm on the right track.

The caveat with this approach is that it fetches the entire file, so 
grabbing just the metadata of a large file (say a video) can take 
a while.

I did try passing only the file's headers to the Tika server:

curl -I "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta

It does attempt to extract metadata, but the result contains very 
little real metadata:

"Content-Encoding","windows-1252"
"Content-Type","text/plain; charset=windows-1252"

(understandable, as the Tika server receives only the header text and 
needs the entire file to do its magic).

In the meantime I'm using curl to obtain the file's metadata from its 
HTTP headers:

curl -I http://url/to/my.video

HTTP/1.1 200 OK
Date: Thu, 10 Oct 2013 04:01:15 GMT
Last-Modified: Thu, 10 Oct 2013 04:01:15 GMT
ETag: 1381377675619
Expires: Thu, 10 Oct 2013 04:11:15 GMT
Cache-Control: public
Cache-Control: max-age=600
Cache-Control: s-maxage=600
x-entity-prefix: bitstreams
x-entity-reference: /to/my.video
x-entity-url: /to/myfile.html
x-entity-format: html
x-sdata-handler: org.dspace.rest.providers.BitstreamProvider
x-sdata-url: /bitstreams/2416/download
Content-Disposition: attachment; filename=my.video
Content-Type: video/x-ms-wmv;charset=UTF-8
Content-Length: 243062358

Then, if the Content-Type matches my preconfigured list of types I want 
to extract, I make a second pass, this time through my Tika server:

curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta
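The two-step check described above (HEAD request first, full download only 
when the Content-Type is on the list) could be scripted roughly as follows. 
This is only a sketch: the function names, the WANTED_TYPES list and the 
example URLs are assumptions made for illustration, not part of Tika or of 
the setup described in this thread.

```shell
#!/bin/sh
# Space-separated list of media types worth sending to Tika
# (hypothetical list; adjust to your own configuration).
WANTED_TYPES="video/x-ms-wmv application/pdf"

# Read HTTP response headers on stdin and print the bare media type,
# lower-cased, without parameters such as ";charset=UTF-8".
content_type_of() {
    tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" {
        split($2, parts, ";")
        print tolower(parts[1])
        exit
    }'
}

# Succeed if the given media type appears in WANTED_TYPES.
is_wanted_type() {
    case " $WANTED_TYPES " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# HEAD the file first; download it and PUT it to Tika only on a match.
extract_if_wanted() {
    url=$1
    tika=$2
    type=$(curl -sS -I "$url" | content_type_of)
    if is_wanted_type "$type"; then
        curl -sS "$url" | curl -sS -X PUT -T - "$tika"
    else
        echo "skipping $url (Content-Type: ${type:-unknown})" >&2
    fi
}
```

It would be called as, for example, 
extract_if_wanted "http://url/to/my.video" "http://myserver/tika/meta". 
The large file is still downloaded in full when the type matches; the HEAD 
request only avoids the download for types that are not of interest.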




Re: Using TikaJAXRS with remote files

Posted by Chris Mattmann <ch...@gmail.com>.
Looks good to me! Excellent work and not sure I have
a better way atm..

------------------------
Chris Mattmann
chris.mattmann@gmail.com







Re: Using TikaJAXRS with remote files

Posted by Mr Havercamp <mr...@gmail.com>.
Success!

For anybody else interested:

curl "http://url/to/my.file" | curl -X PUT -T - http://myserver/tika/meta

However, I would be interested to hear if anybody has a different or 
more efficient way of doing this.
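
One thing the one-liner above does not guard against is a failed download: 
without --fail, curl writes the body of an HTTP error response (for example 
a 404 page) to stdout, and that page would be sent to Tika as if it were the 
file. A more defensive variant, not a more efficient one, is sketched below. 
It trades streaming for a temporary file so that nothing is sent when the 
download fails. The function name and URLs are assumptions for illustration, 
and mktemp is assumed to be available (it is common but not required by 
POSIX).

```shell
#!/bin/sh
# Download url to a temporary file, then PUT it to the Tika endpoint.
# Returns curl's exit status; nothing is sent to Tika if the download fails.
put_remote_file() {
    url=$1
    tika=$2
    tmp=$(mktemp) || return 1
    if curl -sS --fail -o "$tmp" "$url"; then
        # Download succeeded: upload the saved copy.
        curl -sS -X PUT -T "$tmp" "$tika"
        status=$?
    else
        # $? here is the exit status of the failed download.
        status=$?
        echo "download failed for $url (curl exit $status)" >&2
    fi
    rm -f "$tmp"
    return "$status"
}
```

Usage would look like 
put_remote_file "http://url/to/my.file" "http://myserver/tika/meta". 
The cost is disk space for one copy of the file, which matters for large 
videos.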



Re: Using TikaJAXRS with remote files

Posted by Mr Havercamp <mr...@gmail.com>.
Further to my previous post:

I can send remote files using a combination of the Tika app running in 
server mode, curl and nc:

java -jar tika-app-1.3.jar --server 1234

curl "http://url/to/my.file" | nc localhost 1234

So I guess now the only missing piece is being able to send remote files 
to JAXRS for extraction.
