Posted to user@tika.apache.org by Bratislav Stojanovic <br...@gmail.com> on 2014/08/07 14:44:14 UTC

Compression of Tika server output files

Hi,

I'm trying to get text, metadata and attachments all in one request using
tika-server (JAX-RS), but
the only thing I can get as an output is either uncompressed ZIP or TAR.

Is there any way to :

- set compression level? Having uncompressed ZIP/TAR with resources
actually occupies more space than having plain __METADATA__ , __TEXT__ and
other files because of additional ZIP/TAR headers. If I decide to use
ZIP/TAR I would like to save some hd space.

- or use a simple folder instead of an output file, with all extracted
resources inside? I would prefer this because I would not have to
decompress the output to reach the extracted resources.
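
To illustrate the first point, here is a minimal Java sketch (plain java.util.zip, not Tika code) that packs two stand-in files into a ZIP at compression level 0 and at level 9. The class name ZipLevelDemo, its helper methods and the sample content are assumptions made for this illustration; only the __TEXT__ and __METADATA__ entry names come from the post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipLevelDemo {

    /** Packs a name-to-content map into a ZIP archive at the given Deflater level (0-9). */
    static byte[] zip(Map<String, byte[]> files, int level) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            zos.setLevel(level); // 0 = no compression, 9 = best compression
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zos.putNextEntry(new ZipEntry(file.getKey()));
                zos.write(file.getValue());
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Stand-in content; real output would come from the server. */
    static Map<String, byte[]> sampleFiles() {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            text.append("Extracted text line ").append(i).append('\n');
        }
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("__TEXT__", text.toString().getBytes(StandardCharsets.UTF_8));
        files.put("__METADATA__",
                "\"Content-Type\",\"application/msword\"\n".getBytes(StandardCharsets.UTF_8));
        return files;
    }

    /** Total size of the files if they were stored loose in a folder. */
    static int rawSize(Map<String, byte[]> files) {
        int total = 0;
        for (byte[] content : files.values()) {
            total += content.length;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = sampleFiles();
        System.out.println("loose files:  " + rawSize(files) + " bytes");
        System.out.println("zip, level 0: " + zip(files, Deflater.NO_COMPRESSION).length + " bytes");
        System.out.println("zip, level 9: " + zip(files, Deflater.BEST_COMPRESSION).length + " bytes");
    }
}
```

At level 0 the archive comes out larger than the loose files because every entry carries its own ZIP headers, which is the overhead described above.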

Basically, I would like to specify compression or folder in this command :

curl -T example.doc http://localhost:9998/all > outputFolder

I haven't found any related info on http://wiki.apache.org/tika/TikaJAXRS
or mailing list archives, so
please help :)

-- 
Bratislav Stojanovic, M.Sc.

Re: Compression of Tika server output files

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
> This was exactly what I was afraid of...you see, I have to extract 
> thousands and thousands of documents and calling java command *three 
> times* for each of them is highly inefficient.

The Tika App is largely intended for testing, debugging, demos and light 
use from non-Java environments. It was never really intended for very 
heavy use

> I want to keep tika in memory somehow and in a single VM, not to 
> instantiate new VM every time I need to extract something.

Have you thought about calling the Java code from C? It's not as bad as it 
used to be... What you want to do is pretty easy in Java, so that's one 
way to tackle it

Otherwise, might be best to look into adding your own custom CXF endpoint 
to the tika server, to return everything you need in one go.

Nick

Re: Compression of Tika server output files

Posted by Bratislav Stojanovic <br...@gmail.com>.
This was exactly what I was afraid of...you see, I have to extract
thousands and thousands of documents and calling java
command *three times* for each of them is highly inefficient. I want to
keep tika in memory somehow and in a single VM,
not to instantiate new VM every time I need to extract something. That's
why running tika-server is almost ideal for me - yes,
I have to decompress ZIP/TAR first, but I get everything in a single call
which works much faster.

Any suggestions how to wrap tika (app) to extract everything in one call
and to stay in a single VM so HotSpot can perform
optimizations? I guess something in between tika-app and tika-server...

Thank you.

On Thu, Aug 7, 2014 at 5:32 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
>
>> Hmm, I apologize, but I'm afraid this does not work. If you specify :
>>
>> *java -jar tika-app-1.5-SNAPSHOT.jar --text --metadata --extract
>> --extract-dir=out example.doc*
>>
>>
>> ...it will only extract attachments, not everything (text + meta +
>> attachments). Any flags I'm missing?
>>
>
> With the Tika App, you'll need to run it three times, once for text, once
> for metadata, once for embedded resource extraction
>
> If you want to do all 3 in one go, you'll need to write a few lines of Java
>
> Nick
>



-- 
Bratislav Stojanovic, M.Sc.

Re: Compression of Tika server output files

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
> Hmm, I apologize, but I'm afraid this does not work. If you specify :
>
> *java -jar tika-app-1.5-SNAPSHOT.jar --text --metadata --extract
> --extract-dir=out example.doc*
>
> ...it will only extract attachments, not everything (text + meta +
> attachments). Any flags I'm missing?

With the Tika App, you'll need to run it three times, once for text, once 
for metadata, once for embedded resource extraction

If you want to do all 3 in one go, you'll need to write a few lines of 
Java

Nick

Re: Compression of Tika server output files

Posted by Bratislav Stojanovic <br...@gmail.com>.
Hmm, I apologize, but I'm afraid this does not work. If you specify :

*java -jar tika-app-1.5-SNAPSHOT.jar --text --metadata --extract
--extract-dir=out example.doc*

...it will only extract attachments, not everything (text + meta +
attachments). Any flags I'm missing?

Thank you for your feedback.

On Thu, Aug 7, 2014 at 5:09 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
>
>> OK, but I don't really have to use http...does tika support extracting
>> all resources in one call by some other method?
>>
>
> The Tika App does - the -z / --extract will do that. You might also want
> to use the --extract-dir=<dir> flag to set where they go
>
> Nick
>



-- 
Bratislav Stojanovic, M.Sc.

Re: Compression of Tika server output files

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
> OK, but I don't really have to use http...does tika support extracting 
> all resources in one call by some other method?

The Tika App does - the -z / --extract will do that. You might also want 
to use the --extract-dir=<dir> flag to set where they go

Nick

Re: Compression of Tika server output files

Posted by Bratislav Stojanovic <br...@gmail.com>.
OK, but I don't really have to use http...does tika support extracting all
resources in one call by some other method?

My initial goal of using tika server was to extract everything in one call
(doesn't matter if it's http or not) and without
using Java API - I'm calling it from C code


On Thu, Aug 7, 2014 at 4:51 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
>
>> Yes, GZIP compression will do the job for me...but having plain folders
>> and files as output would be even better.
>>
>> How complicated would it be to add an option to the Tika source to output
>> folders and files directly, without packing them into an archive format?
>>
>
> Once you've re-written the http spec, and got browsers to implement the
> changes, it ought to be quite easy...
>
> (You're basically limited to just one file being returned by a http
> response that a browser or similar will receive)
>
> Nick
>



-- 
Bratislav Stojanovic, M.Sc.
Owner

*Cloudscraper Enterprise Search*
Skype: bratislav83
http://www.cloudscraper.ca

Re: Compression of Tika server output files

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 7 Aug 2014, Bratislav Stojanovic wrote:
> Yes, GZIP compression will do the job for me...but having plain folders
> and files as output would be even better.
>
> How complicated would it be to add an option to the Tika source to output
> folders and files directly, without packing them into an archive format?

Once you've re-written the http spec, and got browsers to implement the 
changes, it ought to be quite easy...

(You're basically limited to just one file being returned by a http 
response that a browser or similar will receive)
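
Because the response has to arrive as a single archive, one client-side workaround is to unpack it into a folder as it streams in. A minimal Java sketch, assuming the response is a ZIP archive; the class UnpackToFolder and its methods are hypothetical names, and the built-in sample archive only stands in for real server output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class UnpackToFolder {

    /** Streams ZIP entries from the archive into targetDir; returns the number of files written. */
    static int unpack(InputStream archive, Path targetDir) {
        int written = 0;
        try (ZipInputStream zis = new ZipInputStream(archive)) {
            Path root = targetDir.toAbsolutePath().normalize();
            Files.createDirectories(root);
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside the target folder.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes target folder: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    // Copies only the current entry; the ZIP stream itself stays open.
                    Files.copy(zis, target, StandardCopyOption.REPLACE_EXISTING);
                    written++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    /** Tiny in-memory archive used when no real server response is piped in. */
    static byte[] sampleArchive() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            zos.putNextEntry(new ZipEntry("__TEXT__"));
            zos.write("Hello from the sample document\n".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("__METADATA__"));
            zos.write("\"Content-Type\",\"text/plain\"\n".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // With a folder argument, read the archive from stdin; otherwise unpack the sample.
        InputStream source = args.length > 0 ? System.in : new ByteArrayInputStream(sampleArchive());
        Path target = Paths.get(args.length > 0 ? args[0] : "outputFolder");
        System.out.println("Wrote " + unpack(source, target) + " file(s) to " + target.toAbsolutePath());
    }
}
```

Once compiled, something like `curl -T example.doc http://localhost:9998/all | java UnpackToFolder outputFolder` should leave the extracted files directly in outputFolder, provided the server answers with a ZIP.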

Nick

Re: Compression of Tika server output files

Posted by Bratislav Stojanovic <br...@gmail.com>.
Yes, GZIP compression will do the job for me...but having plain folders
and files as output would be even better.

How complicated would it be to add an option to the Tika source to output
folders and files directly, without packing them into an archive format?


On Thu, Aug 7, 2014 at 3:31 PM, Sergey Beryozkin <sb...@gmail.com>
wrote:

> By the way, would default GZIP compression suit you?
> If yes, we can have it done even without the extra CXF changes.
>
> Sergey
>
>
> On 07/08/14 16:15, Sergey Beryozkin wrote:
>
>> Hi
>>
>> I can try to enhance a CXF GzipOutInterceptor (at CXF level) to use a
>> compressing Deflater in GZIP compatible mode. The server will react to a
>> client accepting GZIP and compress the out payloads.
>>
>> I think it would be a good idea to have a Tika server war module
>> introduced so that users can easily add custom in/out filters to the
>> JAX-RS endpoint.
>>
>> I guess we can do it for 1.7
>> Sergey
>> DEFLATER
>>
>>
>>
>> On 07/08/14 15:44, Bratislav Stojanovic wrote:
>>
>>> Hi,
>>>
>>> I'm trying to get text, metadata and attachments all in one request
>>> using tika-server (JAX-RS), but
>>> the only thing I can get as an output is either uncompressed ZIP or TAR.
>>>
>>> Is there any way to :
>>>
>>> - set compression level? Having uncompressed ZIP/TAR with resources
>>> actually occupies more space than having plain __METADATA__ , __TEXT__
>>> and other files because of additional ZIP/TAR headers. If I decide to
>>> use ZIP/TAR I would like to save some hd space.
>>>
>>> - or use a simple folder instead of output file with all extracted
>>> resources inside? This is desired
>>> for me because I don't have to decompress output to reach the extracted
>>> resources
>>>
>>> Basically, I would like to specify compression or folder in this
>>> command :
>>>
>>> curl -T example.doc http://localhost:9998/all > outputFolder
>>>
>>> I haven't found any related info on
>>> http://wiki.apache.org/tika/TikaJAXRS or mailing list archives, so
>>> please help :)
>>>
>>> --
>>> Bratislav Stojanovic, M.Sc.
>>>
>>
>>


-- 
Bratislav Stojanovic, M.Sc.

Re: Compression of Tika server output files

Posted by Sergey Beryozkin <sb...@gmail.com>.
By the way, would default GZIP compression suit you?
If yes, we can have it done even without the extra CXF changes.

Sergey

On 07/08/14 16:15, Sergey Beryozkin wrote:
> Hi
>
> I can try to enhance a CXF GzipOutInterceptor (at CXF level) to use a
> compressing Deflater in GZIP compatible mode. The server will react to a
> client accepting GZIP and compress the out payloads.
>
> I think it would be a good idea to have a Tika server war module
> introduced so that users can easily add custom in/out filters to the
> JAX-RS endpoint.
>
> I guess we can do it for 1.7
> Sergey
> DEFLATER
>
>
> On 07/08/14 15:44, Bratislav Stojanovic wrote:
>> Hi,
>>
>> I'm trying to get text, metadata and attachments all in one request
>> using tika-server (JAX-RS), but
>> the only thing I can get as an output is either uncompressed ZIP or TAR.
>>
>> Is there any way to :
>>
>> - set compression level? Having uncompressed ZIP/TAR with resources
>> actually occupies more space than having plain __METADATA__ , __TEXT__
>> and other files because of additional ZIP/TAR headers. If I decide to
>> use ZIP/TAR I would like to save some hd space.
>>
>> - or use a simple folder instead of output file with all extracted
>> resources inside? This is desired
>> for me because I don't have to decompress output to reach the extracted
>> resources
>>
>> Basically, I would like to specify compression or folder in this
>> command :
>>
>> curl -T example.doc http://localhost:9998/all > outputFolder
>>
>> I haven't found any related info on
>> http://wiki.apache.org/tika/TikaJAXRS or mailing list archives, so
>> please help :)
>>
>> --
>> Bratislav Stojanovic, M.Sc.
>

Re: Compression of Tika server output files

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

I can try to enhance a CXF GzipOutInterceptor (at CXF level) to use a 
compressing Deflater in GZIP compatible mode. The server will react to a 
client accepting GZIP and compress the out payloads.
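
As a rough illustration of what a compressing Deflater in GZIP-compatible mode could look like in plain Java (this is not the CXF GzipOutInterceptor code), the sketch below subclasses java.util.zip.GZIPOutputStream to set the Deflater level. LeveledGzipOutputStream, GzipLevelDemo and the sample payload are names invented here for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLevelDemo {

    /** GZIP stream with a selectable level; `def` is the protected Deflater inherited from DeflaterOutputStream. */
    static final class LeveledGzipOutputStream extends GZIPOutputStream {
        LeveledGzipOutputStream(OutputStream out, int level) throws IOException {
            super(out);
            def.setLevel(level);
        }
    }

    /** Compresses the payload into GZIP format at the given Deflater level (0-9). */
    static byte[] gzip(byte[] payload, int level) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream gz = new LeveledGzipOutputStream(buffer, level)) {
            gz.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses GZIP data, as a client that sent "Accept-Encoding: gzip" would have to. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Stand-in for a server response body. */
    static byte[] samplePayload() {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            text.append("Extracted text line ").append(i).append('\n');
        }
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] payload = samplePayload();
        System.out.println("payload:       " + payload.length + " bytes");
        System.out.println("gzip, level 0: " + gzip(payload, Deflater.NO_COMPRESSION).length + " bytes");
        System.out.println("gzip, level 9: " + gzip(payload, Deflater.BEST_COMPRESSION).length + " bytes");
    }
}
```

The output is still ordinary GZIP, so any client that understands `Content-Encoding: gzip` can read it whichever level is chosen.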

I think it would be a good idea to have a Tika server war module
introduced so that users can easily add custom in/out filters to the
JAX-RS endpoint.

I guess we can do it for 1.7
Sergey



On 07/08/14 15:44, Bratislav Stojanovic wrote:
> Hi,
>
> I'm trying to get text, metadata and attachments all in one request
> using tika-server (JAX-RS), but
> the only thing I can get as an output is either uncompressed ZIP or TAR.
>
> Is there any way to :
>
> - set compression level? Having uncompressed ZIP/TAR with resources
> actually occupies more space than having plain __METADATA__ , __TEXT__
> and other files because of additional ZIP/TAR headers. If I decide to
> use ZIP/TAR I would like to save some hd space.
>
> - or use a simple folder instead of output file with all extracted
> resources inside? This is desired
> for me because I don't have to decompress output to reach the extracted
> resources
>
> Basically, I would like to specify compression or folder in this command :
>
> curl -T example.doc http://localhost:9998/all > outputFolder
>
> I haven't found any related info on
> http://wiki.apache.org/tika/TikaJAXRS or mailing list archives, so
> please help :)
>
> --
> Bratislav Stojanovic, M.Sc.