Posted to dev@tika.apache.org by Doug Carter <dc...@mercycorps.org> on 2010/01/15 20:07:05 UTC

Tika command line performance

Hi all,

This may be off-topic for this list, but I need to start somewhere.

I need a command line utility to do document format conversion in a
batch-mode environment. The batch process is a combination of steps, one
of which is the actual format conversion, which is currently being done
by a collection of Linux binary converters such as wvWare and pdftohtml.

I've put a shell script wrapper around the tika jar:

  java -jar tika-app.jar [infile] > [outfile]

This works OK, but as you would imagine, it is much slower than
a Linux binary.

Does anyone know of a way to improve the performance in a setup like
this? I know it goes against the whole philosophy of Java, but is there
a way to compile the Tika jar byte code into a native Linux binary? I've
taken a look at gcj, but it doesn't look like a simple re-compile.

Any ideas would be greatly appreciated.

TIA,

Doug


Re: Tika command line performance

Posted by Luke Nezda <ln...@gmail.com>.
Maybe you could try Nailgun
<http://martiansoftware.com/nailgun/index.html>; if I understand
correctly, it's a C socket wrapper around a simple Java socket
server that holds the JVM open.  I've never actually used it, but it
sounds like you have a use case where it could be beneficial (assuming
JVM init overhead is the slowest part).
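
To make the general idea concrete, here is a minimal sketch of a JVM that
stays up and answers conversion requests over a local socket, so startup
cost is paid once. This is NOT Nailgun's protocol or API: the class name
ConversionServer, the one-line-per-request protocol, and the placeholder
converter are all assumptions made up for illustration. A real setup would
call into Tika where the placeholder is.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical sketch: a long-lived JVM answering one request per connection.
// Assumed protocol (not Nailgun's): the client sends one line containing a
// file path; the server replies with one line of output and closes.
class ConversionServer implements AutoCloseable {
    private final ServerSocket listener;
    private final Function<String, String> convert;

    ConversionServer(int port, Function<String, String> convert) throws IOException {
        // Bind to loopback only; port 0 asks the OS for any free port.
        this.listener = new ServerSocket(port, 50, InetAddress.getLoopbackAddress());
        this.convert = convert;
    }

    int port() {
        return listener.getLocalPort();
    }

    // Accept and answer requests until the listener is closed.
    void serve() {
        while (!listener.isClosed()) {
            try (Socket client = listener.accept();
                 BufferedReader in = new BufferedReader(new InputStreamReader(
                         client.getInputStream(), StandardCharsets.UTF_8));
                 PrintWriter out = new PrintWriter(new OutputStreamWriter(
                         client.getOutputStream(), StandardCharsets.UTF_8), true)) {
                String path = in.readLine();
                if (path != null) {
                    out.println(convert.apply(path));
                }
            } catch (IOException e) {
                // accept() throws once close() is called; the loop condition
                // then ends the loop. Other I/O errors drop that one request.
            }
        }
    }

    @Override
    public void close() throws IOException {
        listener.close();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder converter; a real one would invoke Tika on the path.
        try (ConversionServer server = new ConversionServer(0, p -> "converted: " + p)) {
            System.out.println("listening on port " + server.port());
            server.serve();
        }
    }
}
```

The shell wrapper would then need only a small, fast-starting client that
sends the path and copies the reply to the output file.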

Good luck,
- Luke

Re: Tika command line performance

Posted by Doug Carter <dc...@mercycorps.org>.
On Fri, Jan 15, 2010 at 11:37:30AM -0800, Ken Krugler wrote:
> 
> On Jan 15, 2010, at 11:27am, Doug Carter wrote:
> 
> >On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
> >>
> >>If you have a set of documents, easiest would be to pass in a
> >>directory to tika-app (extend it a bit) so that one invocation of the
> >>JVM processes many documents.
> >
> >Hi Ken,
> >
> >I've considered something like this (for the exact reason you stated)
> >but I don't have that flexibility with my current setup. Each document
> >needs to go through a series of processing steps, one of which is the
> >format conversion.
> 
> In that case, another cheesy solution is to have the Java process  
> watch a specific directory. Whenever a new file (with the appropriate  
> name format) appears, it gets processed. This Java process then  
> continues to run indefinitely as a kind of processing daemon.
> 
> You can avoid hand-off problems by using a name pattern, and renaming  
> the file when it's really ready for processing.
> 
> There are lots of cleaner, more sophisticated systems involving  
> notification systems, queues, RESTful services, etc. which might be  
> more appropriate, depending on your needs.

Interesting approach. Thanks for the idea.

Doug


Re: Tika command line performance

Posted by Ken Krugler <kk...@transpac.com>.
On Jan 15, 2010, at 11:27am, Doug Carter wrote:

> On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
>>
>> If you have a set of documents, easiest would be to pass in a
>> directory to tika-app (extend it a bit) so that one invocation of the
>> JVM processes many documents.
>
> Hi Ken,
>
> I've considered something like this (for the exact reason you stated)
> but I don't have that flexibility with my current setup. Each document
> needs to go through a series of processing steps, one of which is the
> format conversion.

In that case, another cheesy solution is to have the Java process  
watch a specific directory. Whenever a new file (with the appropriate  
name format) appears, it gets processed. This Java process then  
continues to run indefinitely as a kind of processing daemon.

You can avoid hand-off problems by using a name pattern, and renaming  
the file when it's really ready for processing.

There are lots of cleaner, more sophisticated systems involving  
notification systems, queues, RESTful services, etc. which might be  
more appropriate, depending on your needs.
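
A minimal sketch of the watched-directory approach, using plain polling.
The class name DropDirDaemon, the ".tmp" / ".ready" / ".done" / ".out"
suffixes, and the placeholder converter are assumptions for illustration,
not anything Tika provides. The producer is assumed to write "name.tmp" and
rename it to "name.ready" only when the file is complete.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a polling "drop directory" daemon.
class DropDirDaemon {

    // Convert every "*.ready" file currently in dir; returns how many were handled.
    static int pollOnce(Path dir, Function<Path, String> convert) throws IOException {
        // Snapshot the matching files first so the directory is not
        // modified while it is being listed.
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.ready")) {
            for (Path p : stream) {
                ready.add(p);
            }
        }
        for (Path in : ready) {
            String name = in.getFileName().toString();
            String base = name.substring(0, name.length() - ".ready".length());
            // Apply the same rename trick to the output, so a downstream step
            // never sees a half-written ".out" file.
            Path tmpOut = dir.resolve(base + ".out.tmp");
            Files.write(tmpOut, convert.apply(in).getBytes(StandardCharsets.UTF_8));
            Files.move(tmpOut, dir.resolve(base + ".out"), StandardCopyOption.REPLACE_EXISTING);
            // Rename the input so the next poll does not process it again.
            Files.move(in, dir.resolve(base + ".done"), StandardCopyOption.REPLACE_EXISTING);
        }
        return ready.size();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: DropDirDaemon <directory>");
            return;
        }
        Path dir = Paths.get(args[0]);
        // Placeholder converter; a real one would call into Tika here.
        Function<Path, String> convert = p -> "converted " + p.getFileName();
        while (true) {          // the JVM stays up, so startup cost is paid once
            pollOnce(dir, convert);
            Thread.sleep(500);  // polling interval, chosen arbitrarily
        }
    }
}
```

The batch step would then write its input as "name.tmp", rename it to
"name.ready", and wait for "name.out" to appear.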

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Tika command line performance

Posted by Doug Carter <dc...@mercycorps.org>.
On Fri, Jan 15, 2010 at 11:19:31AM -0800, Ken Krugler wrote:
> If you have a set of documents, easiest would be to pass in a  
> directory to tika-app (extend it a bit) so that one invocation of the  
> JVM processes many documents.

Hi Ken,

I've considered something like this (for the exact reason you stated)
but I don't have that flexibility with my current setup. Each document
needs to go through a series of processing steps, one of which is the
format conversion.

Thanks for the idea, though.

Doug



Re: Tika command line performance

Posted by Ken Krugler <kk...@transpac.com>.
If you have a set of documents, easiest would be to pass in a  
directory to tika-app (extend it a bit) so that one invocation of the  
JVM processes many documents.
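
A rough sketch of what such a batch driver could look like. The class name
BatchConvert, the ".txt" output naming, and the placeholder converter are
assumptions for illustration and are not part of tika-app; a real version
would call into Tika where the placeholder is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: convert every regular file in a directory within a
// single JVM invocation, so startup cost is paid once per batch.
class BatchConvert {

    // Writes one "<input name>.txt" per input file into outDir and
    // returns the output paths in sorted input order.
    static List<Path> convertAll(Path inDir, Path outDir,
                                 Function<Path, String> convert) throws IOException {
        Files.createDirectories(outDir);
        List<Path> inputs;
        try (Stream<Path> entries = Files.list(inDir)) {
            inputs = entries.filter(Files::isRegularFile)
                            .sorted()
                            .collect(Collectors.toList());
        }
        List<Path> outputs = new ArrayList<>();
        for (Path in : inputs) {
            Path out = outDir.resolve(in.getFileName() + ".txt");
            Files.write(out, convert.apply(in).getBytes(StandardCharsets.UTF_8));
            outputs.add(out);
        }
        return outputs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: BatchConvert <indir> <outdir>");
            return;
        }
        // Placeholder converter; a real one would call into Tika here.
        Function<Path, String> convert = p -> "converted " + p.getFileName();
        for (Path out : convertAll(Paths.get(args[0]), Paths.get(args[1]), convert)) {
            System.out.println(out);
        }
    }
}
```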

-- Ken

