Posted to user@tika.apache.org by Chris Mattmann <ma...@apache.org> on 2023/01/05 15:29:04 UTC

Re: [EXTERNAL] Re: Subset(s) of Tika?

Not sure of your operating environment, but if you are using Python you can 
also use http://github.com/chrismattmann/tika-python: run ‘pip install tika’ 
and you get a Python wrapper around the latest Tika server (2.6.0).
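
As a rough sketch of what that looks like in practice: tika-python exposes parser.from_file, which returns a dict with "content" and "metadata" keys. The extract_text helper and its parse parameter below are illustrative names only, not part of the library, and Java still needs to be available because the wrapper talks to a Tika server behind the scenes.

```python
from typing import Callable, Optional


def extract_text(path: str, parse: Optional[Callable[[str], dict]] = None) -> str:
    """Return the plain text Tika extracts from `path` (empty string if none)."""
    if parse is None:
        # Imported lazily so the helper can be exercised without tika installed.
        # tika-python starts (or connects to) a Tika server on first use.
        from tika import parser
        parse = parser.from_file
    parsed = parse(path)
    # "content" can be None when Tika finds no text in the file.
    return (parsed.get("content") or "").strip()
```

Calling extract_text("report.pdf") in a loop over a folder reuses the same running server, so the JVM start-up cost is paid once rather than per file.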

 

Thanks,

Chris

 

 

 

From: Bridger Dyson-Smith <bd...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Thursday, January 5, 2023 at 7:09 AM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: [EXTERNAL] Re: Subset(s) of Tika?

 

Hi Nick and Georg

 

On Thu, Jan 5, 2023 at 9:34 AM Nick Burch <ap...@gagravarr.org> wrote:

On Thu, 5 Jan 2023, Georg.Fischer wrote:
> The tika.jar has >54 MB, and I suspect that the loading of the big jar 
> (under Windows) is hindering the performance. I should perhaps move to 
> Linux, or try the Tika server.

The Tika App jar has always been the "kitchen sink included quickstart" 
option.

The Tika Java library and the Tika Server both support including or 
excluding groups of file format parsers.

> I used a recent tika.jar on the Windows 10 commandline to extract text 
> from some 30 PDF files, with a makefile converting one file per command. 
> That was quite successful, but it took some time, and the approach will 
> perhaps not be appropriate for 300 or 1000 PDFs.

For a folder of files, you might be better off with Tika Batch, which is 
aimed at batch processing a large number of files. It can respawn failed 
child processes, doesn't require starting a JVM for every file, etc.
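
A minimal sketch of driving batch mode from a script, assuming the tika-app jar's -i (input directory) and -o (output directory) options; the function names and the jar path are placeholders, and any further batch options should be checked against the Tika Batch documentation for your version.

```python
import subprocess
from typing import List


def build_batch_command(app_jar: str, input_dir: str, output_dir: str) -> List[str]:
    """Build a tika-app command line that processes a whole folder in one run."""
    # -i / -o are assumed to select the input and output directories.
    return ["java", "-jar", app_jar, "-i", input_dir, "-o", output_dir]


def run_batch(app_jar: str, input_dir: str, output_dir: str) -> int:
    """Launch the batch run and return the exit code of the java process."""
    return subprocess.run(build_batch_command(app_jar, input_dir, output_dir)).returncode
```

This replaces one JVM start per file (as in the makefile approach) with a single invocation over the folder.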

Otherwise, the Tika Server is a good option. If you're doing everything 
locally, turn on "-enableUnsecureFeatures -enableFileUrl" and then you can 
pass it a file path to process (but not on a publicly accessible 
machine!)
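
A hedged sketch of the client side, assuming a server on the default port 9998 that was started with those two switches and that reads the file location from a "fileUrl" request header on a PUT to /tika (the header name is an assumption based on the Tika Server 1.x behaviour; newer releases may configure this differently). The function names are illustrative.

```python
import urllib.request
from pathlib import Path


def build_file_url_request(path: str, server: str = "http://localhost:9998") -> urllib.request.Request:
    """Build a PUT /tika request that points the server at a local file."""
    # The server reads the file itself, so only a file:// URL is sent, no body.
    file_url = Path(path).resolve().as_uri()
    return urllib.request.Request(
        server.rstrip("/") + "/tika",
        method="PUT",
        headers={"fileUrl": file_url, "Accept": "text/plain"},  # header name assumed
    )


def extract_via_server(path: str, server: str = "http://localhost:9998") -> str:
    """Send the request to a running Tika Server and return the extracted text."""
    with urllib.request.urlopen(build_file_url_request(path, server)) as resp:
        return resp.read().decode("utf-8")
```

Because the server opens whatever path it is given, this only makes sense when client and server run on the same trusted machine.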

Now that's a neat trick - I was just going to suggest the Server but those switches are
definitely something to add to my notes. Also, thanks for suggesting Tika Batch - I didn't
know about that either.

 

Nick

 

Best,

Bridger