Posted to user@tika.apache.org by Milos Kovacevic <mi...@grf.bg.ac.rs> on 2014/10/30 12:34:59 UTC

Setting tesseract properties when using tika-server

Hello,
I am using tika-server-1.7-SNAPSHOT.jar, which incorporates the Tesseract OCR
engine. How can I set different Tesseract parameters, such as the default
language or the output format (hOCR), for each separate request to the Tika
server?
Regards, Milos


Re: Setting tesseract properties when using tika-server

Posted by Milos Kovacevic <mi...@grf.bg.ac.rs>.
Thank you very much!
I think I'll go with option no. 2.
Cheers,
Milos
> Hi Milos,
>
>> On 30 Oct 2014, at 15:06, Milos Kovacevic <mi...@grf.bg.ac.rs> wrote:
>>
>> How can I do that?
>
> We have recently added two options based on your feedback:
> 1) Placing your own custom TesseractOCRConfig.properties file on the
> classpath to override the default settings.
> 2) Passing the X-Tika-OCRLanguage custom header to the /tika resource,
> specifying your language using the Tesseract language parameter scheme.
>
> The wiki has been updated to include information on this on the TikaOCR
> page: https://wiki.apache.org/tika/TikaOCR
>
> Cheers,
> Dave



Re: Setting tesseract properties when using tika-server

Posted by David Meikle <lo...@gmail.com>.
Hi Milos,

> On 30 Oct 2014, at 15:06, Milos Kovacevic <mi...@grf.bg.ac.rs> wrote:
> 
> How can I do that?

We have recently added two options based on your feedback:
1) Placing your own custom TesseractOCRConfig.properties file on the classpath to override the default settings.
2) Passing the X-Tika-OCRLanguage custom header to the /tika resource, specifying your language using the Tesseract language parameter scheme.

The wiki has been updated to include information on this on the TikaOCR page: https://wiki.apache.org/tika/TikaOCR
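
As a rough illustration of option 2, a Java client might attach the header along the following lines. This is only a sketch: the header name and the /tika resource come from this message, while the host and port (localhost:9998), the use of PUT, and the class and method names (OcrLanguageRequest, ocrRequest) are assumptions made for the example. The request is prepared but never sent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class OcrLanguageRequest {
    // Header name taken from the thread; everything else here is illustrative.
    public static final String OCR_LANGUAGE_HEADER = "X-Tika-OCRLanguage";

    /**
     * Prepares (but does not send) a PUT request to the /tika resource with
     * the OCR language header set. The base URL is an assumption.
     */
    public static HttpURLConnection ocrRequest(String baseUrl, String language) {
        try {
            // openConnection() only creates the object; nothing is sent yet.
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(baseUrl + "/tika").toURL().openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true); // the document bytes would go in the request body
            conn.setRequestProperty(OCR_LANGUAGE_HEADER, language);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn = ocrRequest("http://localhost:9998", "eng+fra");
        System.out.println(conn.getRequestMethod() + " " + conn.getURL());
        System.out.println(OCR_LANGUAGE_HEADER + ": "
                + conn.getRequestProperty(OCR_LANGUAGE_HEADER));
    }
}
```

To actually submit a document, the caller would write the file bytes to conn.getOutputStream() and then read the extracted text from conn.getInputStream().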

Cheers,
Dave

Re: Setting tesseract properties when using tika-server

Posted by David Meikle <lo...@gmail.com>.
Hey Chris,

> On 17 Nov 2014, at 16:46, Mattmann, Chris A (3980) <ch...@jpl.nasa.gov> wrote:
> 
> Hi Dave,
> 
> I like having request headers with the Tesseract properties, prefixed
> with X-Tika-OCR<propertyname>. Very cool idea!
> 
> Cheers,
> Chris

I was thinking the same thing this morning, having done it for the language only, given Milos's email. Going to update this with the other properties too.
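
One way the X-Tika-OCR<propertyname> idea could work is sketched below. This is a hypothetical illustration, not the Tika server's implementation: the class name OcrHeaderMapper, the method ocrProperties, the rule of lower-casing the first letter of the suffix, and the X-Tika-OCRTimeout header used in main are all assumptions. Only the X-Tika-OCR prefix and the X-Tika-OCRLanguage header come from this thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class OcrHeaderMapper {
    // Prefix proposed in the thread; the mapping rules below are hypothetical.
    public static final String PREFIX = "X-Tika-OCR";

    /**
     * Collects headers that start with the prefix and returns them keyed by
     * property name, so that "X-Tika-OCRLanguage" becomes "language".
     */
    public static Map<String, String> ocrProperties(Map<String, String> headers) {
        Map<String, String> props = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            String name = header.getKey();
            // HTTP header names are case-insensitive, so ignore case when matching.
            if (name.length() > PREFIX.length()
                    && name.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                String suffix = name.substring(PREFIX.length());
                // Lower-case the first letter to get a bean-style property name.
                String property = suffix.substring(0, 1).toLowerCase(Locale.ROOT)
                        + suffix.substring(1);
                props.put(property, header.getValue());
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "image/png");
        headers.put("X-Tika-OCRLanguage", "eng+fra");
        headers.put("X-Tika-OCRTimeout", "120"); // hypothetical header
        System.out.println(ocrProperties(headers));
    }
}
```

The resulting map could then be applied to a per-request OCR config, leaving unrelated headers such as Content-Type untouched.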

Cheers,
Dave

Re: Setting tesseract properties when using tika-server

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Dave,

I like having request headers with the Tesseract properties, prefixed
with X-Tika-OCR<propertyname>. Very cool idea!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: David Meikle <lo...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Monday, November 17, 2014 at 3:54 PM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: Re: Setting tesseract properties when using tika-server

>Hi Nick,
>
>On 16 Nov 2014, at 11:16, Nick Burch <ap...@gagravarr.org> wrote:
>
>> Maybe we could say that the default Tika URL won't include Tesseract. We
>> then provide another one that does bring it in, and offers parameters to
>> hint which languages to try on that request?
>
>Considering this again, we have already set the pattern that you can hint
>via headers (e.g. our File-Name header), so why not do this via a header?
>
>Thinking about calling this X-Tika-OCRLanguage? Any other preferences?
>
>Cheers,
>Dave


Re: Setting tesseract properties when using tika-server

Posted by David Meikle <lo...@gmail.com>.
Hi Nick,

> On 16 Nov 2014, at 11:16, Nick Burch <ap...@gagravarr.org> wrote:
> 
> Maybe we could say that the default Tika URL won't include Tesseract. We then provide another one that does bring it in, and offers parameters to hint which languages to try on that request?

Considering this again, we have already set the pattern that you can hint via headers (e.g. our File-Name header), so why not do this via a header?

Thinking about calling this X-Tika-OCRLanguage? Any other preferences?

Cheers,
Dave

Re: Setting tesseract properties when using tika-server

Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 15 Nov 2014, David Meikle wrote:
>> The OP is using the Tika Server though. I guess we'd need to allow for 
>> an extra header in the server to get this set on the context used in 
>> the server's parsing?
>
> We could do something like this to allow users to set the language per 
> request - I am using the parser wrapped via its own server API, so all I 
> am doing is capturing a request parameter and then setting the context 
> to override a patched TesseractOCRConfig that loads from an external 
> properties file akin to the PDFConfig file.  I will add that in at 
> least.
>
> I personally don’t like custom headers that modify behaviour, although
> you do see it in POST requests at times.  Same difference really between
> this and an optional parameter.  Maybe the config file will be enough, as
> having added the above, I don’t see much difference between a call with
> a single language and one with all languages configured.

Maybe we could say that the default Tika URL won't include Tesseract. We
then provide another one that does bring it in, and offers parameters to
hint which languages to try on that request?

Nick

Re: Setting tesseract properties when using tika-server

Posted by David Meikle <lo...@gmail.com>.
Hi Nick,

> On 15 Nov 2014, at 15:39, Nick Burch <ap...@gagravarr.org> wrote:
> 
> On Sat, 15 Nov 2014, David Meikle wrote:
>>> How can I do that?
>> 
>> You can set this using the TesseractOCRConfig class.  It has a property called language, which can be set to a "+"-separated list of supported language models (i.e. the ones you have installed with your Tesseract installation), identified by their ISO 639-2 codes.  You then add the config to the ParseContext to override the default, which uses the English model only.
>> 
>> ParseContext context = new ParseContext();
>> TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
>> ocrConfig.setLanguage("eng+fra+deu");
>> context.set(TesseractOCRConfig.class, ocrConfig);
> 
> The OP is using the Tika Server though. I guess we'd need to allow for an extra header in the server to get this set on the context used in the server's parsing?

We could do something like this to allow users to set the language per request. I am using the parser wrapped in its own server API, so all I am doing is capturing a request parameter and then setting the context to override a patched TesseractOCRConfig that loads from an external properties file, akin to the PDFConfig file.  I will add that in at least.

I personally don’t like custom headers that modify behaviour, although you do see it in POST requests at times.  Same difference really between this and an optional parameter.  Maybe the config file will be enough, as having added the above, I don’t see much difference between a call with a single language and one with all languages configured.

>> I am using this in production now and have done some work to make configuring the OCR Parser easier.  Not had time to contribute this back, will hopefully be able to do this whilst at ApacheCon EU.
> 
> I'll be there too, but slightly stressed with the number of talks I'm giving, but I can hopefully offer a quick hand at some point :)

I had noticed you are doing quite a few sessions! See you soon.

Cheers,
Dave


Re: Setting tesseract properties when using tika-server

Posted by Nick Burch <ap...@gagravarr.org>.
On Sat, 15 Nov 2014, David Meikle wrote:
>> How can I do that?
>
> You can set this using the TesseractOCRConfig class.  It has a property
> called language, which can be set to a "+"-separated list of supported
> language models (i.e. the ones you have installed with your Tesseract
> installation), identified by their ISO 639-2 codes.  You then add the
> config to the ParseContext to override the default, which uses the
> English model only.
>
> ParseContext context = new ParseContext();
> TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
> ocrConfig.setLanguage("eng+fra+deu");
> context.set(TesseractOCRConfig.class, ocrConfig);

The OP is using the Tika Server though. I guess we'd need to allow for an 
extra header in the server to get this set on the context used in the 
server's parsing?

> I am using this in production now and have done some work to make 
> configuring the OCR Parser easier.  Not had time to contribute this 
> back, will hopefully be able to do this whilst at ApacheCon EU.

I'll be there too, but slightly stressed with the number of talks I'm 
giving, but I can hopefully offer a quick hand at some point :)

Nick

Re: Setting tesseract properties when using tika-server

Posted by David Meikle <lo...@gmail.com>.
Hello Milos,

> On 30 Oct 2014, at 15:06, Milos Kovacevic <mi...@grf.bg.ac.rs> wrote:
> 
>> On Thu, 30 Oct 2014, Milos Kovacevic wrote:
>>> I am using tika-server-1.7-SNAPSHOT.jar, which incorporates the Tesseract
>>> OCR engine. How can I set different Tesseract parameters, such as the
>>> default language or the output format (hOCR), for each separate request
>>> to the Tika server?
>> 
>> I believe they can only be set once on a server-wide basis at the moment.
> 
> How can I do that?

You can set this using the TesseractOCRConfig class.  It has a property called language, which can be set to a "+"-separated list of supported language models (i.e. the ones you have installed with your Tesseract installation), identified by their ISO 639-2 codes.  You then add the config to the ParseContext to override the default, which uses the English model only.

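import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;

// Register an OCR config that tries the English, French and German models.
ParseContext context = new ParseContext();
TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
ocrConfig.setLanguage("eng+fra+deu");
context.set(TesseractOCRConfig.class, ocrConfig);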

Then it is a case of using this ParseContext within the parser.
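
If the language list is assembled at run time, the "+"-separated value could be built with a small helper such as the one sketched here. The class name OcrLanguages, the method join, and the validation pattern are assumptions for illustration only; in particular, the pattern (three lower-case letters with an optional suffix such as _latn) is a simplification and not a rule taken from Tika or Tesseract.

```java
import java.util.Arrays;
import java.util.List;

public class OcrLanguages {
    // Simplifying assumption: a code is three lower-case letters, optionally
    // followed by an underscore suffix (for example "srp_latn").
    private static final String CODE_PATTERN = "[a-z]{3}(_[a-z]+)?";

    /**
     * Joins language codes into the "+"-separated form, rejecting an empty
     * list and anything that does not match the assumed code pattern.
     */
    public static String join(List<String> codes) {
        if (codes.isEmpty()) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        for (String code : codes) {
            if (!code.matches(CODE_PATTERN)) {
                throw new IllegalArgumentException("unexpected language code: " + code);
            }
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        // The result is the kind of string passed to setLanguage(...).
        System.out.println(join(Arrays.asList("eng", "fra", "deu")));
    }
}
```

Validating up front gives an immediate error for a value such as "english" instead of a failure later, when Tesseract looks for the matching language data.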

>> 
>> Could you explain a use case for wanting to change it on a per-request
>> basis, to help us understand?
> 
> Well, I have a lot of files written in different languages and alphabets.
> OCR performance depends on that information. So when I have to send, let's
> say, an English file, I'll set the language to eng, and if the file is
> Serbian I'll set it to srp. Tesseract uses language files to improve
> recognition performance.

I am using this in production now and have done some work to make configuring the OCR Parser easier.  Not had time to contribute this back, will hopefully be able to do this whilst at ApacheCon EU.

Cheers,
Dave

Re: Setting tesseract properties when using tika-server

Posted by Milos Kovacevic <mi...@grf.bg.ac.rs>.
Hello,

> On Thu, 30 Oct 2014, Milos Kovacevic wrote:
>> I am using tika-server-1.7-SNAPSHOT.jar, which incorporates the Tesseract
>> OCR engine. How can I set different Tesseract parameters, such as the
>> default language or the output format (hOCR), for each separate request
>> to the Tika server?
>
> I believe they can only be set once on a server-wide basis at the moment.

How can I do that?

>
> Could you explain a use case for wanting to change it on a per-request
> basis, to help us understand?

Well, I have a lot of files written in different languages and alphabets.
OCR performance depends on that information. So when I have to send, let's
say, an English file, I'll set the language to eng, and if the file is
Serbian I'll set it to srp. Tesseract uses language files to improve
recognition performance.
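
To make the use case concrete, here is a minimal sketch of choosing the language code per file before each request. The class and method names (PerFileOcrLanguage, codeFor) and the lookup table are hypothetical; eng and srp are the names of Tesseract's English and Serbian (Cyrillic) language data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PerFileOcrLanguage {
    // Illustrative lookup table from a document's known language to the
    // Tesseract language code to request for it.
    private static final Map<String, String> CODES = new HashMap<>();
    static {
        CODES.put("english", "eng");
        CODES.put("serbian", "srp");
    }

    /** Returns the code for the given language, or the fallback if unknown. */
    public static String codeFor(String documentLanguage, String fallback) {
        String key = documentLanguage.trim().toLowerCase(Locale.ROOT);
        String code = CODES.get(key);
        return code != null ? code : fallback;
    }

    public static void main(String[] args) {
        // Each file would be sent with the code chosen for its language.
        System.out.println("report.png  -> " + codeFor("English", "eng"));
        System.out.println("izvestaj.png -> " + codeFor("Serbian", "eng"));
    }
}
```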

>
> Thanks
> Nick
>

Regards, Milos


Re: Setting tesseract properties when using tika-server

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 30 Oct 2014, Milos Kovacevic wrote:
> I am using tika-server-1.7-SNAPSHOT.jar, which incorporates the Tesseract
> OCR engine. How can I set different Tesseract parameters, such as the
> default language or the output format (hOCR), for each separate request
> to the Tika server?

I believe they can only be set once on a server-wide basis at the moment.

Could you explain a use case for wanting to change it on a per-request 
basis, to help us understand?

Thanks
Nick