Posted to dev@tika.apache.org by "GAURAV KUMAR (Jira)" <ji...@apache.org> on 2021/10/14 06:00:00 UTC

[jira] [Commented] (TIKA-2970) Configuring Tesseract for OCR of PDF via Tika Config is not working

    [ https://issues.apache.org/jira/browse/TIKA-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428613#comment-17428613 ] 

GAURAV KUMAR commented on TIKA-2970:
------------------------------------

I need help with this.

Before Tika 2.x, we set the Tesseract parameters in config.properties through TesseractOCRConfig, but that is no longer the case. Do we now set those parameters in tika-config.xml, loaded through the TikaConfig class? If so, don't we also need to set them separately through TesseractOCRParser?

Please help me understand this, as I need to migrate from Tika 1.27 to Tika 2.0.0.

> Configuring Tesseract for OCR of PDF via Tika Config is not working
> -------------------------------------------------------------------
>
>                 Key: TIKA-2970
>                 URL: https://issues.apache.org/jira/browse/TIKA-2970
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>    Affects Versions: 1.22
>            Reporter: David Eric Pugh
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.23
>
>
> Based on TIKA-2705, I thought I could eliminate the use of the properties files for configuring PDF and OCR processing, and just use a tika-config.xml file.
> I believe I have a unit test that demonstrates that if you need to override the tesseract path for OCR, you end up always with the default Tesseract configuration, which leads to Tika throwing an error: https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328   
> In stepping through the code, it seems like every time we consult the context:
> ```
> TesseractOCRConfig tesseractConfig =
>                 context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
> ```
> We always get back the default.  The context never has our customized TesseractOCRConfig!   Despite the fact that when we load up the TikaConfig in the first case, I notice that we do create a TesseractOCRParser object WITH the various parameters...   
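
The lookup behaviour described in the quoted report can be illustrated with a minimal, self-contained sketch. It does not use Tika itself: `SketchContext` and `SketchOcrConfig` are hypothetical stand-ins for a type-keyed parse context and an OCR config, and the path `/opt/tesseract/bin` is only an example value. The sketch assumes the context is a simple map from class to instance, so a `get` with a default returns the default whenever nothing was registered under that class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a type-keyed parse context (not Tika's ParseContext). */
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    /** Registers a value under its class key. */
    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the registered value, or defaultValue if nothing was registered. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

/** Hypothetical stand-in for an OCR config; holds only a Tesseract path. */
class SketchOcrConfig {
    final String tesseractPath;

    SketchOcrConfig(String tesseractPath) {
        this.tesseractPath = tesseractPath;
    }
}

class ContextLookupSketch {
    // Assumed default: an empty path, meaning "find tesseract on the system PATH".
    static final SketchOcrConfig DEFAULT_CONFIG = new SketchOcrConfig("");

    public static void main(String[] args) {
        SketchOcrConfig custom = new SketchOcrConfig("/opt/tesseract/bin");
        SketchContext context = new SketchContext();

        // The custom config exists but was never placed in the context,
        // so the lookup falls back to the default.
        SketchOcrConfig found = context.get(SketchOcrConfig.class, DEFAULT_CONFIG);
        System.out.println(found == DEFAULT_CONFIG); // prints "true"

        // Once the custom config is registered, the same lookup returns it.
        context.set(SketchOcrConfig.class, custom);
        found = context.get(SketchOcrConfig.class, DEFAULT_CONFIG);
        System.out.println(found.tesseractPath); // prints "/opt/tesseract/bin"
    }
}
```

If the real code path behaves like the first lookup, that would be consistent with the report: a parser object can be built with custom parameters while the context consulted at parse time still holds no matching entry.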



--
This message was sent by Atlassian Jira
(v8.3.4#803005)