
Posted to dev@tika.apache.org by "August Valera (JIRA)" <ji...@apache.org> on 2018/07/26 23:52:00 UTC

[jira] [Created] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

August Valera created TIKA-2696:
-----------------------------------

             Summary: Support output of Tesseract OSD output for psm mode 0
                 Key: TIKA-2696
                 URL: https://issues.apache.org/jira/browse/TIKA-2696
             Project: Tika
          Issue Type: Improvement
          Components: ocr
            Reporter: August Valera


TIKA-2357 added support for additional page segmentation modes (PSM) for Tesseract OCR, including mode 0, {{Orientation and script detection (OSD) only}}. In this mode Tesseract does not perform OCR; it only outputs orientation and script information.

An example usage of mode 0:
{code:bash}
$ tesseract infile.png outfile --psm 0 -l osd
{code}
In this mode, the usual {{outfile.txt}} is not created. Instead, as in other modes that run OSD in addition to extraction, the result is written to an {{outfile.osd}} file, like so:
{code:java}
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 212
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 13.73
Script: Latin
Script confidence: 4.78
{code}
However, {{TesseractOCRParser#parse(...)}} is [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437] to read only the contents of {{outfile.txt}} (or {{outfile.hocr}}) in all modes, so in mode 0 Tika outputs nothing regardless of input.
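
One possible direction, sketched here only as an illustration, is to pick the result file by mode before reading it. The class {{OutputFileSketch}}, the method {{resultFile}}, and its parameters are hypothetical names introduced for this example and are not part of Tika; the sketch assumes the page segmentation mode is available as a string and that the only other outputs are {{.txt}} and {{.hocr}}.

```java
import java.io.File;

/** Hypothetical helper, not part of Tika: chooses which Tesseract result file to read. */
final class OutputFileSketch {

    /**
     * @param outputBase  the output base path handed to tesseract (e.g. "outfile")
     * @param pageSegMode the --psm value as a string (assumed representation)
     * @param hocr        true if hOCR output was requested instead of plain text
     */
    static File resultFile(File outputBase, String pageSegMode, boolean hocr) {
        String extension;
        if ("0".equals(pageSegMode)) {
            // Mode 0 is OSD only: Tesseract writes outfile.osd and no outfile.txt
            extension = "osd";
        } else if (hocr) {
            extension = "hocr";
        } else {
            extension = "txt";
        }
        return new File(outputBase.getPath() + "." + extension);
    }

    public static void main(String[] args) {
        File base = new File("outfile");
        System.out.println(resultFile(base, "0", false).getName());
        System.out.println(resultFile(base, "1", false).getName());
    }
}
```

Whether the OSD result should then go into the text stream or into metadata is a separate design question that this sketch does not address.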

This is consistent with Tika's goal of outputting extracted text, but it runs counter to the intent of a user who selects mode 0 and expects OSD output.
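
If the OSD result were surfaced, its "key: value" lines could be turned into structured fields. The following is a minimal sketch under that assumption; {{OsdSketch}} and {{parseOsd}} are hypothetical names, not existing Tika API, and the parsing rule (split on the first colon, skip lines without one) is inferred only from the sample output above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Tika: parses "key: value" lines of OSD output. */
final class OsdSketch {

    static Map<String, String> parseOsd(List<String> lines) {
        // LinkedHashMap keeps the fields in the order Tesseract wrote them
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no key present, e.g. "Page 1" or a warning line
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        parseOsd(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The resulting map could, for example, be copied into the document's metadata so that the extracted text stays empty while the OSD information remains available to the caller.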



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)