Posted to dev@tika.apache.org by "August Valera (JIRA)" <ji...@apache.org> on 2018/07/26 23:52:00 UTC
[jira] [Created] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0
August Valera created TIKA-2696:
-----------------------------------
Summary: Support output of Tesseract OSD output for psm mode 0
Key: TIKA-2696
URL: https://issues.apache.org/jira/browse/TIKA-2696
Project: Tika
Issue Type: Improvement
Components: ocr
Reporter: August Valera
TIKA-2357 added support for additional Tesseract OCR page segmentation modes (PSM), including mode 0, {{Orientation and script detection (OSD) only}}. In this mode Tesseract does not perform OCR; it only outputs orientation and script information.
An example usage of mode 0:
{code:bash}
$ tesseract infile.png outfile --psm 0 -l osd
{code}
In this mode, the usual {{outfile.txt}} is not created. Instead, as in other modes that run OSD in addition to extraction, Tesseract writes an {{outfile.osd}} file, like so:
{code:java}
Page 1
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 212
Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 13.73
Script: Latin
Script confidence: 4.78
{code}
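To illustrate how this output could be consumed, here is a minimal sketch that parses the {{Key: value}} lines shown above into a map. The class name {{OsdOutput}} and its {{parse}} method are hypothetical and are not part of Tika; lines without a colon (such as the page banner and resolution messages) are simply skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
public class OsdOutput {

    /** Parses "Key: value" lines; lines without a colon are skipped. */
    public static Map<String, String> parse(String content) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : content.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // e.g. "Page 1" or a resolution warning
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
                "Page 1",
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        Map<String, String> osd = parse(sample);
        System.out.println(osd.get("Orientation in degrees"));
        System.out.println(osd.get("Script"));
    }
}
```

Keeping the values as strings avoids guessing at number formats; a caller could convert the confidence fields to doubles if needed.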
However, {{TesseractOCRParser#parse(...)}} is [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437] to only read the contents of {{outfile.txt}} (alternatively {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input.
This is consistent with Tika's goal of outputting extracted text, but it defeats the purpose for a user who selects mode 0 expecting OSD output.
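One possible direction, sketched below under stated assumptions, is to choose which Tesseract output file to read based on the page segmentation mode. The class {{TesseractOutputFile}} and its {{resolve}} method are hypothetical; this is not how {{TesseractOCRParser}} is currently written, only an illustration of the idea.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not Tika's actual code.
public class TesseractOutputFile {

    /**
     * Picks the file Tesseract is expected to write for a given output base.
     * Assumes mode 0 writes only ".osd", as described in this issue.
     */
    public static Path resolve(Path outputBase, int psm, boolean hocr) {
        String extension;
        if (psm == 0) {
            extension = ".osd";   // OSD only: no text output is produced
        } else if (hocr) {
            extension = ".hocr";
        } else {
            extension = ".txt";
        }
        return outputBase.resolveSibling(outputBase.getFileName() + extension);
    }

    public static void main(String[] args) {
        Path base = Paths.get("outfile");
        System.out.println(resolve(base, 0, false));
        System.out.println(resolve(base, 3, false));
        System.out.println(resolve(base, 3, true));
    }
}
```

How the OSD values would then be surfaced (as metadata or as text) is a separate design question that this sketch does not address.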