Posted to dev@tika.apache.org by "August Valera (JIRA)" <ji...@apache.org> on 2018/08/13 18:08:00 UTC

[jira] [Updated] (TIKA-2696) Support output of Tesseract OSD output for psm mode 0

     [ https://issues.apache.org/jira/browse/TIKA-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

August Valera updated TIKA-2696:
--------------------------------
    External issue URL: https://github.com/apache/tika/pull/246/files

> Support output of Tesseract OSD output for psm mode 0
> -----------------------------------------------------
>
>                 Key: TIKA-2696
>                 URL: https://issues.apache.org/jira/browse/TIKA-2696
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>            Reporter: August Valera
>            Priority: Minor
>
> TIKA-2357 added support for additional Tesseract OCR page segmentation modes (PSMs), including mode 0, {{Orientation and script detection (OSD) only}}. In this mode Tesseract does not perform OCR; it only outputs orientation and script information.
> An example usage of mode 0:
> {code:java}
> $ tesseract infile.png outfile --psm 0 -l osd
> {code}
> In this mode, the usual {{outfile.txt}} is not created. Instead, as in other modes that run OSD in addition to extraction, the result is written to an {{outfile.osd}} file, like so:
> {code:java}
> Page 1
> Warning. Invalid resolution 0 dpi. Using 70 instead.
> Estimating resolution as 212
> Page number: 0
> Orientation in degrees: 0
> Rotate: 0
> Orientation confidence: 13.73
> Script: Latin
> Script confidence: 4.78
> {code}
> However, {{TesseractOCRParser#parse(...)}} is [coded|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java#L437] to read only the contents of {{outfile.txt}} (or alternatively {{outfile.hocr}}) in all modes, so mode 0 outputs nothing regardless of input.
> This is consistent with Tika's goal of outputting extracted text, but it defeats the intention of a user who selects mode 0 expecting OSD output.
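
As a rough illustration of the behaviour requested above, the following sketch picks the output file extension from the PSM value and turns the "Key: value" lines of an {{outfile.osd}} file into a map. It is not the code from the linked pull request or from Tika itself: the class {{OsdOutputSketch}} and the methods {{outputExtension}} and {{parseOsd}} are hypothetical names introduced here, and the sketch assumes the OSD output has the line format shown in the example above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika or Tesseract.
final class OsdOutputSketch {

    // Assumption: PSM 0 produces <base>.osd, while the other modes produce
    // <base>.txt, or <base>.hocr when hOCR output is requested.
    static String outputExtension(int psm, boolean hocr) {
        if (psm == 0) {
            return "osd";
        }
        return hocr ? "hocr" : "txt";
    }

    // Collects "Key: value" lines into an insertion-ordered map.
    // Lines without a ": " separator (for example warnings) are skipped.
    static Map<String, String> parseOsd(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep <= 0) {
                continue;
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "Page number: 0",
                "Orientation in degrees: 0",
                "Rotate: 0",
                "Orientation confidence: 13.73",
                "Script: Latin",
                "Script confidence: 4.78");
        System.out.println("Read from: outfile." + outputExtension(0, false));
        parseOsd(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

A parser following this idea could emit the resulting fields as text or as metadata; which of the two fits Tika better is a design question this sketch leaves open.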


