Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/05/18 12:48:00 UTC

[jira] [Updated] (TIKA-3384) Convert new transcribe package to a Parser

     [ https://issues.apache.org/jira/browse/TIKA-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-3384:
------------------------------
    Summary: Convert new transcribe package to a Parser  (was: Convert new transcribe package to a Parser along the lines of OCR?)

> Convert new transcribe package to a Parser
> ------------------------------------------
>
>                 Key: TIKA-3384
>                 URL: https://issues.apache.org/jira/browse/TIKA-3384
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> This is a proposal to convert [~lewismc] et al.'s awesome new transcribe code into a parser along the lines of the Tesseract OCR parser.
> In 2.x, I inverted the call order from 1.x.  The image parsers now check whether there is a parser that supports a pseudo mime type, like {{image/ocr-jpeg}}; if there is, they apply that parser to the stream.  We could do the same thing with the media files that the new transcription package supports.
> Those who want only OCR/transcription can turn off the image parsers and then decorate the OCR parser, for example, with {{supports "image/jpeg"}}, and that parser will be called directly.
> What do you think?
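
The dispatch described above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Tika code: the class PseudoMimeDispatch, its methods, and the string results are all hypothetical, and only the pseudo type image/ocr-jpeg comes from the issue text. It assumes a simple registry keyed by media type; a transcription parser would presumably register under an analogous pseudo type for audio or video.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: none of these names are real Tika classes or methods.
public class PseudoMimeDispatch {

    // Registry of parsers keyed by the media type they claim to support.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    public void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Builds a pseudo type by inserting a marker after the slash,
    // e.g. ("image/jpeg", "ocr") becomes "image/ocr-jpeg".
    public static String pseudoType(String mediaType, String marker) {
        int slash = mediaType.indexOf('/');
        return mediaType.substring(0, slash + 1) + marker + "-" + mediaType.substring(slash + 1);
    }

    // What a format parser would do in the inverted call order: handle the
    // file itself, then apply a parser registered for the pseudo type, if any.
    public String parseWithDelegate(String mediaType, String marker, byte[] data) {
        StringBuilder out = new StringBuilder("metadata(" + mediaType + ")");
        Function<byte[], String> delegate = parsers.get(pseudoType(mediaType, marker));
        if (delegate != null) {
            out.append(" + ").append(delegate.apply(data));
        }
        return out.toString();
    }

    // What happens when the format parsers are turned off and the OCR or
    // transcription parser is registered for the real media type directly.
    public String parseDirect(String mediaType, byte[] data) {
        Function<byte[], String> parser = parsers.get(mediaType);
        return parser == null ? "" : parser.apply(data);
    }

    public static void main(String[] args) {
        Function<byte[], String> ocr = bytes -> "ocr-text"; // stand-in for real OCR

        PseudoMimeDispatch delegating = new PseudoMimeDispatch();
        delegating.register("image/ocr-jpeg", ocr);
        System.out.println(delegating.parseWithDelegate("image/jpeg", "ocr", new byte[0]));

        PseudoMimeDispatch ocrOnly = new PseudoMimeDispatch();
        ocrOnly.register("image/jpeg", ocr);
        System.out.println(ocrOnly.parseDirect("image/jpeg", new byte[0]));
    }
}
```

In this sketch the format parser never needs to know which OCR or transcription implementation is present; it only asks the registry whether something has claimed the pseudo type, which is the property the proposal relies on.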



--
This message was sent by Atlassian Jira
(v8.3.4#803005)