Posted to dev@tika.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2021/05/18 10:56:00 UTC

[jira] [Commented] (TIKA-3384) Convert new transcribe package to a Parser along the lines of OCR?

    [ https://issues.apache.org/jira/browse/TIKA-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346788#comment-17346788 ] 

Hudson commented on TIKA-3384:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #237 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/237/])
TIKA-3384 -- convert transcribe to a traditional parser (tallison: [https://github.com/apache/tika/commit/93d2211037b01ca237a51f83879ae35f3f76dca8])
* (delete) tika-transcribe/src/test/resources/en-US_(Hi).mp4
* (delete) tika-transcribe/src/test/resources/ko-KR_(We_Are_Having_Class_x2).mp3
* (delete) tika-transcribe/pom.xml
* (delete) tika-transcribe/src/test/resources/it-IT_(We_Are_Having_Class_x2).mp3
* (delete) tika-transcribe/src/main/java/org/apache/tika/transcribe/AmazonTranscribe.java
* (delete) tika-transcribe/src/test/resources/ja-JP_(We_Are_At_School).mp3
* (edit) pom.xml
* (delete) tika-transcribe/src/test/resources/de-DE_(We_Are_At_School_x2).mp3
* (edit) tika-example/pom.xml
* (delete) tika-transcribe/src/test/java/org/apache/tika/transcribe/AmazonTranscribeTest.java
* (delete) tika-transcribe/src/test/resources/pt-BR_(We_Are_At_School).mp3
* (delete) tika-transcribe/src/test/resources/en-US_(A_Little_Bottle_Of_Water).mp3
* (delete) tika-transcribe/src/test/resources/en-GB_(A_Little_Bottle_Of_Water).mp3
* (delete) tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
* (delete) tika-transcribe/src/test/resources/ko-KR_(Annyeonghaseyo).mp4
* (delete) tika-transcribe/src/main/resources/org.apache.tika.transcribe/transcribe.amazon.properties
* (delete) tika-transcribe/src/test/resources/ShortAudioSampleFrench.mp3
* (edit) tika-example/src/main/java/org/apache/tika/example/TranscribeTranslateExample.java
* (delete) tika-transcribe/src/main/resources/META-INF.services/org.apache.tika.language.translate.Translator
* (edit) tika-parsers/tika-parsers-ml/pom.xml
* (delete) tika-transcribe/src/test/resources/en-AU_(A_Little_Bottle_Of_Water).mp3
Revert "TIKA-3384 -- convert transcribe to a traditional parser" (tallison: [https://github.com/apache/tika/commit/2e520e82d7c2d5088803af60cb44793abd852bea])
* (add) tika-transcribe/pom.xml
* (add) tika-transcribe/src/test/resources/ko-KR_(Annyeonghaseyo).mp4
* (add) tika-transcribe/src/test/java/org/apache/tika/transcribe/AmazonTranscribeTest.java
* (add) tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
* (add) tika-transcribe/src/test/resources/de-DE_(We_Are_At_School_x2).mp3
* (edit) tika-parsers/tika-parsers-ml/pom.xml
* (add) tika-transcribe/src/test/resources/en-GB_(A_Little_Bottle_Of_Water).mp3
* (add) tika-transcribe/src/test/resources/it-IT_(We_Are_Having_Class_x2).mp3
* (add) tika-transcribe/src/main/resources/META-INF.services/org.apache.tika.language.translate.Translator
* (add) tika-transcribe/src/test/resources/en-US_(A_Little_Bottle_Of_Water).mp3
* (add) tika-transcribe/src/test/resources/ShortAudioSampleFrench.mp3
* (add) tika-transcribe/src/test/resources/en-US_(Hi).mp4
* (add) tika-transcribe/src/test/resources/ko-KR_(We_Are_Having_Class_x2).mp3
* (add) tika-transcribe/src/test/resources/pt-BR_(We_Are_At_School).mp3
* (edit) tika-example/pom.xml
* (add) tika-transcribe/src/main/resources/org.apache.tika.transcribe/transcribe.amazon.properties
* (add) tika-transcribe/src/test/resources/en-AU_(A_Little_Bottle_Of_Water).mp3
* (edit) pom.xml
* (add) tika-transcribe/src/test/resources/ja-JP_(We_Are_At_School).mp3
* (edit) tika-example/src/main/java/org/apache/tika/example/TranscribeTranslateExample.java
* (add) tika-transcribe/src/main/java/org/apache/tika/transcribe/AmazonTranscribe.java


> Convert new transcribe package to a Parser along the lines of OCR?
> ------------------------------------------------------------------
>
>                 Key: TIKA-3384
>                 URL: https://issues.apache.org/jira/browse/TIKA-3384
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> This is a proposal to convert [~lewismc] et al.'s awesome new transcribe code into a parser along the lines of the Tesseract OCR parser.
> In 2.x, I inverted the call order from 1.x.  The image parsers now check whether there is a parser that supports a pseudo mime type, like {{image/ocr-jpeg}}; if there is, they apply that parser to the stream.  We could do the same thing with the media files that the new transcription package supports.
> Those who want only OCR/transcription can turn off the image parsers and then decorate the OCR parser, for example, with {{supports "image/jpeg"}}, and that parser will then be called directly.
> What do you think?
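
The call order described in the issue can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not Tika's code: SimpleParser, Registry, OcrParser, ImageParser and demo are hypothetical names invented for the sketch, and the only detail taken from the issue is the pseudo mime type image/ocr-jpeg. The image parser handles image/jpeg itself, then asks the registry whether anything claims the pseudo type and, if so, hands the same bytes to that parser.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class PseudoMimeDispatch {

    /** Hypothetical stand-in for a parser; NOT Tika's Parser interface. */
    interface SimpleParser {
        Set<String> supportedTypes();
        String parse(byte[] data);
    }

    /** Toy registry: media type -> the parser that claims it. */
    static final class Registry {
        private final Map<String, SimpleParser> byType = new HashMap<>();

        void register(SimpleParser parser) {
            for (String type : parser.supportedTypes()) {
                byType.put(type, parser);
            }
        }

        Optional<SimpleParser> find(String type) {
            return Optional.ofNullable(byType.get(type));
        }
    }

    /** Claims only the pseudo type, so it is never picked for image/jpeg directly. */
    static final class OcrParser implements SimpleParser {
        public Set<String> supportedTypes() {
            return Collections.singleton("image/ocr-jpeg");
        }

        public String parse(byte[] data) {
            return "ocr-text"; // placeholder for real OCR output
        }
    }

    /** Handles image/jpeg, then delegates to the pseudo-type parser if one is registered. */
    static final class ImageParser implements SimpleParser {
        private final Registry registry;

        ImageParser(Registry registry) {
            this.registry = registry;
        }

        public Set<String> supportedTypes() {
            return Collections.singleton("image/jpeg");
        }

        public String parse(byte[] data) {
            String own = "metadata(" + data.length + " bytes)";
            return registry.find("image/ocr-jpeg")
                    .map(ocr -> own + " + " + ocr.parse(data))
                    .orElse(own);
        }
    }

    /** Parses a fake JPEG with or without an OCR parser registered. */
    static String demo(boolean ocrAvailable) {
        Registry registry = new Registry();
        registry.register(new ImageParser(registry));
        if (ocrAvailable) {
            registry.register(new OcrParser());
        }
        byte[] fakeJpeg = {1, 2, 3};
        return registry.find("image/jpeg")
                .orElseThrow(IllegalStateException::new)
                .parse(fakeJpeg);
    }

    public static void main(String[] args) {
        System.out.println(demo(true));  // metadata(3 bytes) + ocr-text
        System.out.println(demo(false)); // metadata(3 bytes)
    }
}
```

In this toy model, the "OCR only" setup from the last paragraph of the issue would correspond to leaving ImageParser out and registering a variant of OcrParser that claims image/jpeg itself. A transcription parser would presumably follow the same pattern with an audio or video pseudo type; the issue does not name one.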



--
This message was sent by Atlassian Jira
(v8.3.4#803005)