Posted to dev@tika.apache.org by GitBox <gi...@apache.org> on 2021/02/26 05:44:28 UTC

[GitHub] [tika] lewismc commented on a change in pull request #406: WIP: TIKA-94 Speech recognition

lewismc commented on a change in pull request #406:
URL: https://github.com/apache/tika/pull/406#discussion_r583392435



##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*

Review comment:
       Please rename the interface from `Transcriber.java` to `Transcribe.java`.
   Why?
   The interface doesn't do the transcribing; the implementation does.

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO

Review comment:
       Excellent. Thank you for adding this. We will populate it when we complete the pull request.

##########
File path: tika-core/pom.xml
##########
@@ -84,6 +84,12 @@
       <artifactId>junit</artifactId>
       <scope>test</scope>
     </dependency>
+      <dependency>

Review comment:
       Please push this into the `tika-translate` module

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+    /**
+     * @return

Review comment:
       First, we need a description of the interface. This is REALLY important.
   Next we add the parameters,
   then `@throws`,
   then `@return`.
   
   This method signature needs to change. It is too tightly coupled to the AWS Transcribe input. Please model the interface on the `tika-translate` API.
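
A minimal sketch of what a backend-neutral interface could look like, with Javadoc in the order requested above (description, parameters, `@throws`, `@return`). Every name here (`TranscribeSketch`, `EchoTranscribe`, the `transcribe` overloads, the `demo` helper) is hypothetical and not taken from the pull request. Only `IOException` is declared so that the example compiles without Tika on the classpath; the real interface would presumably also declare `TikaException`, as the PR's methods do.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: none of these names come from the pull request.
class TranscribeSketch {

    /**
     * Backend-neutral contract for speech-to-text services.
     */
    interface Transcribe {

        /**
         * Transcribes audio and leaves language detection to the backend.
         *
         * @param audio the audio content to transcribe
         * @throws IOException if the audio cannot be read or the backend call fails
         * @return the transcribed text
         */
        String transcribe(InputStream audio) throws IOException;

        /**
         * Transcribes audio spoken in a known language.
         *
         * @param audio the audio content to transcribe
         * @param sourceLanguage language tag of the audio, for example "en-US"
         * @throws IOException if the audio cannot be read or the backend call fails
         * @return the transcribed text
         */
        String transcribe(InputStream audio, String sourceLanguage) throws IOException;

        /**
         * Reports whether the backend is configured and usable.
         *
         * @return true if transcription can be attempted
         */
        boolean isAvailable();
    }

    /** Toy backend for illustration: it "transcribes" by decoding the bytes as UTF-8. */
    static class EchoTranscribe implements Transcribe {
        @Override
        public String transcribe(InputStream audio) throws IOException {
            return transcribe(audio, "und"); // "und" = undetermined language
        }

        @Override
        public String transcribe(InputStream audio, String sourceLanguage) throws IOException {
            return new String(audio.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
        }

        @Override
        public boolean isAvailable() {
            return true;
        }
    }

    /** Runs the toy backend through the interface type only. */
    static String demo(String fakeAudio) {
        Transcribe transcriber = new EchoTranscribe();
        try (InputStream in = new ByteArrayInputStream(fakeAudio.getBytes(StandardCharsets.UTF_8))) {
            return transcriber.transcribe(in, "en-US");
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("where is the bus stop?"));
    }
}
```

Job names, bucket names and SDK enums do not appear in any signature, so a different backend could implement the same three methods.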

##########
File path: tika-transcribe/src/main/resources/org/apache/tika/transcribe/transcribe/transcribe.amazon.properties
##########
@@ -0,0 +1,18 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+transcribe.AWS_ACCESS_KEY=dummy_key
+transcribe.AWS_SECRET_KEY=dummy_key
+transcribe.BUCKET_NAME=dummy_name

Review comment:
       I feel that we need to push more out of the interface and into the implementation. The same goes for pushing more backend-specific method parameters into this config file.
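
One way to read this suggestion, as a sketch only: the implementation loads its backend-specific settings from the properties file, so callers never pass them as method arguments. `transcribe.BUCKET_NAME` is a key from the file above; `transcribe.SOURCE_LANGUAGE` and the `TranscribeConfigSketch`/`AmazonSettings` names are hypothetical and only illustrate the idea.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch: class and method names are illustrative only.
class TranscribeConfigSketch {

    /** Holds backend settings so callers never pass them as method arguments. */
    static final class AmazonSettings {
        final String bucketName;
        final String sourceLanguage; // null means "let the backend detect it"

        AmazonSettings(Properties config) {
            this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
            // Hypothetical key, not present in the pull request's properties file.
            this.sourceLanguage = config.getProperty("transcribe.SOURCE_LANGUAGE");
        }
    }

    /** Parses properties text; a real implementation would load the classpath resource instead. */
    static AmazonSettings parse(String propertiesText) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read configuration", e);
        }
        return new AmazonSettings(config);
    }

    public static void main(String[] args) {
        AmazonSettings settings = parse(
                "transcribe.BUCKET_NAME=dummy_name\n"
              + "transcribe.SOURCE_LANGUAGE=en-US\n");
        System.out.println(settings.bucketName + " " + settings.sourceLanguage);
    }
}
```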

##########
File path: tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",

Review comment:
       Is this not available from the AWS Java API? Otherwise this is difficult to maintain.
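
To illustrate the maintenance point: every Java enum provides `values()`, so if the SDK models languages as an enum (the PR already imports `LanguageCode`), the set of valid tags can be derived from it instead of being kept in a hand-written list. The sketch below uses a stand-in enum, `SourceLanguage`, so that it runs without the AWS dependency. The stand-in, its three constants and the helper names are assumptions for illustration, not the SDK's actual contents.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: SourceLanguage is a stand-in, not the AWS SDK enum.
class LanguageLookupSketch {

    /** Stand-in for an SDK-provided enum; deliberately incomplete. */
    enum SourceLanguage {
        EN_US("en-US"), EN_GB("en-GB"), FR_FR("fr-FR");

        private final String tag;

        SourceLanguage(String tag) {
            this.tag = tag;
        }

        @Override
        public String toString() {
            return tag;
        }
    }

    /** Builds the valid-language set from the enum itself, so there is no second list to maintain. */
    static Set<String> supportedTags() {
        return Arrays.stream(SourceLanguage.values())
                .map(SourceLanguage::toString)
                .collect(Collectors.toSet());
    }

    /** Checks a caller-supplied tag against the derived set. */
    static boolean isSupported(String tag) {
        return supportedTags().contains(tag);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("en-US"));
        System.out.println(isSupported("xx-XX"));
    }
}
```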

##########
File path: tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+//            "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR"));  // Valid inputs to StartStreamTranscription for language of source file (audio)
+
+    public AmazonTranscribe() {
+        this.isAvailable = true;
+        Properties config = new Properties();
+        try {
+            config.load(AmazonTranscribe.class
+                    .getResourceAsStream(
+                            "transcribe.amazon.properties"));
+            this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+            this.clientSecret = config.getProperty("transcribe.AWS_SECRET_KEY");
+            this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+        } catch (Exception e) {
+            LOG.warn("Exception reading config file", e);
+            isAvailable = false;
+        }
+    }
+
+    
+    /**
+     * Audio to text function without language specification
+     * @param fileName
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeAudio(String fileName, String jobName) throws TikaException, IOException {
+        if (!isAvailable())
+            return;
+        StartTranscriptionJobRequest startTranscriptionJobRequest = new StartTranscriptionJobRequest();
+        Media media = new Media();
+        media.setMediaFileUri(amazonS3.getUrl(bucketName, fileName).toString());

Review comment:
       What about source language?

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;

Review comment:
       This should be
   ```
   Transcribe transcriber;
   ```
   

##########
File path: tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+//            "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR"));  // Valid inputs to StartStreamTranscription for language of source file (audio)
+
+    public AmazonTranscribe() {
+        this.isAvailable = true;
+        Properties config = new Properties();
+        try {
+            config.load(AmazonTranscribe.class
+                    .getResourceAsStream(
+                            "transcribe.amazon.properties"));
+            this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+            this.clientSecret = config.getProperty("transcribe.AWS_SECRET_KEY");
+            this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+        } catch (Exception e) {
+            LOG.warn("Exception reading config file", e);
+            isAvailable = false;
+        }
+    }
+
+    
+    /**
+     * Audio to text function without language specification
+     * @param fileName
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeAudio(String fileName, String jobName) throws TikaException, IOException {
+        if (!isAvailable())
+            return;
+        StartTranscriptionJobRequest startTranscriptionJobRequest = new StartTranscriptionJobRequest();
+        Media media = new Media();
+        media.setMediaFileUri(amazonS3.getUrl(bucketName, fileName).toString());
+        startTranscriptionJobRequest.withMedia(media)
+                .withOutputBucketName(this.bucketName)
+                .setTranscriptionJobName(jobName);
+        amazonTranscribe.startTranscriptionJob(startTranscriptionJobRequest);
+    }
+
+    /**
+     * Audio to text function with language specification
+     * @param fileName
+     * @param sourceLanguage
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeAudio(String fileName, LanguageCode sourceLanguage, String jobName) throws TikaException, IOException {
+        if (!isAvailable())
+			return;
+        StartTranscriptionJobRequest startTranscriptionJobRequest = new StartTranscriptionJobRequest();
+        Media media = new Media();
+        media.setMediaFileUri(amazonS3.getUrl(bucketName, fileName).toString());
+        startTranscriptionJobRequest.withMedia(media)
+                .withLanguageCode(sourceLanguage)
+                .withOutputBucketName(this.bucketName)
+                .setTranscriptionJobName(jobName);
+        amazonTranscribe.startTranscriptionJob(startTranscriptionJobRequest);
+    }
+
+    @Override
+    public void startTranscribeVideo(String fileName, String jobName) throws TikaException, IOException {
+        if (!isAvailable())
+            return;
+        //TODO
+
+    }
+
+    /**
+     * Audio to text function with language specification
+     * @param fileName
+     * @param sourceLanguage
+     * @return Transcribed text
+     * @throws TikaException
+     * @throws IOException
+     */
+    @Override
+    public void startTranscribeVideo(String fileName, LanguageCode sourceLanguage, String jobName) throws TikaException, IOException {
+        if (!isAvailable())
+            return;
+        //boolean validSourceLanguageFlag = validSourceLanguages.contains(sourceLanguage); // Checks if sourceLanguage in validSourceLanguages O(1) lookup time
+
+        //if (!validSourceLanguageFlag) { // Throws TikaException if the input sourceLanguage is not present in validSourceLanguages
+        //    throw new TikaException("Provided Source Language is Not Valid. Run without language parameter or please select one of: " +
+        //           "en-US, en-GB, es-US, fr-CA, fr-FR, en-AU, it-IT, de-DE, pt-BR, ja-JP, ko-KR"); }
+        //TODO
+
+    }
+
+    /**
+     * @return Valid AWS Credentials
+     */
+	public boolean isAvailable() {
+		return this.isAvailable;
+	}
+
+    /** Gets Transcriptioni result from AWS S3 bucket given bucketNamee and key
+     * @param key
+     * @return
+     */
+    @Override
+    public String getTranscriptResult(String key) {
+        TranscriptionJob transcriptionJob = retrieveObjectWhenJobCompleted(key);
+        if (transcriptionJob != null && !TranscriptionJobStatus.FAILED.equals(transcriptionJob.getTranscriptionJobStatus())) {
+            return amazonS3.getObjectAsString(this.bucketName, key + ".json");
+        } else
+            return null;
+    }
+
+    /**
+     * Private helper function to get object from s3
+     * @param key
+     * @return
+     */
+    private TranscriptionJob retrieveObjectWhenJobCompleted(String key) {
+        GetTranscriptionJobRequest getTranscriptionJobRequest = new GetTranscriptionJobRequest();
+        getTranscriptionJobRequest.setTranscriptionJobName(key);
+
+        while (true) {
+            GetTranscriptionJobResult innerResult = amazonTranscribe.getTranscriptionJob(getTranscriptionJobRequest);
+            String status = innerResult.getTranscriptionJob().getTranscriptionJobStatus();
+            if (TranscriptionJobStatus.COMPLETED.name().equals(status) ||
+                    TranscriptionJobStatus.FAILED.name().equals(status)) {
+                return innerResult.getTranscriptionJob();
+            }
+        }
+    }
+
+    /**
+     * Call this method in order to upload a file to the Amazon S3 bucket.
+     * @param bucketName
+     * @param fileName
+     * @param fullFileName
+     */
+    @Override
+    public void uploadFileToBucket(String bucketName, String fileName, String fullFileName) {

Review comment:
       This needs to exist in the AWS implementation but NOT in the Transcribe Interface. 

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+    /**
+     * @return
+     * @param fileName
+     * @param jobName
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeAudio(String fileName, String jobName) throws TikaException, IOException;
+
+    /**
+     * @return
+     * @param fileName
+     * @param sourceLanguage
+     * @param jobName
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeAudio(String fileName, LanguageCode sourceLanguage, String jobName) throws TikaException, IOException;
+
+    /**
+     * @return
+     * @param fileName
+     * @param jobName
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeVideo(String fileName, String jobName) throws TikaException, IOException;
+
+    /**
+     * @return
+     * @param fileName
+     * @param jobName
+     * @param sourceLanguage
+     * @throws TikaException       When there is an error translating.
+     * @throws java.io.IOException
+     * @since TODO
+     */
+    public void startTranscribeVideo(String fileName, LanguageCode sourceLanguage, String jobName) throws TikaException, IOException;
+
+    /**
+     * Gets transcription result from S3
+     * @param key
+     * @return
+     */
+    public String getTranscriptResult(String key);
+
+    /**
+     * Upload file to s3
+     * @param bucketName
+     * @param fileName
+     * @param filePath
+     */
+    public void uploadFileToBucket(String bucketName, String fileName, String filePath);

Review comment:
       This should never be in an interface. It is WAY too AWS-specific.
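
A small sketch of the separation being asked for, under stated assumptions: the interface exposes only the transcription call, and the storage step lives inside the implementation. The in-memory map stands in for a bucket, and all names (`UploadPlacementSketch`, `BucketBackedTranscribe`, `upload`) are hypothetical. Making the helper private is one design choice; it could equally be a public method of the AWS implementation, as long as it stays off the interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and behaviour are illustrative only.
class UploadPlacementSketch {

    /** Backend-neutral contract: nothing here mentions buckets or keys. */
    interface Transcribe {
        String transcribe(String filePath);
    }

    /** Toy stand-in for a storage-backed service; the map plays the role of a bucket. */
    static class BucketBackedTranscribe implements Transcribe {
        private final Map<String, String> bucket = new HashMap<>();

        @Override
        public String transcribe(String filePath) {
            String key = upload(filePath); // storage step stays inside the implementation
            return "transcript of " + bucket.get(key);
        }

        /** Implementation detail, so it is not part of the interface. */
        private String upload(String filePath) {
            String key = "job-" + bucket.size();
            bucket.put(key, filePath);
            return key;
        }
    }

    public static void main(String[] args) {
        Transcribe transcriber = new BucketBackedTranscribe();
        System.out.println(transcriber.transcribe("ShortAudioSample.mp3"));
    }
}
```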

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+

Review comment:
       Remove whitespace

##########
File path: tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*

Review comment:
       Please look at the package naming here...
   ```
   tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/
   ```
   should be
   ```
   tika-transcribe/src/main/java/org/apache/tika/transcribe
   ```
   
   

##########
File path: tika-core/src/main/java/org/apache/tika/transcribe/Transcriber.java
##########
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe;
+
+import org.apache.tika.exception.TikaException;
+
+import java.io.IOException;
+
+import com.amazonaws.services.transcribe.model.LanguageCode;
+
+
+/**
+ * Interface for Transcriber services.
+ *
+ * @since Tika TODO
+ */
+public interface Transcriber {
+    /**
+     * @return
+     * @param fileName

Review comment:
       Also, what about the language the transcription service should work on?

##########
File path: tika-transcribe/pom.xml
##########
@@ -0,0 +1,144 @@
+<?xml version="1.0" encoding="UTF-8"?>
+
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+
+    <parent>
+        <groupId>org.apache.tika</groupId>
+        <artifactId>tika-parent</artifactId>
+        <version>2.0.0-SNAPSHOT</version>
+        <relativePath>../tika-parent/pom.xml</relativePath>
+    </parent>
+
+    <artifactId>tika-transcribe</artifactId>
+    <packaging>bundle</packaging>
+    <name>Apache Tika transcribe</name>
+    <url>http://tika.apache.org/</url>
+    <!--TODO use latest aws version or the one defined in the tika-parent-->

Review comment:
       Defining in `tika-parent` is fine but not in `tika-core`

##########
File path: tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;

Review comment:
       Please order all `import`s alphabetically

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+        /*
+        URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? */
+        //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioLongTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of LongAudioSample.mp3
+        String audioFilePath = "src/test/resources/LongAudioSample.mp3";

Review comment:
       Where is this file?

##########
File path: tika-transcribe/src/main/java/org/apache/tika/transcribe/transcribe/AmazonTranscribe.java
##########
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.transcribe.transcribe;
+import java.io.File;
+
+import com.amazonaws.services.transcribe.model.*;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.transcribe.Transcriber;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.amazonaws.services.s3.AmazonS3;
+import com.amazonaws.services.s3.model.PutObjectRequest;
+import com.amazonaws.services.transcribe.AmazonTranscribeAsync;
+
+import java.io.IOException;
+import java.util.Properties;
+
+
+public class AmazonTranscribe implements Transcriber {
+
+    private AmazonTranscribeAsync amazonTranscribe;
+
+    private AmazonS3 amazonS3;
+
+    private static final Logger LOG = LoggerFactory.getLogger(AmazonTranscribe.class);
+
+    private String bucketName;
+
+    private boolean isAvailable; // Flag for whether or not translation is available.
+
+    private String clientId;
+
+    private String clientSecret;  // Keys used for the API calls.
+
+//    private HashSet<String> validSourceLanguages = new HashSet<>(Arrays.asList("en-US", "en-GB", "es-US", "fr-CA", "fr-FR", "en-AU",
+//            "it-IT", "de-DE", "pt-BR", "ja-JP", "ko-KR"));  // Valid inputs to StartStreamTranscription for language of source file (audio)
+
+    public AmazonTranscribe() {
+        this.isAvailable = true;
+        Properties config = new Properties();
+        try {
+            config.load(AmazonTranscribe.class
+                    .getResourceAsStream(
+                            "transcribe.amazon.properties"));
+            this.clientId = config.getProperty("transcribe.AWS_ACCESS_KEY");
+            this.clientSecret = config.getProperty("transcribe.AWS_SECRET_KEY");
+            this.bucketName = config.getProperty("transcribe.BUCKET_NAME");
+
+        } catch (Exception e) {
+            LOG.warn("Exception reading config file", e);
+            isAvailable = false;
+        }
+    }
+
+    
+    /**
+     * Audio to text function without language specification
+     * @param fileName
+     * @return Transcribed text

Review comment:
       Please populate all of this Javadoc based upon the guidance I provided above. 
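
For illustration only, a fully populated Javadoc block for a method of this shape might look like the sketch below. `TranscriberSketch` is a hypothetical stand-in, not the interface in this pull request; the parameter description and the documented exception are assumptions, and the checked exceptions the real method may declare are left out so the sketch compiles without Tika on the classpath.

```java
/**
 * Hypothetical stand-in used only to show populated Javadoc.
 * It is not the Transcriber interface from this pull request.
 */
interface TranscriberSketch {

    /**
     * Transcribes the given audio file to text without a caller-supplied
     * source language, leaving language detection to the implementation.
     *
     * @param fileName path to the audio file to transcribe; assumed here
     *                 to be a path on the local file system
     * @return the transcribed text
     * @throws IllegalArgumentException if {@code fileName} is null or empty
     *         (an assumed contract, documented for illustration)
     */
    String transcribeAudio(String fileName);
}
```

Each tag states something the signature alone does not: what the parameter refers to, what the return value contains, and when the call fails.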

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+        /*
+        URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? */
+        //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";

Review comment:
       Where is this file?
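
On the TODO in the hunk above about obtaining the file path: one possible approach, sketched here under assumptions, is to resolve the sample from the test classpath and treat a missing file as an explicit condition rather than relying on a path relative to the working directory. `TestResourceLocator` and `find` are hypothetical names, not part of this pull request.

```java
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper, not part of the pull request. */
final class TestResourceLocator {

    private TestResourceLocator() {
    }

    /**
     * Looks up a resource by name on the classpath.
     * Returns an empty Optional when the resource is absent or when its
     * URL cannot be converted to a file-system path (for example, when
     * the resource is packaged inside a jar).
     */
    static Optional<Path> find(String name) {
        URL url = TestResourceLocator.class.getClassLoader().getResource(name);
        if (url == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Paths.get(url.toURI()));
        } catch (URISyntaxException | RuntimeException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // File name taken from the test under review.
        String name = "ShortAudioSample.mp3";
        Optional<Path> audio = find(name);
        System.out.println(audio.map(p -> "found: " + p).orElse("missing: " + name));
    }
}
```

A test could call `find` first and skip or fail with a clear message when the Optional is empty, which makes a missing sample file visible immediately.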

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+        /*
+        URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? */
+        //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioLongTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of LongAudioSample.mp3
+        String audioFilePath = "src/test/resources/LongAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageShortVideoTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortVideoSample.mp4
+        String videoFilePath = "src/test/resources/ShortVideoSample.mp4";

Review comment:
       Where is this file?

##########
File path: tika-transcribe/src/test/java/org/apache/tika/transcibe/transcibe/AmazonTranscribeGuessLanguageTest.java
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.transcibe.transcibe;
+
+import org.apache.tika.transcribe.transcribe.AmazonTranscribe;
+import org.junit.Before;
+import org.junit.Test;
+
+import static junit.framework.TestCase.assertNotNull;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.fail;
+
+public class AmazonTranscribeGuessLanguageTest {
+    AmazonTranscribe transcriber;
+
+    @Before
+    public void setUp() {
+        transcriber = new AmazonTranscribe();
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioShortTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortAudioSample.mp3
+        /*
+        URL res = getClass().getClassLoader().getResource("ShortAudioSample.mp3");
+        File file = Paths.get(res.toURI()).toFile();
+        String absolutePath = file.getAbsolutePath();
+        Necessary to get the correct file path from our test resource folder? */
+        //TODO: is the above commented block necessary to obtain the proper filepath for a file located in the tika-translate/test/resources directory?
+
+        String audioFilePath = "src/test/resources/ShortAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageAudioLongTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of LongAudioSample.mp3
+        String audioFilePath = "src/test/resources/LongAudioSample.mp3";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeAudio(audioFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageShortVideoTest() {
+        String expected = "where is the bus stop? where is the bus stop?";
+        //TODO: "expected" should be changed to reflect the contents of ShortVideoSample.mp4
+        String videoFilePath = "src/test/resources/ShortVideoSample.mp4";
+        String result = null;
+
+        if (transcriber.isAvailable()) {
+            try {
+                result = transcriber.transcribeVideo(videoFilePath);
+                assertNotNull(result);
+                assertEquals("Result: [" + result
+                                + "]: not equal to expected: [" + expected + "]",
+                        expected, result);
+            } catch (Exception e) {
+                e.printStackTrace();
+                fail(e.getMessage());
+            }
+        }
+    }
+
+    @Test
+    public void AmazonTranscribeGuessLanguageLongVideoTest() {
+        String expected = "hello sir";
+        //TODO: "expected" should be changed to reflect the contents of LongVideoSample.mp4
+        String videoFilePath = "src/test/resources/LongVideoSample.mp4";

Review comment:
       Where is this file?



