Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2023/01/20 15:01:41 UTC

[GitHub] [beam] damccorm commented on a diff in pull request #25083: Multi language/runinference example

damccorm commented on code in PR #25083:
URL: https://github.com/apache/beam/pull/25083#discussion_r1082651068


##########
sdks/python/apache_beam/examples/inference/multi_language_inference/last_word_prediction/src/main/java/org/MultiLangRunInference.java:
##########
@@ -0,0 +1,89 @@
+package org;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.extensions.python.PythonExternalTransform;
+import org.apache.beam.sdk.io.TextIO;
+import org.apache.beam.sdk.options.Description;
+import org.apache.beam.sdk.options.PipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.options.Validation.Required;
+import org.apache.beam.sdk.values.PCollection;
+
+public class MultiLangRunInference {
+    public interface MultiLanguageOptions extends PipelineOptions {
+
+        @Description("Path to an input file that contains the text to feed into the model")
+        @Required
+        String getInputFile();
+
+        void setInputFile(String value);
+
+        @Description("Path to a stored model.")
+        @Required
+        String getModelPath();
+
+        void setModelPath(String value);
+
+        @Description("Path to the output file where predictions are written.")
+        @Required
+        String getOutputFile();
+
+        void setOutputFile(String value);
+
+        @Description("Name of the model on HuggingFace.")
+        @Required
+        String getModelName();
+
+        void setModelName(String value);
+
+        @Description("Port number of the expansion service.")
+        @Required
+        String getPort();
+
+        void setPort(String value);
+    }
+
+    public static void main(String[] args) {
+
+        MultiLanguageOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
+                .as(MultiLanguageOptions.class);
+
+        Pipeline p = Pipeline.create(options);
+        PCollection<String> input = p.apply("Read Input", TextIO.read().from(options.getInputFile()));
+        
+        /* For Beam 2.44.0 and later:
+        List<String> local_packages = new ArrayList<String>();
+        local_packages.add("multi_language_custom_transform");
+        */
+        List<String> packages = new ArrayList<String>();
+        input.apply("Predict", PythonExternalTransform.<PCollection<String>, PCollection<String>>from(
+                "multi_language_custom_transform.composite_transform.InferenceTransform", "localhost:" + options.getPort())
+                .withKwarg("model", options.getModelName())
+                .withKwarg("model_path", options.getModelPath())
+                // .withExtraPackages(local_packages)
+                )
+                .apply("Write Output", TextIO.write().to(options.getOutputFile()));
+
+        p.run().waitUntilFinish();
+    }
+}

Review Comment:
   Now that 2.44's release candidate has been accepted (and it should be shortly published), can we default to the 2.44.0 implementation?



##########
website/www/site/content/en/documentation/ml/multi-language-inference.md:
##########
@@ -0,0 +1,161 @@
+---
+title: "Cross Language RunInference"
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Cross Language RunInference
+
+The pipeline in this example is written in Java and reads the input data from Google Cloud Storage. With the help of a [PythonExternalTransform](https://beam.apache.org/documentation/programming-guide/#1312-creating-cross-language-python-transforms),
+a composite Python transform is called to do the preprocessing, postprocessing, and inference.
+Finally, the Java pipeline writes the data back to Google Cloud Storage.
+
+## NLP model and dataset
+A `bert-base-uncased` natural language processing (NLP) model is used to run inference. This model is open source and available on [HuggingFace](https://huggingface.co/bert-base-uncased). This BERT model is
+used to predict the last word of a sentence based on the context of the sentence.
+
+We also use an [IMDB movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv) dataset, which is an open-source dataset available on Kaggle.
+
+The following is a sample of the data after preprocessing:
+
+| **Text** 	|   **Last Word** 	|
+|---	|:---	|
+|<img width=700/>|<img width=100/>|
+| One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be [MASK] 	| hooked 	|
+| A wonderful little [MASK] 	| production 	|
+| So im not a big fan of Boll's work but then again not many [MASK] 	| are 	|
+| This a fantastic movie of three prisoners who become [MASK] 	| famous 	|
+| Some films just simply should not be [MASK] 	| remade 	|
+| The Karen Carpenter Story shows a little more about singer Karen Carpenter's complex [MASK] 	| life 	|
+
+You can see the full code used in this example on [Github](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/inference/multi_language_inference).
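The preprocessing that produces rows like those in the table above can be sketched in plain Python. This is a hypothetical helper, not code from this PR: it keeps each review's prefix and swaps its last word for the `[MASK]` token that BERT expects, returning the masked word as the label.

```python
def mask_last_word(sentence: str) -> tuple[str, str]:
    """Replace the last word of a sentence with [MASK].

    Returns the masked sentence and the word that was removed,
    mirroring the Text / Last Word columns above.
    """
    words = sentence.strip().split()
    if not words:
        raise ValueError("empty sentence")
    last_word = words[-1]
    masked = " ".join(words[:-1] + ["[MASK]"])
    return masked, last_word

# Example:
# mask_last_word("A wonderful little production")
# -> ("A wonderful little [MASK]", "production")
```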
+
+
+## Multi-language Inference pipeline
+
+When using multi-language pipelines, you have access to a much larger pool of transforms. For more information, see [Multi-language pipelines](https://beam.apache.org/documentation/programming-guide/#multi-language-pipelines) in the Apache Beam Programming Guide.
+
+### Cross-Language Python transform
+In addition to running inference, we also need to perform preprocessing and postprocessing on the data. Processing the data makes it possible to interpret the output. To carry out these three tasks, a single custom composite PTransform is written, with a separate DoFn or PTransform for each task, as shown in the following example:

Review Comment:
   Could we add a sentence here about running inference with the built in x-lang RunInference transform? Something along the lines of:
   
   `To run inference from Java, you can use the [cross-language RunInference transform](https://github.com/apache/beam/blob/master/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/transforms/RunInference.java). The [Java Sklearn Mnist Classification example](https://github.com/apache/beam/tree/master/examples/multi-language) demonstrates how you can use this transform if you don't have pre/post-processing that must be done in Python. In this example, in addition to running inference, we also want to perform preprocessing and postprocessing on the data in Python. ....`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org