Posted to github@beam.apache.org by "sirenbyte (via GitHub)" <gi...@apache.org> on 2023/04/12 01:06:27 UTC

[GitHub] [beam] sirenbyte opened a new pull request, #26227: add cross-language

sirenbyte opened a new pull request, #26227:
URL: https://github.com/apache/beam/pull/26227

   **Please** add a meaningful description for your change here
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] sirenbyte commented on a diff in pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "sirenbyte (via GitHub)" <gi...@apache.org>.
sirenbyte commented on code in PR #26227:
URL: https://github.com/apache/beam/pull/26227#discussion_r1185033910


##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi-language pipelines
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+
+Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.

Review Comment:
   Done



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi-language pipelines
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+1. Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+2. Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process other data, and Go to perform a specific computation.
+
+3. Use Apache Beam's cross-language support to connect the different parts of your pipeline. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between languages.
+
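The "common data model and serialization format" idea can be illustrated outside Beam with a minimal round trip. JSON stands in for Beam's portable coders here; this is not the Beam wire format, just a sketch of the principle:

```python
import json

# A record produced by one SDK (say, Java) and consumed by another (say, Python).
record = {"key": "word", "count": 3}

# Producer side: serialize to a language-neutral byte representation.
wire_bytes = json.dumps(record).encode("utf-8")

# Consumer side: any SDK that understands the format decodes the same bytes
# back into its own native types.
decoded = json.loads(wire_bytes.decode("utf-8"))
print(decoded == record)  # True
```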
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+1. Choose your SDKs: decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently provides Python, Java, and Go SDKs.
+
+2. Set up the dependencies: make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java.
+
+3. Create a pipeline: using the primary language of your choice, create a pipeline object with the respective SDK. This pipeline serves as the main entry point for your multi-language pipeline.
+
+4. Use cross-language transforms: to execute transforms written in other languages, use the `ExternalTransform` class (in Python) or the `External` class (in Java). This lets you use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform.
+
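As a rough sketch of the URN-based lookup behind cross-language transforms: an expansion service publishes each transform under a URN and maps that URN back to a builder when a pipeline asks for it. The registry and builder below are plain-Python illustrations, not Beam APIs; only the URN string is taken from the examples in this module.

```python
# Illustrative URN-keyed transform registry (not a Beam API).
registry = {}

def register(urn):
    """Decorator that publishes a builder under the given URN."""
    def decorator(builder):
        registry[urn] = builder
        return builder
    return decorator

@register("my.beam.transform.javacount")
def build_count_transform():
    # Stand-in for constructing the real transform on the service side.
    def count_per_element(elements):
        counts = {}
        for e in elements:
            counts[e] = counts.get(e, 0) + 1
        return counts
    return count_per_element

# A client that knows only the URN can obtain and apply the transform.
transform = registry["my.beam.transform.javacount"]()
print(transform(["a", "b", "a"]))  # {'a': 2, 'b': 1}
```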
+{{if (eq .Sdk "java")}}
+
+#### Start an expansion service
+
+When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK.
+
+In the general case, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service.
+
+For example, to start the standard Python expansion service, `ExpansionServiceServicer`, follow these steps:
+
+* Create and activate a new Python virtual environment.
+
+* Install Apache Beam with the `gcp` and `dataframe` extras:
+```
+pip install 'apache-beam[gcp,dataframe]'
+```
+* Run the following command:
+```
+python -m apache_beam.runners.portability.expansion_service_main -p <PORT> --fully_qualified_name_glob "*"
+```
+
+The command runs `expansion_service_main.py`, which starts the standard expansion service. When you use Gradle to run your Java pipeline, you can specify the expansion service with the `expansionService` option. For example: `--expansionService=localhost:<PORT>`.
+
+#### Run Java program
+
+```
+private String getModelLoaderScript() {
+        String s = "from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy\n";
+        s = s + "from apache_beam.ml.inference.base import KeyedModelHandler\n";
+        s = s + "def get_model_handler(model_uri):\n";
+        s = s + "  return KeyedModelHandler(SklearnModelHandlerNumpy(model_uri))\n";
+
+        return s;
+}
+
+void runExample(SklearnMnistClassificationOptions options, String expansionService) {
+        // Schema of the output PCollection Row type to be provided to the RunInference transform.
+        Schema schema =
+                Schema.of(
+                        Schema.Field.of("example", Schema.FieldType.array(Schema.FieldType.INT64)),
+                        Schema.Field.of("inference", Schema.FieldType.STRING));
+
+        Pipeline pipeline = Pipeline.create(options);
+        PCollection<KV<Long, Iterable<Long>>> col =
+                pipeline
+                        .apply(TextIO.read().from(options.getInput()))
+                        .apply(Filter.by(new FilterNonRecordsFn()))
+                        .apply(MapElements.via(new RecordsToLabeledPixelsFn()));
+
+        col.apply(RunInference.ofKVs(getModelLoaderScript(), schema, VarLongCoder.of())
+                                .withKwarg("model_uri", options.getModelPath())
+                                .withExpansionService(expansionService))
+                .apply(MapElements.via(new FormatOutput()))
+                .apply(TextIO.write().to("out.txt"));
+
+        pipeline.run().waitUntilFinish();
+}
+```
+
+> **Note**: The Python expansion service may not work on Windows. The **expansion service** and the **Java pipeline** must run in the same environment. If the **expansion service** runs in a **Docker** container, run the **Java pipeline** inside that container as well.
+
+#### Run in Docker
+
+A Dockerfile that runs both the **expansion service** and the **Java program**:
+```
+# Use an official Java base image
+FROM openjdk:11
+
+# Install Python 3 and wget (the python/python-pip packages are unavailable on newer Debian bases)
+RUN apt-get update && \
+    apt-get install -y python3 python3-pip wget
+
+# Set the working directory
+WORKDIR /app
+
+# Copy Java and Python source files to the working directory
+COPY src /app/src
+COPY scripts /app/scripts
+
+# Install Python dependencies (each RUN must be a complete command)
+RUN pip3 install --upgrade pip && \
+    pip3 install apache-beam==2.46.0 && \
+    pip3 install --default-timeout=100 torch torchvision pandas scikit-learn
+
+# SPRING_VERSION must be supplied at build time, e.g. --build-arg SPRING_VERSION=<version>
+ARG SPRING_VERSION
+RUN wget https://repo1.maven.org/maven2/org/springframework/spring-expression/$SPRING_VERSION/spring-expression-$SPRING_VERSION.jar && \
+    mkdir -p /opt/apache/beam/jars && \
+    mv spring-expression-$SPRING_VERSION.jar /opt/apache/beam/jars/spring-expression.jar
+
+# Compile the Java program
+RUN javac -d /app/bin /app/src/MyJavaProgram.java
+
+# Start the expansion service first, then run the Java pipeline against it
+ENTRYPOINT ["bash", "-c", "python3 -m apache_beam.runners.portability.expansion_service_main -p 9090 --fully_qualified_name_glob='*' & sleep 5 && java -cp /app/bin MyJavaProgram"]
+```
+{{end}}
+
+
+{{if (eq .Sdk "python")}}
+
+
+Write your own Java expansion service:
+```
+import com.google.auto.service.AutoService;
+import com.google.common.collect.ImmutableMap;
+import org.apache.beam.sdk.expansion.ExternalTransformRegistrar;
+import org.apache.beam.sdk.transforms.Count;
+import org.apache.beam.sdk.transforms.ExternalTransformBuilder;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+import java.util.Map;
+
+public class Task {
+    static class JavaCount extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
+
+        public JavaCount() {
+        }
+
+        @Override
+        public PCollection<KV<String, Long>> expand(PCollection<String> input) {
+            return input
+                    .apply(
+                            "JavaCount", Count.perElement());
+        }
+    }
+
+    static class JavaCountBuilder implements
+            ExternalTransformBuilder<JavaCountConfiguration, PCollection<String>, PCollection<KV<String, Long>>> {
+
+        @Override
+        public PTransform<PCollection<String>, PCollection<KV<String, Long>>> buildExternal(
+                JavaCountConfiguration configuration) {
+            return new JavaCount();
+        }
+    }
+
+    static class JavaCountConfiguration {
+
+    }
+
+    @AutoService(ExternalTransformRegistrar.class)
+    public static class JavaCountRegistrar implements ExternalTransformRegistrar {
+
+        static final String URN = "my.beam.transform.javacount";
+
+        @Override
+        public Map<String, ExternalTransformBuilder<?, ?, ?>> knownBuilderInstances() {
+            return ImmutableMap.of(URN, new JavaCountBuilder());
+        }
+    }
+}
+```
+
+Build the jar and run it as an expansion service on port 12345:
+```
+mvn clean
+mvn package -DskipTests
+cd target
+java -jar java-count-bundled-0.1.jar 12345
+```
+
+Python pipeline:
+```
+import logging
+import re
+import typing
+
+import apache_beam as beam
+from apache_beam.io import ReadFromText
+from apache_beam.io import WriteToText
+from apache_beam.options.pipeline_options import PipelineOptions
+
+
+class WordExtractingDoFn(beam.DoFn):
+  """Parse each line of input text into words."""
+  def process(self, element):
+    """Returns an iterator over the words of this element.
+    The element is a line of text.  If the line is blank, note that, too.
+    Args:
+      element: the element being processed
+    Returns:
+      The processed element.
+    """
+    return re.findall(r'[\w\']+', element, re.UNICODE)
+
+
+def run(input_path, output_path, pipeline_args):
+  pipeline_options = PipelineOptions(pipeline_args)
+
+  with beam.Pipeline(options=pipeline_options) as p:
+    lines = p | 'Read' >> ReadFromText(input_path).with_output_types(str)
+    words = lines | 'Split' >> (beam.ParDo(WordExtractingDoFn()).with_output_types(str))
+
+    java_output = (
+        words
+        | 'JavaCount' >> beam.ExternalTransform(
+              'my.beam.transform.javacount',
+              None,
+              "localhost:12345"))
+
+    def format(kv):
+      key, value = kv
+      return '%s:%s' % (key, value)
+
+    output = java_output | 'Format' >> beam.Map(format)
+    output | 'Write' >> WriteToText(output_path)
+
+
+if __name__ == '__main__':
+  logging.getLogger().setLevel(logging.INFO)
+  import argparse
+
+  parser = argparse.ArgumentParser()
+  parser.add_argument(
+      '--input',
+      dest='input',
+      required=True,
+      help='Input file')
+  parser.add_argument(
+      '--output',
+      dest='output',
+      required=True,
+      help='Output file')
+  known_args, pipeline_args = parser.parse_known_args()
+
+  run(
+      known_args.input,
+      known_args.output,
+      pipeline_args)
+```
+
+Run the program:
+```
+python javacount.py --runner DirectRunner --environment_type=DOCKER --input input1 --output output --sdk_harness_container_image_overrides ".*java.*,chamikaramj/beam_java11_sdk:latest"
+```
+{{end}}

Review Comment:
   Done



##########
learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md:
##########
@@ -0,0 +1,78 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### SQL transform
+
+In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline.
+
+Apache Beam **SQL** is built on top of Apache Calcite, an open-source SQL parser and optimizer framework.
+
+To use **SQL** transforms in Apache Beam, you'll need to perform the following steps:
+
+Create a `PCollection` of rows. In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.

Review Comment:
   Done



##########
learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md:
##########
@@ -0,0 +1,78 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### SQL transform
+
+In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline.
+
+Apache Beam **SQL** is built on top of Apache Calcite, an open-source SQL parser and optimizer framework.
+
+To use **SQL** transforms in Apache Beam, you'll need to perform the following steps:
+
+Create a `PCollection` of rows. In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.
+
+Apply the SQL transform. Once you have your `PCollection<Row>`, you can apply a SQL transform using the `SqlTransform.query()` method. You'll need to provide the SQL query you want to execute on your data.
+
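To preview the semantics of the query used in the examples below, here is the same `SELECT` run against an in-memory SQLite table. sqlite3 stands in for Beam's Calcite-based engine, so this shows only what the query computes, not the Beam API:

```python
import sqlite3

# The same sample records as in the Beam examples below.
rows = [(1, 'Alice'), (2, 'Bob'), (101, 'Carol'), (102, 'David')]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE PCOLLECTION (id INTEGER, name TEXT)')
conn.executemany('INSERT INTO PCOLLECTION VALUES (?, ?)', rows)

# Identical query text to the SqlTransform examples.
result = conn.execute('SELECT id, name FROM PCOLLECTION WHERE id > 100').fetchall()
print(result)  # [(101, 'Carol'), (102, 'David')]
```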
+{{if (eq .Sdk "java")}}
+Add the necessary dependencies to your project. For Java, you'll need to include the `beam-sdks-java-extensions-sql` dependency in your build configuration.
+
+```
+import org.apache.beam.sdk.extensions.sql.SqlTransform;
+import org.apache.beam.sdk.schemas.Schema;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.Row;
+
+// Define the schema for your data
+Schema schema = Schema.builder()
+        .addField("id", Schema.FieldType.INT32)
+        .addField("name", Schema.FieldType.STRING)
+        .build();
+
+// Assume input is a PCollection<Row> with the above schema
+PCollection<Row> input = ...;
+
+// Apply the SQL transform
+PCollection<Row> result = input.apply(
+        SqlTransform.query("SELECT id, name FROM PCOLLECTION WHERE id > 100"));
+```
+{{end}}
+
+{{if (eq .Sdk "python")}}
+```
+import apache_beam as beam
+from apache_beam.transforms.sql import SqlTransform
+
+# Define a sample input data as a list of dictionaries
+input_data = [
+    {'id': 1, 'name': 'Alice'},
+    {'id': 2, 'name': 'Bob'},
+    {'id': 101, 'name': 'Carol'},
+    {'id': 102, 'name': 'David'},
+]
+
+# Create a pipeline
+with beam.Pipeline() as p:
+    # Read input data and convert it to a PCollection of Rows
+    input_rows = (
+        p | 'Create input data' >> beam.Create(input_data)
+        | 'Convert to Rows' >> beam.Map(lambda x: beam.Row(id=int(x['id']), name=str(x['name'])))
+    )
+
+    # Apply the SQL transform
+    filtered_rows = input_rows | SqlTransform("SELECT id, name FROM PCOLLECTION WHERE id > 100")
+
+    # Print the results
+    filtered_rows | 'Print results' >> beam.Map(print)
+```
+{{end}}

Review Comment:
   Done





[GitHub] [beam] olehborysevych commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "olehborysevych (via GitHub)" <gi...@apache.org>.
olehborysevych commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1582478907

   Hey @lostluck, yes, the learning materials were approved by Kerry, and the recent changes are just minor fixes and added missing dependencies. It is already deployed to staging and tested, so the answer is yes, it can be merged. We asked Pablo to merge, but it would be great if you could do it, since there are a lot of PRs on Pablo now.




[GitHub] [beam] lostluck merged pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "lostluck (via GitHub)" <gi...@apache.org>.
lostluck merged PR #26227:
URL: https://github.com/apache/beam/pull/26227




[GitHub] [beam] kerrydc commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "kerrydc (via GitHub)" <gi...@apache.org>.
kerrydc commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1536404477

   LGTM




[GitHub] [beam] lostluck commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "lostluck (via GitHub)" <gi...@apache.org>.
lostluck commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1583505662

   Or not, because now it's saying something about conflicts.




[GitHub] [beam] github-actions[bot] commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1504394858

   Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment `assign set of reviewers`




[GitHub] [beam] lostluck commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "lostluck (via GitHub)" <gi...@apache.org>.
lostluck commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1584503349

   Merged! Thanks.




[GitHub] [beam] sirenbyte commented on a diff in pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "sirenbyte (via GitHub)" <gi...@apache.org>.
sirenbyte commented on code in PR #26227:
URL: https://github.com/apache/beam/pull/26227#discussion_r1185032956


##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi-language pipelines
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.

Review Comment:
   Done





[GitHub] [beam] lostluck commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "lostluck (via GitHub)" <gi...@apache.org>.
lostluck commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1581495441

   @olehborysevych @sirenbyte This looks like it's had several reviews before and some recent commits, is it ready to merge at this point? (he asks before taking a look at the code)
   




[GitHub] [beam] sirenbyte commented on a diff in pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "sirenbyte (via GitHub)" <gi...@apache.org>.
sirenbyte commented on code in PR #26227:
URL: https://github.com/apache/beam/pull/26227#discussion_r1185034234


##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi-language pipelines
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.

Review Comment:
   Done



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+
+Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+

Review Comment:
   Done



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+
+Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.
+
+Set up the dependencies: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java.
+
+Create a pipeline: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline.
+
+Use cross-language transforms: To execute transforms written in other languages, use the ExternalTransform class (in Python) or the External class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform.
+
+{{if (eq .Sdk "java")}}
+
+#### Start an expansion service
+
+When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK.

Review Comment:
   Done



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+
+Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.
+
+Set up the dependencies: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java.
+
+Create a pipeline: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline.
+
+Use cross-language transforms: To execute transforms written in other languages, use the ExternalTransform class (in Python) or the External class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform.
+
+{{if (eq .Sdk "java")}}
+
+#### Start an expansion service
+
+When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK.
+
+In the general case, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service.

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] alxp1982 commented on a diff in pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "alxp1982 (via GitHub)" <gi...@apache.org>.
alxp1982 commented on code in PR #26227:
URL: https://github.com/apache/beam/pull/26227#discussion_r1183144757


##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+
+Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.
+
+Set up the dependencies: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java.
+
+Create a pipeline: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline.
+
+Use cross-language transforms: To execute transforms written in other languages, use the `ExternalTransform` class (in Python) or the `External` class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform.
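The URN-based lookup behind cross-language transforms can be sketched in plain Python. This is a conceptual model only; `REGISTRY` and `expand` are hypothetical names, not Beam APIs:

```python
# Conceptual sketch only: models how an expansion service maps a transform's
# URN (a stable string identifier) to a builder for the underlying transform.
from typing import Callable, Dict

URN_JAVA_COUNT = "my.beam.transform.javacount"

def java_count_builder(config: dict) -> str:
    # A real expansion service would return an expanded transform graph;
    # here we return a plain description instead.
    return "JavaCount(config=%r)" % (config,)

REGISTRY: Dict[str, Callable[[dict], str]] = {URN_JAVA_COUNT: java_count_builder}

def expand(urn: str, config: dict) -> str:
    # The service resolves the URN and delegates to the registered builder.
    return REGISTRY[urn](config)
```

The calling SDK never sees the foreign-language implementation directly; it only references the URN and lets the expansion service resolve it.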
+
+{{if (eq .Sdk "java")}}
+
+#### Start an expansion service
+
+When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK.
+
+In the general case, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service.
+
+For example, to start the standard Python expansion service, `ExpansionServiceServicer`, follow these steps:
+
+* Activate a new Python virtual environment.
+
+* Install Apache Beam with the `gcp` and `dataframe` extras:
+```
+pip install 'apache-beam[gcp,dataframe]'
+```
+* Run the following command
+```
+python -m apache_beam.runners.portability.expansion_service_main -p <PORT> --fully_qualified_name_glob "*"
+```
+
+The command runs `expansion_service_main.py`, which starts the standard expansion service. When you use Gradle to run your Java pipeline, you can specify the expansion service with the `expansionService` option. For example: `--expansionService=localhost:<PORT>`.
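Because the Java job submission fails if the expansion service is not reachable, it can help to sanity-check the port before launching the pipeline. A small standard-library helper (hypothetical, not part of Beam) might look like:

```python
# Hypothetical helper, not part of Beam: returns True if something is
# accepting TCP connections at host:port, e.g. a running expansion service.
import socket

def service_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `service_listening("localhost", 9090)` should return `True` once the expansion service started with the command above is up on port 9090.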
+
+#### Run Java program
+
+```
+private String getModelLoaderScript() {
+        String s = "from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy\n";
+        s = s + "from apache_beam.ml.inference.base import KeyedModelHandler\n";
+        s = s + "def get_model_handler(model_uri):\n";
+        s = s + "  return KeyedModelHandler(SklearnModelHandlerNumpy(model_uri))\n";
+
+        return s;
+}
+
+void runExample(SklearnMnistClassificationOptions options, String expansionService) {
+        // Schema of the output PCollection Row type to be provided to the RunInference transform.
+        Schema schema =
+                Schema.of(
+                        Schema.Field.of("example", Schema.FieldType.array(Schema.FieldType.INT64)),
+                        Schema.Field.of("inference", FieldType.STRING));
+
+        Pipeline pipeline = Pipeline.create(options);
+        PCollection<KV<Long, Iterable<Long>>> col =
+                pipeline
+                        .apply(TextIO.read().from(options.getInput()))
+                        .apply(Filter.by(new FilterNonRecordsFn()))
+                        .apply(MapElements.via(new RecordsToLabeledPixelsFn()));
+
+        col.apply(RunInference.ofKVs(getModelLoaderScript(), schema, VarLongCoder.of())
+                                .withKwarg("model_uri", options.getModelPath())
+                                .withExpansionService(expansionService))
+                .apply(MapElements.via(new FormatOutput()))
+                .apply(TextIO.write().to("out.txt"));
+
+        pipeline.run().waitUntilFinish();
+}
+```
+
+> **Note**: The Python expansion service may not work on Windows. The **expansion service** and the **Java pipeline** must run in the same environment. If the **expansion service** runs in a **Docker** container, run the **Java pipeline** inside that container as well.
+
+#### Run in docker
+
+Run the **expansion service** and the **Java program** in Docker:
+```
+# Use an official Java base image
+FROM openjdk:11
+
+# Install Python 3 (package names assume a Debian-based base image)
+RUN apt-get update && \
+    apt-get install -y python3 python3-pip
+
+# Set the working directory
+WORKDIR /app
+
+# Copy Java and Python source files to the working directory
+COPY src /app/src
+COPY scripts /app/scripts
+
+# Install Python dependencies
+RUN pip install --upgrade pip && \
+    pip install apache-beam==2.46.0 && \
+    pip install --default-timeout=100 torch torchvision pandas scikit-learn
+
+# SPRING_VERSION must be supplied at build time, e.g. --build-arg SPRING_VERSION=<version>
+ARG SPRING_VERSION
+RUN wget https://repo1.maven.org/maven2/org/springframework/spring-expression/$SPRING_VERSION/spring-expression-$SPRING_VERSION.jar && \
+    mv spring-expression-$SPRING_VERSION.jar /opt/apache/beam/jars/spring-expression.jar
+
+# Compile the Java program
+RUN javac -d /app/bin /app/src/MyJavaProgram.java
+
+# Start the expansion service in the background first, then run the Java pipeline,
+# since the pipeline needs the service to be reachable when it expands transforms
+ENTRYPOINT ["bash", "-c", "python -m apache_beam.runners.portability.expansion_service_main -p 9090 --fully_qualified_name_glob=* & sleep 5 && java -cp /app/bin MyJavaProgram"]
+```
+{{end}}
+
+
+{{if (eq .Sdk "python")}}
+
+
+Write your own Java expansion service:
+```
+import com.google.auto.service.AutoService;
+import com.google.common.collect.ImmutableMap;
+import org.apache.beam.sdk.expansion.ExternalTransformRegistrar;
+import org.apache.beam.sdk.transforms.Count;
+import org.apache.beam.sdk.transforms.ExternalTransformBuilder;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+import java.util.Map;
+
+public class Task {
+    static class JavaCount extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
+
+        public JavaCount() {
+        }
+
+        @Override
+        public PCollection<KV<String, Long>> expand(PCollection<String> input) {
+            return input
+                    .apply(
+                            "JavaCount", Count.perElement());
+        }
+    }
+
+    static class JavaCountBuilder implements
+            ExternalTransformBuilder<JavaCountConfiguration, PCollection<String>, PCollection<KV<String, Long>>> {
+
+        @Override
+        public PTransform<PCollection<String>, PCollection<KV<String, Long>>> buildExternal(
+                JavaCountConfiguration configuration) {
+            return new JavaCount();
+        }
+    }
+
+    static class JavaCountConfiguration {
+
+    }
+
+    @AutoService(ExternalTransformRegistrar.class)
+    public static class JavaCountRegistrar implements ExternalTransformRegistrar {
+
+        final String URN = "my.beam.transform.javacount";
+
+        @Override
+        public Map<String, ExternalTransformBuilder<?, ?, ?>> knownBuilderInstances() {
+            return ImmutableMap.of(URN, new JavaCountBuilder());
+        }
+    }
+}
+```
+
+Build the jar and start the expansion service on port 12345:
+```
+mvn clean
+mvn package -DskipTests
+cd target
+java -jar java-count-bundled-0.1.jar 12345
+```
+
+Python pipeline:
+```
+import logging
+import re
+import typing
+
+import apache_beam as beam
+from apache_beam.io import ReadFromText
+from apache_beam.io import WriteToText
+from apache_beam.options.pipeline_options import PipelineOptions
+
+
+class WordExtractingDoFn(beam.DoFn):
+  """Parse each line of input text into words."""
+  def process(self, element):
+    """Returns an iterator over the words of this element.
+    The element is a line of text.  If the line is blank, note that, too.
+    Args:
+      element: the element being processed
+    Returns:
+      The processed element.
+    """
+    return re.findall(r'[\w\']+', element, re.UNICODE)
+
+
+def run(input_path, output_path, pipeline_args):
+  pipeline_options = PipelineOptions(pipeline_args)
+
+  with beam.Pipeline(options=pipeline_options) as p:
+    lines = p | 'Read' >> ReadFromText(input_path).with_output_types(str)
+    words = lines | 'Split' >> (beam.ParDo(WordExtractingDoFn()).with_output_types(str))
+
+    java_output = (
+        words
+        | 'JavaCount' >> beam.ExternalTransform(
+              'my.beam.transform.javacount',
+              None,
+              "localhost:12345"))
+
+    def format(kv):
+      key, value = kv
+      return '%s:%s' % (key, value)
+
+    output = java_output | 'Format' >> beam.Map(format)
+    output | 'Write' >> WriteToText(output_path)
+
+
+if __name__ == '__main__':
+  logging.getLogger().setLevel(logging.INFO)
+  import argparse
+
+  parser = argparse.ArgumentParser()
+  parser.add_argument(
+      '--input',
+      dest='input',
+      required=True,
+      help='Input file')
+  parser.add_argument(
+      '--output',
+      dest='output',
+      required=True,
+      help='Output file')
+  known_args, pipeline_args = parser.parse_known_args()
+
+  run(
+      known_args.input,
+      known_args.output,
+      pipeline_args)
+```
+
+Run the program:
+```
+python javacount.py --runner DirectRunner --environment_type=DOCKER --input input1 --output output --sdk_harness_container_image_overrides ".*java.*,chamikaramj/beam_java11_sdk:latest"
+```
+{{end}}

Review Comment:
   Runnable example and challenge is missing



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.

Review Comment:
   Apache Beam supports multiple programming languages, including Java, Python, and Go. With multi-language support, you can use a combination of supported languages in a single pipeline. This makes it possible to use the language that best suits your needs for each part of the pipeline.
   



##########
learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md:
##########
@@ -0,0 +1,78 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### SQL transform
+
+In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline.
+
+Apache Beam **SQL** is built on top of Calcite, an open-source SQL parser and optimizer framework.
+
+To use **SQL** transforms in Apache Beam, you'll need to perform the following steps:
+
+Create a `PCollection` of rows. In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.

Review Comment:
   Create a `PCollection` of rows. Apache Beam performs **SQL** operations on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.

Review Comment:
   Should be a list? 



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+Define your pipeline using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+
+Use Apache Beam's language-specific SDKs to implement the data processing steps in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+
+Use Apache Beam's cross-language support to connect the different parts of your pipeline together. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+Choose your SDKs: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.
+
+Set up the dependencies: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java.
+
+Create a pipeline: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline.
+
+Use cross-language transforms: To execute transforms written in other languages, use the ExternalTransform class (in Python) or the External class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform.
+
+{{if (eq .Sdk "java")}}
+
+#### Start an expansion service
+
+When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK.

Review Comment:
   When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. Therefore, you must have at least one expansion service per remote SDK.



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+1. **Define your pipeline** using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+2. **Implement the data processing steps** using Apache Beam's language-specific SDKs in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+3. **Connect the different parts of your pipeline** using Apache Beam's cross-language support. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+1. **Choose your SDKs**: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.

Review Comment:
   Should be a list in the markdown outlining each task?



##########
learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md:
##########
@@ -0,0 +1,78 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### SQL transform
+
+In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline.
+
+Apache Beam **SQL** is built on top of Apache Calcite, an open-source SQL parser and optimizer framework.
+
+To use **SQL** transforms in Apache Beam, you'll need to perform the following steps:
+
+1. **Create a `PCollection` of rows.** In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.

Review Comment:
   List? 



##########
learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md:
##########
@@ -0,0 +1,78 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### SQL transform
+
+In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline.
+
+Apache Beam **SQL** is built on top of Apache Calcite, an open-source SQL parser and optimizer framework.
+
+To use **SQL** transforms in Apache Beam, you'll need to perform the following steps:
+
+1. **Create a `PCollection` of rows.** In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.
+
+2. **Apply the SQL transform.** Once you have your `PCollection<Row>`, you can apply a SQL transform using the `SqlTransform.query()` method. You'll need to provide the SQL query you want to execute on your data.
+
+{{if (eq .Sdk "java")}}
+Add the necessary dependencies to your project. For Java, you'll need to include the `beam-sdks-java-extensions-sql` dependency in your build configuration.
+
+```
+import org.apache.beam.sdk.extensions.sql.SqlTransform;
+import org.apache.beam.sdk.schemas.Schema;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.Row;
+
+// Define the schema for your data
+Schema schema = Schema.builder()
+        .addField("id", Schema.FieldType.INT32)
+        .addField("name", Schema.FieldType.STRING)
+        .build();
+
+// Assume input is a PCollection<Row> with the above schema
+PCollection<Row> input = ...;
+
+// Apply the SQL transform
+PCollection<Row> result = input.apply(
+        SqlTransform.query("SELECT id, name FROM PCOLLECTION WHERE id > 100"));
+```
+{{end}}
+
+{{if (eq .Sdk "python")}}
+```
+import apache_beam as beam
+from apache_beam.transforms.sql import SqlTransform
+
+# Define a sample input data as a list of dictionaries
+input_data = [
+    {'id': 1, 'name': 'Alice'},
+    {'id': 2, 'name': 'Bob'},
+    {'id': 101, 'name': 'Carol'},
+    {'id': 102, 'name': 'David'},
+]
+
+# Create a pipeline
+with beam.Pipeline() as p:
+    # Read input data and convert it to a PCollection of Rows
+    input_rows = (
+        p | 'Create input data' >> beam.Create(input_data)
+        | 'Convert to Rows' >> beam.Map(lambda x: beam.Row(id=int(x['id']), name=str(x['name'])))
+    )
+
+    # Apply the SQL transform
+    filtered_rows = input_rows | SqlTransform("SELECT id, name FROM PCOLLECTION WHERE id > 100")
+
+    # Print the results
+    filtered_rows | 'Print results' >> beam.Map(print)
+```
+{{end}}

Review Comment:
   Runnable example description and challenge?



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi-language pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+1. **Define your pipeline** using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+2. **Implement the data processing steps** using Apache Beam's language-specific SDKs in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+3. **Connect the different parts of your pipeline** using Apache Beam's cross-language support. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+
+To create a multi-language pipeline in Apache Beam, follow these steps:
+
+1. **Choose your SDKs**: First, decide which programming languages and corresponding SDKs you'd like to use. Apache Beam currently supports Python, Java, and Go SDKs.
+
+2. **Set up the dependencies**: Make sure you have installed the necessary dependencies for each language. For instance, you'll need the Beam Python SDK for Python or the Beam Java SDK for Java.
+
+3. **Create a pipeline**: Using the primary language of your choice, create a pipeline object using the respective SDK. This pipeline will serve as the main entry point for your multi-language pipeline.
+
+4. **Use cross-language transforms**: To execute transforms written in other languages, use the `ExternalTransform` class (in Python) or the `External` class (in Java). This allows you to use a transform written in another language as if it were a native transform in your main pipeline. You'll need to provide the appropriate expansion service address for the language of the transform.
+
+{{if (eq .Sdk "java")}}
+
+#### Start an expansion service
+
+When building a job for a multi-language pipeline, Beam uses an expansion service to expand composite transforms. You must have at least one expansion service per remote SDK.
+
+In the general case, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service.

Review Comment:
   In most cases, if you have a supported version of Python installed on your system, you can let `PythonExternalTransform` handle the details of creating and starting up the expansion service. But if you want to customize the environment or use transforms not available in the default Beam SDK, you might need to run your own expansion service.



##########
learning/tour-of-beam/learning-content/cross-language/multi-pipeline/description.md:
##########
@@ -0,0 +1,280 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### Multi-language pipeline
+
+Apache Beam is a popular open-source platform for building batch and streaming data processing pipelines. One of the key features of Apache Beam is its ability to support multi-language pipelines. With Apache Beam, you can write different parts of your pipeline in different programming languages, and they can all work together seamlessly.
+
+Apache Beam supports multiple programming languages, including Java, Python, and Go. This makes it possible to use the language that best suits your needs for each part of your pipeline.
+
+To build a multi-language pipeline with Apache Beam, you can use the following approach:
+
+1. **Define your pipeline** using Apache Beam's SDK in your preferred programming language. This defines the data processing steps that need to be executed.
+2. **Implement the data processing steps** using Apache Beam's language-specific SDKs in the appropriate programming languages. For example, you could use Java to process some data, Python to process some other data, and Go to perform a specific computation.
+3. **Connect the different parts of your pipeline** using Apache Beam's cross-language support. Apache Beam provides a common data model and serialization format, so data can be passed seamlessly between different languages.
+
+By using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages, while still building a unified data processing pipeline. This can be especially useful when working with large datasets, as different languages may have different performance characteristics for different tasks.
+

Review Comment:
   Use Apache Beam's cross-language support to connect the different parts of your pipeline. Apache Beam provides a standard data model and serialization format so that data can be passed seamlessly between different languages.
   		
   
   		Using Apache Beam's multi-language support, you can take advantage of the strengths of different programming languages while building a unified data processing pipeline. This can be especially useful when working with large datasets and re-using existing components. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] sirenbyte commented on a diff in pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "sirenbyte (via GitHub)" <gi...@apache.org>.
sirenbyte commented on code in PR #26227:
URL: https://github.com/apache/beam/pull/26227#discussion_r1185034569


##########
learning/tour-of-beam/learning-content/cross-language/sql-transform/description.md:
##########
@@ -0,0 +1,78 @@
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+### SQL transform
+
+In Apache Beam, you can use **SQL** transforms to query and manipulate your data within the pipeline using SQL-like syntax. This is particularly useful if you're already familiar with SQL and want to leverage that knowledge while processing data in your Apache Beam pipeline.
+
+Apache Beam **SQL** is built on top of Apache Calcite, an open-source SQL parser and optimizer framework.
+
+To use **SQL** transforms in Apache Beam, you'll need to perform the following steps:
+
+1. **Create a `PCollection` of rows.** In Apache Beam, **SQL** operations are performed on `PCollection<Row>` objects. You'll need to convert your data into a `PCollection` of rows, which requires defining a schema that describes the structure of your data.

Review Comment:
   Done





[GitHub] [beam] github-actions[bot] commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1580739883

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @lostluck added as fallback since no labels match configuration
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).




[GitHub] [beam] olehborysevych commented on pull request #26227: [Tour of Beam] Learning content for "Cross-language Transforms" module

Posted by "olehborysevych (via GitHub)" <gi...@apache.org>.
olehborysevych commented on PR #26227:
URL: https://github.com/apache/beam/pull/26227#issuecomment-1584465496

   Hey @lostluck I've resolved all the conflicts

