Posted to commits@beam.apache.org by ch...@apache.org on 2022/10/21 03:47:08 UTC
[beam] branch master updated: Adds instructions for running the Multi-language Java quickstart from released Beam (#23721)
This is an automated email from the ASF dual-hosted git repository.
chamikara pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 69fe1cc86f0 Adds instructions for running the Multi-language Java quickstart from released Beam (#23721)
69fe1cc86f0 is described below
commit 69fe1cc86f0355e06db344a37dc4c69748ae61ca
Author: Chamikara Jayalath <ch...@gmail.com>
AuthorDate: Thu Oct 20 20:47:02 2022 -0700
Adds instructions for running the Multi-language Java quickstart from released Beam (#23721)
* Adds instructions for running the Multi-language Java quickstart from released Beam
* Fix dependencies
* Addressing reviewer comments
---
.../multilanguage/PythonDataframeWordCount.java | 0
examples/multi-language/README.md | 6 ++
examples/multi-language/build.gradle | 1 -
.../sdks/java-multi-language-pipelines.md | 79 ++++++++++++++++++----
4 files changed, 71 insertions(+), 15 deletions(-)
diff --git a/examples/multi-language/src/main/java/org/apache/beam/examples/multilanguage/PythonDataframeWordCount.java b/examples/java/src/main/java/org/apache/beam/examples/multilanguage/PythonDataframeWordCount.java
similarity index 100%
rename from examples/multi-language/src/main/java/org/apache/beam/examples/multilanguage/PythonDataframeWordCount.java
rename to examples/java/src/main/java/org/apache/beam/examples/multilanguage/PythonDataframeWordCount.java
diff --git a/examples/multi-language/README.md b/examples/multi-language/README.md
index 127ab8c30eb..dea314095b4 100644
--- a/examples/multi-language/README.md
+++ b/examples/multi-language/README.md
@@ -166,3 +166,9 @@ of the digit. The second item is the predicted label of the digit.
```
gsutil cat gs://$GCP_BUCKET/multi-language-beam/output*
```
+
+### Python Dataframe Wordcount
+
+This example is covered in the [Java multi-language pipelines quickstart](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/).
+The pipeline source code is available at
+[PythonDataframeWordCount.java](https://github.com/apache/beam/tree/master/examples/java/src/main/java/org/apache/beam/examples/multilanguage/PythonDataframeWordCount.java).
diff --git a/examples/multi-language/build.gradle b/examples/multi-language/build.gradle
index 61fdb686f4e..b266faeb8f1 100644
--- a/examples/multi-language/build.gradle
+++ b/examples/multi-language/build.gradle
@@ -40,7 +40,6 @@ dependencies {
runtimeOnly project(path: ":runners:portability:java")
implementation library.java.vendored_guava_26_0_jre
implementation project(":sdks:java:expansion-service")
- implementation project(":sdks:java:extensions:python")
permitUnusedDeclared project(":sdks:java:expansion-service") // BEAM-11761
}
diff --git a/website/www/site/content/en/documentation/sdks/java-multi-language-pipelines.md b/website/www/site/content/en/documentation/sdks/java-multi-language-pipelines.md
index 5f1b971f204..4b260e57973 100644
--- a/website/www/site/content/en/documentation/sdks/java-multi-language-pipelines.md
+++ b/website/www/site/content/en/documentation/sdks/java-multi-language-pipelines.md
@@ -138,26 +138,27 @@ default Beam SDK, you might need to run your own expansion service. In such
cases, [start the expansion service](#advanced-start-an-expansion-service)
before running your pipeline.
-Here we've provided commands for running the example pipeline using
-Gradle on a [Beam HEAD Git clone](https://github.com/apache/beam).
-If you need a more stable environment, please
-[setup a Java project](/get-started/quickstart-java/) that uses the latest
-released Beam version and include the necessary dependencies.
+### Run with Dataflow runner at HEAD (Beam 2.41.0 and later)
-### Run with Dataflow runner
+> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
+> Beam 2.42.0 requires manually starting up an expansion service (see
+> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
+> and using the additional pipeline option `--expansionService=localhost:<PORT>`
+> when executing the pipeline.
The following script runs the example multi-language pipeline on Dataflow, using
example text from a Cloud Storage bucket. You’ll need to adapt the script to
your environment.
```
+export GCP_PROJECT=<project>
export OUTPUT_BUCKET=<bucket>
export GCP_REGION=<region>
export TEMP_LOCATION=gs://$OUTPUT_BUCKET/tmp
-export PYTHON_VERSION=<version>
./gradlew :examples:multi-language:pythonDataframeWordCount --args=" \
--runner=DataflowRunner \
+--project=$GCP_PROJECT \
--output=gs://${OUTPUT_BUCKET}/count \
--region=${GCP_REGION}"
```
@@ -192,10 +193,15 @@ python -m apache_beam.runners.portability.local_job_service_main -p $JOB_SERVER_
5. Run the pipeline.
+> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
+> Beam 2.42.0 requires manually starting up an expansion service (see
+> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
+> and using the additional pipeline option `--expansionService=localhost:<PORT>`
+> when executing the pipeline.
+
```
export JOB_SERVER_PORT=<port> # Same port as before
export OUTPUT_FILE=<local relative path>
-export PYTHON_VERSION=<version>
./gradlew :examples:multi-language:pythonDataframeWordCount --args=" \
--runner=PortableRunner \
@@ -226,19 +232,64 @@ For example, to start the standard expansion service for a Python transform,
[ExpansionServiceServicer](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service.py),
follow these steps:
-1. Activate a Python virtual environment and install Apache Beam, as described
- in the [Python quick start](/get-started/quickstart-py/).
-2. In the **beam/sdks/python** directory of the Beam source code, run the
- following command:
+1. Activate a new virtual environment following
+[these instructions](https://beam.apache.org/get-started/quickstart-py/#create-and-activate-a-virtual-environment).
+
+2. Install Apache Beam with the `gcp` and `dataframe` extras.
+
+```
+pip install 'apache-beam[gcp,dataframe]'
+```
+
+3. Run the following command:
```
- python apache_beam/runners/portability/expansion_service_main.py -p 18089 --fully_qualified_name_glob "*"
+ python -m apache_beam.runners.portability.expansion_service_main -p <PORT> --fully_qualified_name_glob "*"
```
The command runs
[expansion_service_main.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/expansion_service_main.py), which starts the standard expansion service. When you use
Gradle to run your Java pipeline, you can specify the expansion service with the
-`expansionService` option. For example: `--expansionService=localhost:18089`.
+`expansionService` option. For example: `--expansionService=localhost:<PORT>`.
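
As a minimal sketch of how the `--expansionService=localhost:<PORT>` option could be assembled in a script, the helper below validates a port number and prints the option. The function name `expansion_service_arg` is hypothetical and is not part of Beam; it only formats the flag described above and does not check that a service is actually listening.

```shell
# Hypothetical helper (not part of Beam): prints the --expansionService option
# for an expansion service assumed to be running on localhost.
expansion_service_arg() {
  # Reject empty input, non-digits, and anything longer than five digits.
  case "$1" in
    ''|*[!0-9]*|??????*)
      echo "invalid port: $1" >&2
      return 1
      ;;
  esac
  # Reject values outside the valid TCP port range.
  if [ "$1" -lt 1 ] || [ "$1" -gt 65535 ]; then
    echo "port out of range: $1" >&2
    return 1
  fi
  printf '%s\n' "--expansionService=localhost:$1"
}
```

The output can then be appended to the pipeline arguments, for example `--args="--runner=DataflowRunner $(expansion_service_arg 18089)"`, where 18089 is only an example port.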
+
+### Run with Dataflow runner using a Beam release (Beam 2.43.0 and later)
+
+> **Note:** Due to [issue#23717](https://github.com/apache/beam/issues/23717),
+> Beam 2.42.0 requires manually starting up an expansion service (see
+> [these instructions](https://beam.apache.org/documentation/sdks/java-multi-language-pipelines/#advanced-start-an-expansion-service))
+> and using the additional pipeline option `--expansionService=localhost:<PORT>`
+> when executing the pipeline.
+
+* Generate a project from the Beam examples Maven archetype for the relevant Beam version.
+
+```
+export BEAM_VERSION=<Beam version>
+
+mvn archetype:generate \
+ -DarchetypeGroupId=org.apache.beam \
+ -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
+ -DarchetypeVersion=$BEAM_VERSION \
+ -DgroupId=org.example \
+ -DartifactId=multi-language-beam \
+ -Dversion="0.1" \
+ -Dpackage=org.apache.beam.examples \
+ -DinteractiveMode=false
+```
+
+* Run the pipeline.
+
+```
+export GCP_PROJECT=<GCP project>
+export GCP_BUCKET=<GCP bucket>
+export GCP_REGION=<GCP region>
+
+mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.multilanguage.PythonDataframeWordCount \
+ -Dexec.args="--runner=DataflowRunner --project=$GCP_PROJECT \
+ --region=$GCP_REGION \
+ --gcpTempLocation=gs://$GCP_BUCKET/multi-language-beam/tmp \
+ --output=gs://$GCP_BUCKET/multi-language-beam/output" \
+ -Pdataflow-runner
+```
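
The run commands above expand several exported variables, and an unset one silently produces a malformed argument such as `gs:///multi-language-beam/tmp`. A minimal pre-flight sketch, assuming a POSIX shell, is shown below. The function name `require_vars` is hypothetical and not part of Beam or Maven.

```shell
# Hypothetical pre-flight check (not part of Beam): returns non-zero and
# reports each named variable that is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_vars GCP_PROJECT GCP_BUCKET GCP_REGION || exit 1` before the `mvn compile exec:java` command stops the script early instead of submitting a job with empty options.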
## Next steps