Posted to commits@beam.apache.org by da...@apache.org on 2024/03/27 17:00:57 UTC

(beam) branch master updated: Duet AI Prompt: Test Your Pipeline (#30738)

This is an automated email from the ASF dual-hosted git repository.

damccorm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 45444209c77 Duet AI Prompt: Test Your Pipeline  (#30738)
45444209c77 is described below

commit 45444209c776460ba67ac14a1c33e0775485266a
Author: Daria Bezkorovaina <99...@users.noreply.github.com>
AuthorDate: Wed Mar 27 17:00:50 2024 +0000

    Duet AI Prompt: Test Your Pipeline  (#30738)
    
    * Rename 33_coders_data_encoding.md to 34_coders_data_encoding.md
    
    Fix numeration
    
    * Rename 34_coders_data_encoding.md to 35_coders_data_encoding.md
    
    numeration
    
    * Rename 34_ml_data_preprocessing.md to 36_ml_data_preprocessing.md
    
    * Rename 35_ml_data_exploration.md to 37_ml_data_exploration.md
    
    * Rename 36_ml_data_preprocessing.md to 38_ml_data_preprocessing.md
    
    * Rename 37_ml_data_exploration.md to 48_ml_data_exploration.md
    
    * Rename 38_ml_data_preprocessing.md to 49_ml_data_preprocessing.md
    
    * Rename 47_batching_for_external_calls.md to 44_batching_for_external_calls.md
    
    * Rename 47_mltransform.md to 45_mltransform.md
    
    * Rename 48_ml_data_exploration.md to 46_ml_data_exploration.md
    
    * Rename 48_resource_hints.md to 47_resource_hints.md
    
    * Rename 49_ml_data_preprocessing.md to 51_ml_data_preprocessing.md
    
    * Rename 49_transform_service.md to 51_transform_service.md
    
    * Rename 51_transform_service.md to 52_transform_service.md
    
    * Rename 44_dataflow_runner.md to 53_dataflow_runner.md
    
    * Rename 45_spark_runner.md to 54_spark_runner.md
    
    * Rename 46_direct_runner.md to 55_direct_runner.md
    
    * Rename 49_transform_service.md to 52_transform_service.md
    
    * Create 56_unit_test_your_pipeline.md
---
 ...data_encoding.md => 35_coders_data_encoding.md} |   0
 ..._calls.md => 44_batching_for_external_calls.md} |   0
 .../{47_mltransform.md => 45_mltransform.md}       |   0
 ...ta_exploration.md => 46_ml_data_exploration.md} |   0
 .../{48_resource_hints.md => 47_resource_hints.md} |   0
 ...reprocessing.md => 51_ml_data_preprocessing.md} |   0
 ...ransform_service.md => 52_transform_service.md} |   0
 .../56_unit_test_your_pipeline.md                  | 148 +++++++++++++++++++++
 ...ransform_service.md => 52_transform_service.md} |   0
 ...44_dataflow_runner.md => 53_dataflow_runner.md} |   0
 .../{45_spark_runner.md => 54_spark_runner.md}     |   0
 .../{46_direct_runner.md => 55_direct_runner.md}   |   0
 12 files changed, 148 insertions(+)

diff --git a/learning/prompts/documentation-lookup-nolinks/33_coders_data_encoding.md b/learning/prompts/documentation-lookup-nolinks/35_coders_data_encoding.md
similarity index 100%
rename from learning/prompts/documentation-lookup-nolinks/33_coders_data_encoding.md
rename to learning/prompts/documentation-lookup-nolinks/35_coders_data_encoding.md
diff --git a/learning/prompts/documentation-lookup-nolinks/47_batching_for_external_calls.md b/learning/prompts/documentation-lookup-nolinks/44_batching_for_external_calls.md
similarity index 100%
rename from learning/prompts/documentation-lookup-nolinks/47_batching_for_external_calls.md
rename to learning/prompts/documentation-lookup-nolinks/44_batching_for_external_calls.md
diff --git a/learning/prompts/documentation-lookup-nolinks/47_mltransform.md b/learning/prompts/documentation-lookup-nolinks/45_mltransform.md
similarity index 100%
rename from learning/prompts/documentation-lookup-nolinks/47_mltransform.md
rename to learning/prompts/documentation-lookup-nolinks/45_mltransform.md
diff --git a/learning/prompts/documentation-lookup-nolinks/35_ml_data_exploration.md b/learning/prompts/documentation-lookup-nolinks/46_ml_data_exploration.md
similarity index 100%
rename from learning/prompts/documentation-lookup-nolinks/35_ml_data_exploration.md
rename to learning/prompts/documentation-lookup-nolinks/46_ml_data_exploration.md
diff --git a/learning/prompts/documentation-lookup-nolinks/48_resource_hints.md b/learning/prompts/documentation-lookup-nolinks/47_resource_hints.md
similarity index 100%
rename from learning/prompts/documentation-lookup-nolinks/48_resource_hints.md
rename to learning/prompts/documentation-lookup-nolinks/47_resource_hints.md
diff --git a/learning/prompts/documentation-lookup-nolinks/34_ml_data_preprocessing.md b/learning/prompts/documentation-lookup-nolinks/51_ml_data_preprocessing.md
similarity index 100%
rename from learning/prompts/documentation-lookup-nolinks/34_ml_data_preprocessing.md
rename to learning/prompts/documentation-lookup-nolinks/51_ml_data_preprocessing.md
diff --git a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md b/learning/prompts/documentation-lookup-nolinks/52_transform_service.md
similarity index 100%
rename from learning/prompts/documentation-lookup-nolinks/49_transform_service.md
rename to learning/prompts/documentation-lookup-nolinks/52_transform_service.md
diff --git a/learning/prompts/documentation-lookup-nolinks/56_unit_test_your_pipeline.md b/learning/prompts/documentation-lookup-nolinks/56_unit_test_your_pipeline.md
new file mode 100644
index 00000000000..94dcfd09c80
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/56_unit_test_your_pipeline.md
@@ -0,0 +1,148 @@
+Prompt:
+How can I unit-test my Apache Beam pipeline?
+
+Response:
+Before deploying your Apache Beam pipeline on your chosen runner, unit-test it locally. Local testing helps you find and fix bugs in your pipeline code while using your preferred debugging tools.
+
+The Apache Beam Direct Runner is particularly valuable for local testing and development. Once you've tested your pipeline with the Direct Runner, you can seamlessly transition to testing it on a small scale using your preferred runner, such as the Flink runner with either a local or remote Flink cluster.
+
+The Apache Beam SDKs offer three levels of unit testing for pipeline code:
+* ***Testing individual functions***: you can test each function used in your pipeline independently.
+* ***Testing an entire transform***: this involves testing a transform as a cohesive unit.
+* ***Performing an end-to-end test***: this comprehensive test evaluates the entire pipeline.
+
+Both the Java and Python SDKs provide numerous test classes within their testing packages. The following examples demonstrate the most commonly used test classes in Java and Python.
+
+***Testing a Transform***
+
+1. Create a `TestPipeline`. The `TestPipeline` class, included in the Beam SDKs, is specifically designed for testing transforms. Replace `Pipeline` with `TestPipeline` when creating the pipeline object for tests. Unlike `Pipeline.create`, `TestPipeline.create` handles the setting of `PipelineOptions` internally.
+
+Java:
+
+```java
+Pipeline p = TestPipeline.create();
+```
+
+Python:
+
+```python
+with TestPipeline() as p:
+    …
+```
+
+2. Create static test input data.
+3. Use the `Create` transform. You can use this transform to create a `PCollection` of your input data from a standard in-memory collection, such as a Java `List` or a Python `list`.
+4. Apply your transform. Apply your transform to the input `PCollection` and store the resulting output `PCollection`.
+5. Verify output using `PAssert` (Java) or `assert_that` (Python). These assertion utilities verify that the output `PCollection` contains the expected elements.
+
+Java:
+
+```java
+PCollection<String> output = ...;
+
+// Check whether a PCollection contains some elements in any order.
+PAssert.that(output)
+    .containsInAnyOrder(
+        "elem1",
+        "elem3",
+        "elem2");
+```
+
+Python:
+
+```python
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+output = ...
+
+# Check whether a PCollection contains some elements in any order.
+assert_that(
+    output,
+    equal_to(["elem1", "elem3", "elem2"]))
+```
+
+***Testing a Pipeline End-to-End***
+
+To test an entire pipeline end-to-end:
+* Create static test input data for each source of input data.
+* Prepare static test output data matching the expected final output `PCollection`.
+* Use `TestPipeline` instead of `Pipeline.create`.
+* Replace the pipeline’s `Read` transforms with the `Create` transform to generate `PCollection` objects from static input data.
+* Apply the pipeline’s transforms.
+* Replace the pipeline’s `Write` transforms with `PAssert` (Java) or `assert_that` (Python) to verify that the final `PCollection` objects match the expected values in static output data.
+
+The following examples demonstrate how to test the WordCount example pipeline in Java and Python using these steps.
+
+Java:
+
+```java
+public class WordCountTest {
+    // Static input data for the initial PCollection.
+    static final String[] WORDS_ARRAY = new String[] {
+      "hi there", "hi", "hi sue bob",
+      "hi sue", "", "bob hi"};
+
+    static final List<String> WORDS = Arrays.asList(WORDS_ARRAY);
+
+    // Static output data, expected to match the final PCollection.
+    static final String[] COUNTS_ARRAY = new String[] {
+        "hi: 5", "there: 1", "sue: 2", "bob: 2"};
+
+    // Example test for pipeline's transforms.
+    public void testCountWords() throws Exception {
+      Pipeline p = TestPipeline.create();
+
+      // Create a PCollection from the static input data.
+      PCollection<String> input = p.apply(Create.of(WORDS));
+
+      // Run ALL the pipeline's transforms.
+      PCollection<String> output = input.apply(new CountWords());
+
+      // Assert that the output matches the known static output data.
+      PAssert.that(output).containsInAnyOrder(COUNTS_ARRAY);
+
+      // Execute the pipeline.
+      p.run();
+    }
+}
+```
+
+Python:
+
+```python
+import unittest
+import apache_beam as beam
+from apache_beam.testing.test_pipeline import TestPipeline
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+class CountWords(beam.PTransform):
+    """CountWords transform omitted for conciseness."""
+
+class WordCountTest(unittest.TestCase):
+  # Input data for the initial PCollection.
+  WORDS = [
+      "hi there", "hi", "hi sue bob",
+      "hi sue", "", "bob hi"
+  ]
+
+  # Expected output data to match the final PCollection.
+  EXPECTED_COUNTS = ["hi: 5", "there: 1", "sue: 2", "bob: 2"]
+
+  # Example test for pipeline's transforms.
+  def test_count_words(self):
+    with TestPipeline() as p:
+      # Create a PCollection from the static input data.
+      input = p | beam.Create(self.WORDS)
+
+      # Run ALL the pipeline's transforms.
+      output = input | CountWords()
+
+      # Assert that the output matches the expected data.
+      assert_that(output, equal_to(self.EXPECTED_COUNTS), label='CheckOutput')
+
+    # The pipeline runs and verifies the results.
+```
+
+Typically, WordCount reads lines from a text file for input data. However, the provided sample tests create static input data containing text lines and use the `Create` transform to create an initial `PCollection`. Instead of writing the final `PCollection` to an output text file, the test pipelines use `PAssert` (Java) or `assert_that` (Python) to verify that the `PCollection` elements match a static list of strings containing the expected output data.
diff --git a/learning/prompts/documentation-lookup/49_transform_service.md b/learning/prompts/documentation-lookup/52_transform_service.md
similarity index 100%
rename from learning/prompts/documentation-lookup/49_transform_service.md
rename to learning/prompts/documentation-lookup/52_transform_service.md
diff --git a/learning/prompts/documentation-lookup/44_dataflow_runner.md b/learning/prompts/documentation-lookup/53_dataflow_runner.md
similarity index 100%
rename from learning/prompts/documentation-lookup/44_dataflow_runner.md
rename to learning/prompts/documentation-lookup/53_dataflow_runner.md
diff --git a/learning/prompts/documentation-lookup/45_spark_runner.md b/learning/prompts/documentation-lookup/54_spark_runner.md
similarity index 100%
rename from learning/prompts/documentation-lookup/45_spark_runner.md
rename to learning/prompts/documentation-lookup/54_spark_runner.md
diff --git a/learning/prompts/documentation-lookup/46_direct_runner.md b/learning/prompts/documentation-lookup/55_direct_runner.md
similarity index 100%
rename from learning/prompts/documentation-lookup/46_direct_runner.md
rename to learning/prompts/documentation-lookup/55_direct_runner.md
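
Note on 56_unit_test_your_pipeline.md: the new file lists "Testing individual functions" as the first testing level but only shows examples for the other two. A minimal Python sketch of that level follows, assuming the pipeline's per-element logic lives in plain functions. The names `split_words` and `format_count` are hypothetical and do not appear in the commit. Only the standard library is used, so no runner or `TestPipeline` is involved at this level.

```python
import re
import unittest


def split_words(line):
    """Hypothetical helper: split one text line into a list of words."""
    return re.findall(r"[A-Za-z']+", line)


def format_count(word, count):
    """Hypothetical helper: format a word and its count as 'word: count'."""
    return f"{word}: {count}"


class FunctionLevelTest(unittest.TestCase):
    # Each test exercises one plain function, independent of any pipeline.
    def test_split_words(self):
        self.assertEqual(split_words("hi sue bob"), ["hi", "sue", "bob"])
        self.assertEqual(split_words(""), [])

    def test_format_count(self):
        self.assertEqual(format_count("hi", 5), "hi: 5")
```

Tests like these can be run with `python -m unittest`. Functions written this way could then be passed to Beam transforms such as `beam.FlatMap`, and the transform-level and end-to-end tests described in the new file would cover how they are wired together.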