You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by da...@apache.org on 2024/03/05 17:48:54 UTC

(beam) branch master updated: Duet AI Prompt - Beam YAML (no links) (#30440)

This is an automated email from the ASF dual-hosted git repository.

damccorm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 9ec2d49af98 Duet AI Prompt - Beam YAML (no links) (#30440)
9ec2d49af98 is described below

commit 9ec2d49af98fc44ea962beb2051575fea882aa96
Author: Daria Bezkorovaina <99...@users.noreply.github.com>
AuthorDate: Tue Mar 5 17:48:47 2024 +0000

    Duet AI Prompt - Beam YAML (no links) (#30440)
    
    * Create 50_beam_yaml.md
    
    * Update 50_beam_yaml.md
    
    * Update 50_beam_yaml.md
    
    Implement feedback
    
    * Update 50_beam_yaml.md
    
    * Update 50_beam_yaml.md
---
 .../documentation-lookup-nolinks/50_beam_yaml.md   | 72 ++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/learning/prompts/documentation-lookup-nolinks/50_beam_yaml.md b/learning/prompts/documentation-lookup-nolinks/50_beam_yaml.md
new file mode 100644
index 00000000000..aaedce7b344
--- /dev/null
+++ b/learning/prompts/documentation-lookup-nolinks/50_beam_yaml.md
@@ -0,0 +1,72 @@
+Prompt:
+Can I create Apache Beam pipelines using YAML?
+
+Response:
+The Beam YAML API enables the creation of basic Apache Beam pipelines using YAML files, which can be edited with any text editor. The API provides an intuitive, declarative syntax that simplifies pipeline authoring by eliminating the need for coding experience or SDK familiarity. Manipulating and understanding YAML semantics is easier compared to working directly with Beam protos. Additionally, YAML's readability and ease of authorship make it a suitable intermediate representation for t [...]
+
+The Beam YAML parser is currently integrated into the Apache Beam Python SDK.
+
+The Beam YAML SDK offers a wide range of built-in transforms and allows the creation of custom transforms through the notion of `providers` that leverage expansion services and schema transforms, or by integrating Python packages into your pipeline. You can give names to transforms to help you monitor, debug, and disambiguate identical transforms, enhancing readability.
+
+Beam YAML enables authoring both linear and non-linear pipelines. For linear pipelines, Beam YAML allows the inputs to be implicit by specifying the pipeline as a `chain` type. Additionally, you can denote the first and last transforms as `source` and `sink`, respectively. For non-linear pipelines, inputs must be explicitly named, although `chains` can be nested within them.
+
+The following Beam YAML code snippet demonstrates a linear pipeline that reads data from a CSV file, filters rows based on a specified condition, executes an SQL query, and writes the resulting data to a JSON file:
+
+```yaml
+pipeline:
+  type: chain
+
+  source:
+    type: ReadFromCsv
+    config:
+      path: /path/to/input*.csv
+
+  transforms:
+    - type: Filter
+      config:
+        language: python
+        keep: "col3 > 100"
+
+    - type: Sql
+      name: MySqlTransform
+      config:
+        query: "select col1, count(*) as cnt from PCOLLECTION group by col1"
+
+  sink:
+    type: WriteToJson
+    config:
+      path: /path/to/output.json
+```
+
+With Beam YAML, you can define both streaming and batch pipelines. Beam YAML also supports Apache Beam’s windowing and triggering, allowing windowing to be declared using the standard `WindowInto` transform or by tagging a transform with specified windowing. The latter applies the windowing to its inputs and the transform itself.
+
+The following example demonstrates a streaming Beam YAML pipeline with a chain of transforms. It reads from a Pub/Sub topic, applies a grouping transform with sliding windowing, and then writes the output to another Pub/Sub topic:
+
+```yaml
+pipeline:
+  type: chain
+  transforms:
+    - type: ReadFromPubSub
+      config:
+        topic: myPubSubTopic
+        format: ...
+        schema: ...
+    - type: SomeGroupingTransform
+      config:
+        arg: ...
+      windowing:
+        type: sliding
+        size: 60s
+        period: 10s
+    - type: WriteToPubSub
+      config:
+        topic: anotherPubSubTopic
+        format: json
+```
+
+You can execute a pipeline defined in a YAML file using the standard `python -m` command:
+
+```python
+python -m apache_beam.yaml.main --yaml_pipeline_file=/path/to/pipeline.yaml [other pipeline options such as the runner]
+```
+