Posted to github@beam.apache.org by "robertwb (via GitHub)" <gi...@apache.org> on 2024/02/02 18:22:36 UTC

Re: [PR] [yaml] Add Beam YAML Examples and Getting started docs [beam]

robertwb commented on code in PR #30003:
URL: https://github.com/apache/beam/pull/30003#discussion_r1476495183


##########
sdks/python/apache_beam/yaml/examples/transforms/aggregation/combine_sum.yaml:
##########
@@ -0,0 +1,58 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+pipeline:
+  type: chain
+  transforms:
+    - type: Create
+      name: Create produce
+      config:
+        elements:
+          - recipe: 'pie'
+            fruit: 'raspberry'
+            quantity: 1
+            unit_price: 3.50
+          - recipe: 'pie'
+            fruit: 'blackberry'
+            quantity: 1
+            unit_price: 4.00
+          - recipe: 'pie'
+            fruit: 'blueberry'
+            quantity: 1
+            unit_price: 2.00
+          - recipe: 'muffin'
+            fruit: 'blueberry'
+            quantity: 2
+            unit_price: 2.00
+          - recipe: 'muffin'
+            fruit: 'banana'
+            quantity: 3
+            unit_price: 1.00
+    - type: Combine
+      name: Sum values per key
+      config:
+        language: python
+        group_by: fruit
+        combine:
+          total_quantity: 

Review Comment:
   Yep, this is what I was thinking. 



##########
sdks/python/apache_beam/yaml/examples/examples_test.py:
##########
@@ -0,0 +1,118 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# pytype: skip-file
+import glob
+import logging
+import os
+import unittest
+from unittest import mock
+
+from hamcrest.core import assert_that as hamcrest_assert
+from hamcrest.library.collection import has_items
+
+import apache_beam as beam
+from apache_beam.examples.snippets.util import assert_matches_stdout
+from apache_beam.testing.test_pipeline import TestPipeline
+from apache_beam.testing.util import assert_that
+from apache_beam.yaml import cache_provider_artifacts
+from apache_beam.yaml import main
+
+
+def check_output(expected, matcher):
+  def _check_inner(actual):
+    formatted_actual = actual | beam.Map(
+        lambda row: str(beam.Row(**row._asdict())))
+    matcher(formatted_actual, expected)
+
+  return _check_inner
+
+
+def create_test_method(pipeline_spec_file, custom_matcher=None):
+  @mock.patch('apache_beam.Pipeline', TestPipeline)
+  def test_yaml_example(self):
+    with open(pipeline_spec_file) as f:
+      lines = f.readlines()
+      expected_key = '# Expected:\n'
+      if expected_key in lines:
+        expected = lines[lines.index('# Expected:\n') + 1:]
+      else:
+        raise ValueError(
+            f"Missing '# Expected:' tag in example file '{pipeline_spec_file}'")
+      for i, line in enumerate(expected):
+        expected[i] = line.replace('#  ', '').replace('\n', '')
+
+      cache_provider_artifacts.cache_provider_artifacts()
+      matcher = assert_matches_stdout if not custom_matcher else custom_matcher
+      main.run(
+          argv=[f"--pipeline_spec_file={pipeline_spec_file}"],
+          test=check_output(expected, matcher))
+
+  return test_yaml_example
+
+
+class YamlExamplesTestSuite:
+  _custom_matchers = {}
+
+  def __init__(self, name, path):
+    self._test_suite = self.create_test_suite(name, path)
+
+  def run(self):
+    return self._test_suite
+
+  @classmethod
+  def parse_test_methods(cls, path):
+    files = glob.glob(os.path.join(path, r'*.yaml'))

Review Comment:
   Nit: maybe accept a glob string directly rather than assuming `path/*.yaml` to be more flexible? 
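One way to read the suggestion above, as a minimal sketch only: the helper takes the full glob pattern from the caller instead of joining a directory with `*.yaml`. The name `find_example_files` and the temporary directory layout are hypothetical and are not part of the PR; only the standard-library `glob` module is used.

```python
import glob
import os
import tempfile


def find_example_files(pattern):
  """Return the sorted files matching a glob pattern (hypothetical helper)."""
  # recursive=True lets callers use '**' to descend into subdirectories.
  return sorted(glob.glob(pattern, recursive=True))


# The caller chooses the pattern; the helper no longer appends '*.yaml'.
with tempfile.TemporaryDirectory() as tmp:
  os.makedirs(os.path.join(tmp, 'transforms'))
  for rel in ('a.yaml', os.path.join('transforms', 'b.yaml'), 'notes.txt'):
    open(os.path.join(tmp, rel), 'w').close()
  # Only the YAML files directly under tmp.
  top_level = find_example_files(os.path.join(tmp, '*.yaml'))
  # YAML files under tmp at any depth.
  nested = find_example_files(os.path.join(tmp, '**', '*.yaml'))
```

With this shape, a caller that wants the current behaviour passes `os.path.join(path, '*.yaml')`, while a caller that wants nested example directories passes a `**` pattern.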



##########
sdks/python/apache_beam/yaml/examples/wordcount_minimal.yaml:
##########
@@ -0,0 +1,74 @@
+# coding=utf-8
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This example reads from a public file stored on Google Cloud. This
+# requires authenticating with Google Cloud, or setting the file in
+# `ReadFromText` to a local file.
+#
+# To set up Application Default Credentials,
+# see https://cloud.google.com/docs/authentication/external/set-up-adc for more
+# information
+#
+# This pipeline reads in a text file, maps all words to a value of "1", sums
+# the value of all unique word keys, then logs a formatted word, count pair.
+pipeline:
+  type: chain
+  transforms:
+    - type: ReadFromText
+      config:
+        path: gs://dataflow-samples/shakespeare/kinglear.txt
+    - type: MapToFields
+      config:
+        language: python
+        fields:
+          word:
+            callable: |
+              import re
+              def my_mapping(row):
+                return re.findall(r"[A-Za-z\']+", row.line.lower())
+          count: "1"

Review Comment:
   I wonder if it'd be easier to follow if we exploded the words, then added the "count" field. 
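To illustrate the ordering suggested above, here is a plain-Python sketch of the data flow, not Beam YAML and not the PR's pipeline: split each line into a list of words, explode the list into one row per word, and only then attach the count. The helpers `split_words` and `explode` and the sample lines are hypothetical; the regular expression is the one from the example under review.

```python
import re
from collections import Counter


def split_words(line):
  # Same regular expression as the callable in the example under review.
  return re.findall(r"[A-Za-z\']+", line.lower())


def explode(rows, field):
  """Yield one row per element of the list stored under `field`."""
  for row in rows:
    for value in row[field]:
      yield {**row, field: value}


lines = ['The king, the KING!', "Lear's fool"]
# Step 1: one row per line, with a list-valued 'word' field.
rows = [{'word': split_words(line)} for line in lines]
# Step 2: explode so that each row carries a single word.
exploded = list(explode(rows, 'word'))
# Step 3: only now attach the constant count to each single-word row.
counted = [{**row, 'count': 1} for row in exploded]
# Step 4: sum the counts per word.
totals = Counter()
for row in counted:
  totals[row['word']] += row['count']
```

In this order the count is attached to a row that holds exactly one word, which may be easier to follow than a count sitting next to a list of words.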


