Posted to github@beam.apache.org by "tvalentyn (via GitHub)" <gi...@apache.org> on 2023/04/28 18:30:53 UTC

[GitHub] [beam] tvalentyn commented on a diff in pull request #25969: Add the example to learn transforms

tvalentyn commented on code in PR #25969:
URL: https://github.com/apache/beam/pull/25969#discussion_r1164661740


##########
examples/notebooks/get-started/learn_beam_transforms_by_doing.ipynb:
##########
@@ -0,0 +1,740 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "cellView": "form",
+        "id": "QgmD1wbmT4mj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Learn Beam PTransforms\n",
+        "\n",
+        "After this notebook, you should be able to:\n",
+        "1. Use user-defined functions in your `PTransforms`\n",
+        "2. Learn Beam SDK composite transforms\n",
+        "3. Create your own composite transforms to simplify your `Pipeline`\n",
+        "\n",
+        "For basic Beam `PTransforms`, please check out [this Notebook](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_basics_by_doing.ipynb).\n",
+        "\n",
+        "Beam Python SDK also provides [a list of built-in transforms](https://beam.apache.org/documentation/transforms/python/overview/).\n"
+      ],
+      "metadata": {
+        "id": "RuUHlGZjVt6W"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## How To Approach This Tutorial\n",
+        "\n",
+        "This tutorial was designed for someone who likes to learn by doing. There will be code cells where you can write your own code to test your understanding.\n",
+        "\n",
+        "As such, to get the most out of this tutorial, we strongly recommend typing code by hand as you’re working through the tutorial and not using copy/paste. This will help you develop muscle memory and a stronger understanding."
+      ],
+      "metadata": {
+        "id": "Ldx0Z7nWGopE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To begin, run the cell below to install and import Apache Beam."
+      ],
+      "metadata": {
+        "id": "jy1zaj4NDE0T"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "pNure-fW8hl3"
+      },
+      "outputs": [],
+      "source": [
+        "# Run a shell command and import beam\n",
+        "!pip install --quiet apache-beam\n",
+        "import apache_beam as beam\n",
+        "beam.__version__"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Set the logging level to reduce verbose information\n",
+        "import logging\n",
+        "\n",
+        "logging.root.setLevel(logging.ERROR)"
+      ],
+      "metadata": {
+        "id": "vyksB2VMtv3m"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n",
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "M1ku4nX_Gutb"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 1. Simple User-Defined Function (UDF)\n",
+        "\n",
+        "Some `PTransforms` allow you to run your own functions and user-defined code to specify how your transform is applied. For example, the below `CombineGlobally` transform,"

Review Comment:
   1. We could add a context link for `CombineGlobally`.
   2. The sentence is unfinished.
    
   Consider: The below [`CombineGlobally`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally/) transform uses a custom `bounded_sum` function to specify how the elements shall be aggregated:
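
To make the suggested sentence concrete, here is a minimal plain-Python sketch of what `bounded_sum` computes, with no Beam dependency. The `combine_in_chunks` helper is hypothetical: it only imitates a runner that might combine partial groups before a final pass, and it is not Beam's implementation.

```python
def bounded_sum(values, bound=500):
    # Sum the values, but cap the result at `bound`.
    return min(sum(values), bound)


def combine_in_chunks(fn, values, chunk_size=2, **kwargs):
    # Hypothetical helper (not a Beam API): apply `fn` to fixed-size
    # chunks first, then apply it again to the partial results.
    partials = [
        fn(values[i:i + chunk_size], **kwargs)
        for i in range(0, len(values), chunk_size)
    ]
    return fn(partials, **kwargs)


pc = [1, 10, 100, 1000]

small_sum = bounded_sum(pc)              # default bound of 500
large_sum = bounded_sum(pc, bound=5000)  # bound passed as a keyword argument

# Combining partial results gives the same answers for this input.
chunked_small = combine_in_chunks(bounded_sum, pc)
chunked_large = combine_in_chunks(bounded_sum, pc, bound=5000)

print(small_sum, large_sum, chunked_small, chunked_large)
```

For this non-negative input, applying the function to partial results and then to those results agrees with applying it once to everything, which is the property a combining function generally needs.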



##########
examples/notebooks/get-started/learn_beam_transforms_by_doing.ipynb:
##########
@@ -0,0 +1,755 @@
+{
+  "cells": [
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "cellView": "form",
+        "id": "QgmD1wbmT4mj"
+      },
+      "outputs": [],
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "RuUHlGZjVt6W"
+      },
+      "source": [
+        "# Learn Beam PTransforms\n",
+        "\n",
+        "After this notebook, you should be able to:\n",
+        "1. Use user-defined functions in your `PTransforms`\n",
+        "2. Learn Beam SDK composite transforms\n",
+        "3. Create your own composite transforms to simplify your `Pipeline`\n",
+        "\n",
+        "For basic Beam `PTransforms`, please check out [this Notebook](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_basics_by_doing.ipynb).\n",
+        "\n",
+        "Beam Python SDK also provides [a list of built-in transforms](https://beam.apache.org/documentation/transforms/python/overview/).\n"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Ldx0Z7nWGopE"
+      },
+      "source": [
+        "## How To Approach This Tutorial\n",
+        "\n",
+        "This tutorial is designed for someone who likes to learn by doing. There will be code cells where you can write your own code to test your understanding.\n",
+        "\n",
+        "As such, to get the most out of this tutorial, we strongly recommend typing code by hand as you’re working through the tutorial and not using copy/paste. This will help you develop muscle memory and a stronger understanding."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "jy1zaj4NDE0T"
+      },
+      "source": [
+        "To begin, run the cell below to install and import Apache Beam."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "pNure-fW8hl3"
+      },
+      "outputs": [],
+      "source": [
+        "# Run a shell command and import beam\n",
+        "!pip install --quiet apache-beam\n",
+        "import apache_beam as beam\n",
+        "beam.__version__"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "vyksB2VMtv3m"
+      },
+      "outputs": [],
+      "source": [
+        "# Set the logging level to reduce verbose information\n",
+        "import logging\n",
+        "\n",
+        "logging.root.setLevel(logging.ERROR)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "M1ku4nX_Gutb"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n",
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "0qDeT34SS1_8"
+      },
+      "source": [
+        "## 1. Simple User-Defined Function (UDF)\n",
+        "\n",
+        "Some `PTransforms` allow you to run your own functions and user-defined code to specify how your transform is applied. For example, the below `CombineGlobally` transform,"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "UZTWBGZ0TQWF"
+      },
+      "outputs": [],
+      "source": [
+        "pc = [1, 10, 100, 1000]\n",
+        "\n",
+        "# User-defined function\n",
+        "def bounded_sum(values, bound=500):\n",
+        "  return min(sum(values), bound)\n",
+        "\n",
+        "small_sum = pc | beam.CombineGlobally(bounded_sum)  # [500]\n",
+        "large_sum = pc | beam.CombineGlobally(bounded_sum, bound=5000)  # [1111]\n",
+        "\n",
+        "print(small_sum, large_sum)"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "UBFRcPO06xiV"
+      },
+      "source": [
+        "## 2. Transforms: ParDo and Combine\n",
+        "\n",
+        "A `ParDo` transform considers each element in the input `PCollection`, performs your user code to process each element, and emits zero, one, or multiple elements to an output `PCollection`. `Combine` is another Beam transform for combining collections of elements or values in your data.\n",
+        "Both allow flexible UDFs to define how you process the data."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "P4W-1HIiV-HP"
+      },
+      "source": [
+        "### 2.1 DoFn\n",
+        "\n",
+        "DoFn - a Beam Python class that defines a distributed processing function (used in [ParDo](https://beam.apache.org/documentation/programming-guide/#pardo))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "TjOzWnQd-dan"
+      },
+      "outputs": [],
+      "source": [
+        "data = [1, 2, 3, 4]\n",
+        "\n",
+        "# create a DoFn to multiply each element by five\n",
+        "# you can define the processing code under `process`\n",
+        "# which is required for a DoFn\n",
+        "class MultiplyByFive(beam.DoFn):\n",
+        "  def process(self, element):\n",
+        "    return [element*5]\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  outputs = (\n",
+        "      pipeline\n",
+        "      | 'Create values' >> beam.Create(data)\n",
+        "      | 'Multiply by 5' >> beam.ParDo(MultiplyByFive())\n",
+        "  )\n",
+        "\n",
+        "  outputs | beam.Map(print)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "1qL2crwNXQXe"
+      },
+      "source": [
+        "### 2.2 CombineFn\n",
+        "\n",
+        "CombineFn - define associative and commutative aggregations (used in [Combine](https://beam.apache.org/documentation/programming-guide/#combine))"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "zxLatbOa9FyA"
+      },
+      "outputs": [],
+      "source": [
+        "data = [1, 2, 3, 4]\n",
+        "\n",
+        "# create a CombineFn to get the product of all elements\n",
+        "# you need to provide four operations\n",
+        "class ProductFn(beam.CombineFn):\n",
+        "  def create_accumulator(self):\n",
+        "    # creates a new accumulator to store the initial value\n",
+        "    return 1\n",
+        "\n",
+        "  def add_input(self, current_prod, input):\n",
+        "    # adds an input element to an accumulator\n",
+        "    return current_prod*input\n",
+        "\n",
+        "  def merge_accumulators(self, accumulators):\n",
+        "    # merge several accumulators into a single accumulator\n",
+        "    prod = 1\n",
+        "    for accu in accumulators:\n",
+        "      prod *= accu\n",
+        "    return prod\n",
+        "\n",
+        "  def extract_output(self, prod):\n",
+        "    # performs the final computation\n",
+        "    return prod\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  outputs = (\n",
+        "      pipeline\n",
+        "      | 'Create values' >> beam.Create(data)\n",
+        "      | 'Compute product' >> beam.CombineGlobally(ProductFn())\n",
+        "  )\n",
+        "  outputs | beam.LogElements()\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "r1Vw1d5vJoIE"
+      },
+      "source": [
+        "Note: The above `DoFn` and `CombineFn` examples are for demonstration purposes. You could easily achieve the same functionality by using the simple function illustrated in section 1."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "bTer_URwS0wb"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qPnSfU5wLTN5"
+      },
+      "source": [
+        "## 3. Composite Transforms\n",
+        "\n",
+        "Now that you've learned the basic `PTransforms`, Beam allows you to simplify the process of processing and transforming your data through [Composite Transforms](https://beam.apache.org/documentation/programming-guide/#composite-transforms).\n",
+        "\n",
+        "Composite transforms can nest multiple transforms into a single composite transform, making your code easier to understand."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "4tBsLkeatNUU"
+      },
+      "source": [
+        "To see an example of this, let's take a look at how we can improve the `Pipeline` we built to count each word in Shakespeare's *King Lear*.\n",
+        "\n",
+        "Below is the `Pipeline` we built in the [WordCount tutorial](https://colab.research.google.com/drive/1_EncqFT_SmwXp7wlRqEf39m9efyrmm9p?usp=sharing):"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "Vokbrhhyto7H"
+      },
+      "outputs": [],
+      "source": [
+        "!mkdir -p data\n",
+        "!gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "R-4uyn0Tttr2"
+      },
+      "outputs": [],
+      "source": [
+        "import re\n",
+        "\n",
+        "# Function used to run and display the result\n",
+        "def run(cmd):\n",
+        "  print('>> {}'.format(cmd))\n",
+        "  !{cmd}\n",
+        "  print('')\n",
+        "\n",
+        "inputs_pattern = 'data/*'\n",
+        "outputs_prefix = 'outputs/part'\n",
+        "\n",
+        "# Running locally in the DirectRunner.\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  word_count = (\n",
+        "      pipeline\n",
+        "        | 'Read lines' >> beam.io.ReadFromText(inputs_pattern)\n",
+        "        | 'Find words' >> beam.FlatMap(lambda line: re.findall(r\"[a-zA-Z']+\", line))\n",
+        "        | 'Pair words with 1' >> beam.Map(lambda word: (word, 1))\n",
+        "        | 'Group and sum' >> beam.CombinePerKey(sum)\n",
+        "        | 'Write results' >> beam.io.WriteToText(outputs_prefix)\n",
+        "  )\n",
+        "\n",
+        "# Sample the first 20 results, remember there are no ordering guarantees.\n",
+        "run('head -n 20 {}-00000-of-*'.format(outputs_prefix))"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "wl8wLwZZtbnX"
+      },
+      "source": [
+        "Although the code above is a viable way to design your `Pipeline`, you can see that we use multiple transforms to perform one process:\n",
+        "1. `FlatMap` is used to find words in each line\n",
+        "2. `Map` is used to create key-value pairs with each word where the value is 1\n",
+        "3. `CombinePerKey` is used so that we can then group by each word and count up the sums\n",
+        "\n",
+        "All of these `PTransforms`, in combination, are meant to count each word in *King Lear*. You can simplify the process and combine these three transforms into one by using composite transforms.\n",
+        "\n",
+        "There are two approaches you can follow:\n",
+        "1. Using Beam SDK's built-in composite transforms\n",
+        "2. Creating your own composite transforms"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "nPG11AEkNMKK"
+      },
+      "source": [
+        "### 3.1 Beam SDK Composite Transforms\n",
+        "Beam makes it easy for you with its Beam SDK which comes with a package of many useful composite transforms. We will only cover one in this tutorial but to see a list of transforms you can use, see the following API reference page: [Pre-written Beam Transforms for Python](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.html).\n"

Review Comment:
   ```suggestion
           "Beam allows combining a sequence of transforms into a composite transform. Many of Beam's handy pre-written transforms are composite transforms under the hood. We cover one in this tutorial, but to see other transforms you can use, check out the following API reference pages: [Beam Transforms Package](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.html), [Beam ML Package](https://beam.apache.org/releases/pydoc/current/apache_beam.ml).\n"
   ```



##########
examples/notebooks/get-started/learn_beam_transforms_by_doing.ipynb:
##########
@@ -0,0 +1,740 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "cellView": "form",
+        "id": "QgmD1wbmT4mj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Learn Beam PTransforms\n",
+        "\n",
+        "After this notebook, you should be able to:\n",
+        "1. Use user-defined functions in your `PTransforms`\n",
+        "2. Learn Beam SDK composite transforms\n",
+        "3. Create your own composite transforms to simplify your `Pipeline`\n",
+        "\n",
+        "For basic Beam `PTransforms`, please check out [this Notebook](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_basics_by_doing.ipynb).\n",
+        "\n",
+        "Beam Python SDK also provides [a list of built-in transforms](https://beam.apache.org/documentation/transforms/python/overview/).\n"
+      ],
+      "metadata": {
+        "id": "RuUHlGZjVt6W"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## How To Approach This Tutorial\n",

Review Comment:
   Should we mention the complete beginner tutorial as a prerequisite for this one? Or say that this tutorial assumes some familiarity with the Beam model, see: x, y, z.



##########
examples/notebooks/get-started/learn_beam_transforms_by_doing.ipynb:
##########
@@ -0,0 +1,740 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [

Review Comment:
   Let's add an `Open in Colab` button.



##########
examples/notebooks/get-started/learn_beam_transforms_by_doing.ipynb:
##########
@@ -0,0 +1,740 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "cellView": "form",
+        "id": "QgmD1wbmT4mj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Learn Beam PTransforms\n",
+        "\n",
+        "After this notebook, you should be able to:\n",
+        "1. Use user-defined functions in your `PTransforms`\n",
+        "2. Learn Beam SDK composite transforms\n",
+        "3. Create your own composite transforms to simplify your `Pipeline`\n",
+        "\n",
+        "For basic Beam `PTransforms`, please check out [this Notebook](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_basics_by_doing.ipynb).\n",
+        "\n",
+        "Beam Python SDK also provides [a list of built-in transforms](https://beam.apache.org/documentation/transforms/python/overview/).\n"
+      ],
+      "metadata": {
+        "id": "RuUHlGZjVt6W"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## How To Approach This Tutorial\n",
+        "\n",
+        "This tutorial was designed for someone who likes to learn by doing. There will be code cells where you can write your own code to test your understanding.\n",
+        "\n",
+        "As such, to get the most out of this tutorial, we strongly recommend typing code by hand as you’re working through the tutorial and not using copy/paste. This will help you develop muscle memory and a stronger understanding."
+      ],
+      "metadata": {
+        "id": "Ldx0Z7nWGopE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To begin, run the cell below to install and import Apache Beam."
+      ],
+      "metadata": {
+        "id": "jy1zaj4NDE0T"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "pNure-fW8hl3"
+      },
+      "outputs": [],
+      "source": [
+        "# Run a shell command and import beam\n",
+        "!pip install --quiet apache-beam\n",
+        "import apache_beam as beam\n",
+        "beam.__version__"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Set the logging level to reduce verbose information\n",
+        "import logging\n",
+        "\n",
+        "logging.root.setLevel(logging.ERROR)"
+      ],
+      "metadata": {
+        "id": "vyksB2VMtv3m"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n",
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "M1ku4nX_Gutb"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 1. Simple User-Defined Function (UDF)\n",
+        "\n",
+        "Some `PTransforms` allow you to run your own functions and user-defined code to specify how your transform is applied. For example, the below `CombineGlobally` transform,"
+      ],
+      "metadata": {
+        "id": "0qDeT34SS1_8"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "pc = [1, 10, 100, 1000]\n",
+        "\n",
+        "# User-defined function\n",
+        "def bounded_sum(values, bound=500):\n",
+        "  return min(sum(values), bound)\n",
+        "\n",
+        "small_sum = pc | beam.CombineGlobally(bounded_sum)  # [500]\n",
+        "large_sum = pc | beam.CombineGlobally(bounded_sum, bound=5000)  # [1111]\n",
+        "\n",
+        "print(small_sum, large_sum)"
+      ],
+      "metadata": {
+        "id": "UZTWBGZ0TQWF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2. Transforms: ParDo and Combine\n",
+        "\n",
+        "A `ParDo` transform considers each element in the input `PCollection`, performs your user code to process each element, and emits zero, one, or multiple elements to an output `PCollection`. `Combine` is another Beam transform for combining collections of elements or values in your data.\n",
+        "Both allow flexible UDFs to define how you process the data."
+      ],
+      "metadata": {
+        "id": "UBFRcPO06xiV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### 2.1 DoFn\n",
+        "\n",
+        "DoFn - a Beam Python class that defines a distributed processing function (used in [ParDo](https://beam.apache.org/documentation/programming-guide/#pardo))"
+      ],
+      "metadata": {
+        "id": "P4W-1HIiV-HP"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "data = [1, 2, 3, 4]\n",
+        "\n",
+        "# create a DoFn to multiply each element by five\n",
+        "# you can define the processing code under `process`\n",
+        "class MultiplyByFive(beam.DoFn):\n",
+        "  def process(self, element):\n",
+        "    return [element*5]\n",

Review Comment:
    `yield element*5`
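
For context on the `yield` suggestion, here is a sketch in plain Python, assuming only that the caller iterates over whatever `process` hands back. The two classes are stand-ins that do not subclass `beam.DoFn`, and `run_process` is a hypothetical driver rather than a Beam API.

```python
class MultiplyByFiveReturn:
    def process(self, element):
        # Builds and returns a one-element list for every input.
        return [element * 5]


class MultiplyByFiveYield:
    def process(self, element):
        # Generator form: emits the output element lazily.
        yield element * 5


def run_process(fn, data):
    # Hypothetical driver (not a Beam API): flatten the iterable that
    # process() gives back for each input element.
    outputs = []
    for element in data:
        outputs.extend(fn.process(element))
    return outputs


data = [1, 2, 3, 4]
returned = run_process(MultiplyByFiveReturn(), data)
yielded = run_process(MultiplyByFiveYield(), data)

print(returned, yielded)
```

Both forms emit the same elements here. The generator form avoids building a list per element and extends naturally to emitting zero or several outputs.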



##########
examples/notebooks/get-started/learn_beam_transforms_by_doing.ipynb:
##########
@@ -0,0 +1,740 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "code",
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "metadata": {
+        "cellView": "form",
+        "id": "QgmD1wbmT4mj"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Learn Beam PTransforms\n",
+        "\n",
+        "After this notebook, you should be able to:\n",
+        "1. Use user-defined functions in your `PTransforms`\n",
+        "2. Learn Beam SDK composite transforms\n",
+        "3. Create your own composite transforms to simplify your `Pipeline`\n",
+        "\n",
+        "For basic Beam `PTransforms`, please check out [this Notebook](https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/learn_beam_basics_by_doing.ipynb).\n",
+        "\n",
+        "Beam Python SDK also provides [a list of built-in transforms](https://beam.apache.org/documentation/transforms/python/overview/).\n"
+      ],
+      "metadata": {
+        "id": "RuUHlGZjVt6W"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## How To Approach This Tutorial\n",
+        "\n",
+        "This tutorial was designed for someone who likes to learn by doing. There will be code cells where you can write your own code to test your understanding.\n",
+        "\n",
+        "As such, to get the most out of this tutorial, we strongly recommend typing code by hand as you’re working through the tutorial and not using copy/paste. This will help you develop muscle memory and a stronger understanding."
+      ],
+      "metadata": {
+        "id": "Ldx0Z7nWGopE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To begin, run the cell below to install and import Apache Beam."
+      ],
+      "metadata": {
+        "id": "jy1zaj4NDE0T"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "pNure-fW8hl3"
+      },
+      "outputs": [],
+      "source": [
+        "# Install Apache Beam with pip (a shell command), then import it\n",
+        "!pip install --quiet apache-beam\n",
+        "import apache_beam as beam\n",
+        "beam.__version__"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Raise the logging level to ERROR to suppress verbose output\n",
+        "import logging\n",
+        "\n",
+        "logging.root.setLevel(logging.ERROR)"
+      ],
+      "metadata": {
+        "id": "vyksB2VMtv3m"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n",
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "M1ku4nX_Gutb"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 1. Simple User-Defined Function (UDF)\n",
+        "\n",
+        "Some `PTransforms` allow you to run your own functions and user-defined code to specify how your transform is applied. For example, the `CombineGlobally` transform below applies the user-defined function `bounded_sum`:"
+      ],
+      "metadata": {
+        "id": "0qDeT34SS1_8"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "pc = [1, 10, 100, 1000]\n",
+        "\n",
+        "# User-defined function\n",
+        "def bounded_sum(values, bound=500):\n",
+        "  return min(sum(values), bound)\n",
+        "\n",
+        "small_sum = pc | beam.CombineGlobally(bounded_sum)  # [500]\n",
+        "large_sum = pc | beam.CombineGlobally(bounded_sum, bound=5000)  # [1111]\n",
+        "\n",
+        "print(small_sum, large_sum)"
+      ],
+      "metadata": {
+        "id": "UZTWBGZ0TQWF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 2. Transforms: ParDo and Combine\n",
+        "\n",
+        "A `ParDo` transform considers each element in the input `PCollection`, runs your user code on it, and emits zero, one, or multiple elements to an output `PCollection`. `Combine` is another Beam transform that combines collections of elements or values in your data.\n",
+        "Both accept user-defined functions (UDFs) that define how the data is processed."
+      ],
+      "metadata": {
+        "id": "UBFRcPO06xiV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### 2.1 DoFn\n",
+        "\n",
+        "`DoFn` is a Beam Python class that defines a distributed processing function. It is used in [ParDo](https://beam.apache.org/documentation/programming-guide/#pardo)."
+      ],
+      "metadata": {
+        "id": "P4W-1HIiV-HP"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "data = [1, 2, 3, 4]\n",
+        "\n",
+        "# Create a DoFn that multiplies each element by five.\n",
+        "# The processing code goes in the `process` method.\n",
+        "class MultiplyByFive(beam.DoFn):\n",
+        "  def process(self, element):\n",
+        "    return [element*5]\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  outputs = (\n",
+        "      pipeline\n",
+        "      | 'Create values' >> beam.Create(data)\n",
+        "      | 'Multiply by 5' >> beam.ParDo(MultiplyByFive())\n",
+        "  )\n",
+        "\n",
+        "  outputs | beam.Map(print)"
+      ],
+      "metadata": {
+        "id": "TjOzWnQd-dan"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### 2.2 CombineFn\n",
+        "\n",
+        "`CombineFn` is a Beam Python class that defines associative and commutative aggregations. It is used in [Combine](https://beam.apache.org/documentation/programming-guide/#combine)."
+      ],
+      "metadata": {
+        "id": "1qL2crwNXQXe"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "data = [1, 2, 3, 4]\n",
+        "\n",
+        "# Create a CombineFn that computes the product of all elements.\n",
+        "# You need to provide four operations.\n",
+        "class ProductFn(beam.CombineFn):\n",
+        "  def create_accumulator(self):\n",
+        "    # creates a new accumulator to store the initial value\n",
+        "    return 1\n",
+        "\n",
+        "  def add_input(self, current_prod, input):\n",
+        "    # adds an input element to an accumulator\n",
+        "    return current_prod*input\n",
+        "\n",
+        "  def merge_accumulators(self, accumulators):\n",
+        "    # merge several accumulators into a single accumulator\n",
+        "    prod = 1\n",
+        "    for accu in accumulators:\n",
+        "      prod *= accu\n",
+        "    return prod\n",
+        "\n",
+        "  def extract_output(self, prod):\n",
+        "    # performs the final computation\n",
+        "    return prod\n",
+        "\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  outputs = (\n",
+        "      pipeline\n",
+        "      | 'Create values' >> beam.Create(data)\n",
+        "      | 'Compute product' >> beam.CombineGlobally(ProductFn())\n",
+        "  )\n",
+        "  outputs | beam.LogElements()\n"
+      ],
+      "metadata": {
+        "id": "zxLatbOa9FyA"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Note: The above `DoFn` and `CombineFn` examples are for demonstration purposes. You could easily achieve the same functionality by using the simple function illustrated in section 1."
+      ],
+      "metadata": {
+        "id": "r1Vw1d5vJoIE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "bTer_URwS0wb"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 3. Composite Transforms\n",
+        "\n",
+        "Now that you've learned the basic `PTransforms`, you can simplify how you process and transform your data with [Composite Transforms](https://beam.apache.org/documentation/programming-guide/#composite-transforms).\n",
+        "\n",
+        "Composite transforms can nest multiple transforms into a single composite transform, making your code easier to understand."
+      ],
+      "metadata": {
+        "id": "qPnSfU5wLTN5"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To see an example of this, let's take a look at how we can improve the `Pipeline` we built to count each word in Shakespeare's *King Lear*.\n",
+        "\n",
+        "Below is the `Pipeline` we built in the [WordCount tutorial](https://colab.research.google.com/drive/1_EncqFT_SmwXp7wlRqEf39m9efyrmm9p?usp=sharing):"

Review Comment:
   We need to clone this notebook first and add it to the repo. Looks like it might be unfinished content.


