Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/03/03 03:42:27 UTC

[GitHub] [beam] rosetn commented on a change in pull request #14045: [BEAM-10937] Add reading and writing data notebook

rosetn commented on a change in pull request #14045:
URL: https://github.com/apache/beam/pull/14045#discussion_r586082067



##########
File path: examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb
##########
@@ -0,0 +1,939 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "Reading and writing data -- Tour of Beam",

Review comment:
       General note about this notebook: consider using text formatting more sparingly. 
   
   * If you're using italics to define terms, don't italicize words that you're not trying to define or explain. I'm actually a little confused about why some terms are italicized; it's possible that you're trying to make a new user aware of important terms, but pulling those terms out into a bulleted list or naming your headers in a way that directs attention to them might be a better choice.
   * On a similar note about bolding, I'm not sure why Source and Sink are bolded. 
   * Here are some general guidelines: https://developers.google.com/style/text-formatting. We don't need to follow them strictly, but we should be deliberate in our choices.

##########
File path: examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb
##########
@@ -0,0 +1,939 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "Reading and writing data -- Tour of Beam",
+      "provenance": [],
+      "collapsed_sections": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "cellView": "form",
+        "id": "upmJn_DjcThx"
+      },
+      "source": [
+        "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+        "\n",
+        "# Licensed to the Apache Software Foundation (ASF) under one\n",
+        "# or more contributor license agreements. See the NOTICE file\n",
+        "# distributed with this work for additional information\n",
+        "# regarding copyright ownership. The ASF licenses this file\n",
+        "# to you under the Apache License, Version 2.0 (the\n",
+        "# \"License\"); you may not use this file except in compliance\n",
+        "# with the License. You may obtain a copy of the License at\n",
+        "#\n",
+        "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+        "#\n",
+        "# Unless required by applicable law or agreed to in writing,\n",
+        "# software distributed under the License is distributed on an\n",
+        "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+        "# KIND, either express or implied. See the License for the\n",
+        "# specific language governing permissions and limitations\n",
+        "# under the License."
+      ],
+      "execution_count": 95,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5UC_aGanx6oE"
+      },
+      "source": [
+        "# Reading and writing data -- _Tour of Beam_\n",
+        "\n",
+        "So far we've learned some of the basic transforms like\n",
+        "[`Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map) _(one-to-one)_,\n",
+        "[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) _(one-to-many)_,\n",
+        "[`Filter`](https://beam.apache.org/documentation/transforms/python/elementwise/filter) _(one-to-zero-or-one)_,\n",
+        "[`Combine`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally) _(many-to-one)_, and\n",
+        "[`GroupByKey`](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey).\n",
+        "These allow us to transform data in any way, but so far we've created data from an in-memory\n",
+        "[`iterable`](https://docs.python.org/3/glossary.html#term-iterable), like a `List`, using\n",
+        "[`Create`](https://beam.apache.org/documentation/transforms/python/other/create).\n",
+        "\n",
+        "This works well for experimenting with small datasets. For larger datasets we use a **`Source`** transform to read data and a **`Sink`** transform to write data.\n",
+        "\n",
+        "Let's create some data files and see how we can read them in Beam."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "R_Yhoc6N_Flg"
+      },
+      "source": [
+        "# Install apache-beam with pip.\n",
+        "!pip install --quiet apache-beam\n",
+        "\n",
+        "# Create a directory for our data files.\n",
+        "!mkdir -p data"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "sQUUi4H9s-g2"
+      },
+      "source": [
+        "%%writefile data/my-text-file-1.txt\n",
+        "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+        "Each line in the file is one element in the PCollection."
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "BWVVeTSOlKug"
+      },
+      "source": [
+        "%%writefile data/my-text-file-2.txt\n",
+        "There are no guarantees on the order of the elements.\n",
+        "ฅ^•ﻌ•^ฅ"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "NhCws6ncbDJG"
+      },
+      "source": [
+        "%%writefile data/penguins.csv\n",
+        "species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g\n",
+        "0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667\n",
+        "0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556\n",
+        "1,0.5236363636363636,0.5714285714285713,0.3389830508474576,0.2222222222222222\n",
+        "1,0.6509090909090909,0.7619047619047619,0.4067796610169492,0.3333333333333333\n",
+        "2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5\n",
+        "2,0.6509090909090909,0.38095238095238104,0.9830508474576272,0.8333333333333334"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "_OkWHiAvpWDZ"
+      },
+      "source": [
+        "# Reading from text files\n",
+        "\n",
+        "We can use the\n",
+        "[`ReadFromText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText)\n",
+        "transform to read text files into `str` elements.\n",
+        "\n",
+        "It takes a\n",
+        "[_glob pattern_](https://en.wikipedia.org/wiki/Glob_%28programming%29)\n",
+        "as an input, and reads all the files that match that pattern.\n",
+        "It returns one element for each line in the file.\n",
+        "\n",
+        "For example, in the pattern `data/*.txt`, the `*` is a wildcard that matches anything. This pattern matches all the files in the `data/` directory with a `.txt` extension."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "xDXdE9uysriw",
+        "outputId": "f5d58b5d-892a-4a42-89c5-b78f1d329cf3"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "input_files = 'data/*.txt'\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Read files' >> beam.io.ReadFromText(input_files)\n",
+        "      | 'Print contents' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 96,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "There are no guarantees on the order of the elements.\n",
+            "ฅ^•ﻌ•^ฅ\n",
+            "This is just a plain text file, UTF-8 strings are allowed 🎉.\n",
+            "Each line in the file is one element in the PCollection.\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "9-2wmzEWsdrb"
+      },
+      "source": [
+        "# Writing to text files\n",
+        "\n",
+        "We can use the\n",
+        "[`WriteToText`](https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText) transform to write `str` elements into text files.\n",
+        "\n",
+        "It takes a _file path prefix_ as an input, and it writes all the `str` elements into one or more files with filenames starting with that prefix. You can optionally pass a `file_name_suffix` as well, usually used for the file extension. Each element goes into its own line in the output files."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "nkPlfoTfz61I"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "output_file_name_prefix = 'outputs/file'\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Create file lines' >> beam.Create([\n",
+        "          'Each element must be a string.',\n",
+        "          'It writes one element per line.',\n",
+        "          'There are no guarantees on the line order.',\n",
+        "          'The data might be written into multiple files.',\n",
+        "      ])\n",
+        "      | 'Write to files' >> beam.io.WriteToText(\n",
+        "          output_file_name_prefix,\n",
+        "          file_name_suffix='.txt')\n",
+        "  )"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "8au0yJSd1itt",
+        "outputId": "d7e72785-9fa8-4a2b-c6d0-4735aac8e206"
+      },
+      "source": [
+        "# Let's look at the output files and their contents.\n",
+        "!head outputs/file*.txt"
+      ],
+      "execution_count": 98,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "Each element must be a string.\n",
+            "It writes one element per line.\n",
+            "There are no guarantees on the line order.\n",
+            "The data might be written into multiple files.\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "21CCdZispqYK"
+      },
+      "source": [
+        "# Reading data\n",
+        "\n",
+        "Your data might reside in various input formats. Take a look at the\n",
+        "[Built-in I/O Transforms](https://beam.apache.org/documentation/io/built-in)\n",
+        "page for a list of all the available I/O transforms in Beam.\n",
+        "\n",
+        "If none of those work for you, you might need to create your own input transform.\n",
+        "\n",
+        "> ℹ️ For a more in-depth guide, take a look at the\n",
+        "[Developing a new I/O connector](https://beam.apache.org/documentation/io/developing-io-overview) page."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "7dQEym1QRG4y"
+      },
+      "source": [
+        "## Reading from an `iterable`\n",
+        "\n",
+        "The easiest way to create elements is using\n",
+        "[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).\n",
+        "\n",
+        "A common way is having a [`generator`](https://docs.python.org/3/glossary.html#term-generator) function. This could take an input and _expand_ it into a large number of elements. The nice thing about `generator`s is that they don't have to fit everything into memory like a `list`; they simply\n",
+        "[`yield`](https://docs.python.org/3/reference/simple_stmts.html#yield)\n",
+        "elements as they process them.\n",
+        "\n",
+        "For example, let's define a `generator` called `count`, that `yield`s the numbers from `0` to `n`. We use `Create` for the initial `n` value(s) and then expand them with `FlatMap`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "wR6WY6wOMVhb",
+        "outputId": "232e9fb3-4054-4eaf-9bd0-1adc4435b220"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "def count(n):\n",
+        "  for i in range(n):\n",
+        "    yield i\n",
+        "\n",
+        "n = 5\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | 'Create inputs' >> beam.Create([n])\n",
+        "      | 'Generate elements' >> beam.FlatMap(count)\n",
+        "      | 'Print elements' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 8,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "0\n",
+            "1\n",
+            "2\n",
+            "3\n",
+            "4\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "G4fw7NE1RQNf"
+      },
+      "source": [
+        "## Creating an input transform\n",
+        "\n",
+        "For a nicer interface, we could abstract the `Create` and the `FlatMap` into a custom `PTransform`. This would give a more intuitive way to use it, while hiding the inner workings.\n",
+        "\n",
+        "We create a new class that inherits from `beam.PTransform`. Any input from the generator function, like `n`, becomes a class field. The generator function itself would now become a\n",
+        "[`staticmethod`](https://docs.python.org/3/library/functions.html#staticmethod).\n",
+        "And we can hide the `Create` and `FlatMap` in the `expand` method.\n",
+        "\n",
+        "Now we can use our transform in a more intuitive way, just like `ReadFromText`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "m8iXqE1CRnn5",
+        "outputId": "019f3b32-74c5-4860-edee-1c8553f200bb"
+      },
+      "source": [
+        "import apache_beam as beam\n",
+        "\n",
+        "class Count(beam.PTransform):\n",
+        "  def __init__(self, n):\n",
+        "    self.n = n\n",
+        "\n",
+        "  @staticmethod\n",
+        "  def count(n):\n",
+        "    for i in range(n):\n",
+        "      yield i\n",
+        "\n",
+        "  def expand(self, pcollection):\n",
+        "    return (\n",
+        "        pcollection\n",
+        "        | 'Create inputs' >> beam.Create([self.n])\n",
+        "        | 'Generate elements' >> beam.FlatMap(Count.count)\n",
+        "    )\n",
+        "\n",
+        "n = 5\n",
+        "with beam.Pipeline() as pipeline:\n",
+        "  (\n",
+        "      pipeline\n",
+        "      | f'Count to {n}' >> Count(n)\n",
+        "      | 'Print elements' >> beam.Map(print)\n",
+        "  )"
+      ],
+      "execution_count": 9,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "text": [
+            "0\n",
+            "1\n",
+            "2\n",
+            "3\n",
+            "4\n"
+          ],
+          "name": "stdout"
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "e02_vFmUg-mK"
+      },
+      "source": [
+        "## Example: Reading CSV files\n",
+        "\n",
+        "Lets say we want to read CSV files get elements as `dict`s. We like how `ReadFromText` expands a file pattern, but we might want to allow for multiple patterns as well.\n",

Review comment:
       I think you're missing a "to" in this sentence
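For readers following the quoted notebook cells above: the `Create` + `FlatMap` pattern it builds up to amounts to flat-mapping a generator over the inputs. A minimal stdlib-only sketch of the same idea (no Beam involved; the names are illustrative, not the notebook's):

```python
# Stdlib-only illustration of the Create + FlatMap pattern from the
# quoted notebook cells: a generator "expands" each input into many
# elements, and flat-mapping concatenates all the results.
from itertools import chain

def count(n):
    # Yields the numbers 0..n-1, like the notebook's generator.
    for i in range(n):
        yield i

def flat_map(fn, inputs):
    # What FlatMap does conceptually: apply fn to each input, then flatten.
    return list(chain.from_iterable(fn(x) for x in inputs))

print(flat_map(count, [5]))     # [0, 1, 2, 3, 4]
print(flat_map(count, [2, 3]))  # [0, 1, 0, 1, 2]
```

In Beam, `beam.FlatMap(count)` does this flattening lazily and in parallel across workers, which is why the generator never needs to hold all elements in memory.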

##########
File path: website/www/site/content/en/get-started/tour-of-beam.md
##########
@@ -30,9 +30,18 @@ You can also [try an Apache Beam pipeline](/get-started/try-apache-beam) using t
 ### Learn the basics
 
 In this notebook we go through the basics of what is Apache Beam and how to get started.
+We learn what is a data _pipeline_, a _PCollection_, a _PTransform_, as well as some basic transforms like `Map`, `FlatMap`, `Filter`, `Combine`, and `GroupByKey`.
 
 {{< button-colab url="https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/tour-of-beam/getting-started.ipynb" >}}
 
+### Reading and writing data
+
+Here we go through some examples on how to read and write data to and from different data formats.

Review comment:
       I'd replace "here" with "In this notebook"

##########
File path: website/www/site/content/en/get-started/tour-of-beam.md
##########
@@ -30,9 +30,18 @@ You can also [try an Apache Beam pipeline](/get-started/try-apache-beam) using t
 ### Learn the basics
 
 In this notebook we go through the basics of what is Apache Beam and how to get started.
+We learn what is a data _pipeline_, a _PCollection_, a _PTransform_, as well as some basic transforms like `Map`, `FlatMap`, `Filter`, `Combine`, and `GroupByKey`.

Review comment:
       We learn about data pipelines, PCollections, and PTransforms, as well as some basic transforms like `Map`, `FlatMap`, `Filter`, `Combine`, and `GroupByKey`.
   
   Or keep them in italics.

##########
File path: examples/notebooks/tour-of-beam/reading-and-writing-data.ipynb
##########
@@ -0,0 +1,939 @@
+        "## Example: Reading CSV files\n",
+        "\n",
+        "Lets say we want to read CSV files get elements as `dict`s. We like how `ReadFromText` expands a file pattern, but we might want to allow for multiple patterns as well.\n",

Review comment:
       Replace `dict`s with "dictionary objects", "`dict` objects", or "Python dictionaries"
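For context, the `dict` objects under discussion come from parsing CSV rows. A stdlib-only sketch of what "read CSV files to get elements as `dict` objects" might look like (glob pattern expansion plus `csv.DictReader`; this is an assumption about the approach, not the notebook's actual implementation):

```python
# Hypothetical sketch: expand a glob pattern (as ReadFromText does)
# and yield each CSV row as a dict keyed by the header row.
import csv
import glob

def read_csv_rows(pattern):
    for file_name in glob.glob(pattern):
        with open(file_name, newline='') as f:
            for row in csv.DictReader(f):
                yield dict(row)
```

Because this is a generator, it could be plugged into the `Create` + `FlatMap` pattern the notebook describes; note that `csv.DictReader` leaves all values as strings, so numeric columns would still need converting.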




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org