Posted to github@beam.apache.org by "bzablocki (via GitHub)" <gi...@apache.org> on 2023/08/04 15:31:13 UTC
[GitHub] [beam] bzablocki commented on a diff in pull request #27284: Yaml API: Day Zero tutorial notebook

bzablocki commented on code in PR #27284:
URL: https://github.com/apache/beam/pull/27284#discussion_r1284568439


##########
examples/notebooks/get-started/try-apache-beam-yaml.ipynb:
##########
@@ -0,0 +1,556 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+  "colab": {
+   "name": "Try Apache Beam - Python",
+   "version": "0.3.2",
+   "provenance": [],
+   "collapsed_sections": [],
+   "toc_visible": true,
+   "include_colab_link": true
+  },
+  "kernelspec": {
+   "name": "python2",
+   "display_name": "Python 2"
+  }
+ },
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "view-in-github",
+    "colab_type": "text"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master/examples/notebooks/get-started/try-apache-beam-yaml.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "source": [
+    "#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n",
+    "\n",
+    "# Licensed to the Apache Software Foundation (ASF) under one\n",
+    "# or more contributor license agreements. See the NOTICE file\n",
+    "# distributed with this work for additional information\n",
+    "# regarding copyright ownership. The ASF licenses this file\n",
+    "# to you under the Apache License, Version 2.0 (the\n",
+    "# \"License\"); you may not use this file except in compliance\n",
+    "# with the License. You may obtain a copy of the License at\n",
+    "#\n",
+    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing,\n",
+    "# software distributed under the License is distributed on an\n",
+    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "# KIND, either express or implied. See the License for the\n",
+    "# specific language governing permissions and limitations\n",
+    "# under the License."
+   ],
+   "outputs": [],
+   "metadata": {
+    "cellView": "form"
+   }
+  },
+  {
+   "metadata": {
+    "id": "lNKIMlEDZ_Vw",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Try Apache Beam - YAML\n",
+    "\n",
+    "While Beam provides powerful APIs for authoring sophisticated data processing pipelines, it still has a high barrier for getting started and authoring simple pipelines. Even setting up the environment, installing the dependencies, and setting up the project can be an overwhelming amount of boilerplate.\n",
+    "\n",
+    "Here we provide a simple YAML syntax for describing pipelines that does not require coding experience or learning how to use an SDK&mdash;any text editor will do.\n",
+    "\n",
+    "Please note: YAML API is still EXPERIMENTAL and subject to change.\n",
+    "\n",
+    "In this notebook, we set up your development environment and write a simple pipeline using YAML. We'll run it locally, using the [DirectRunner](https://beam.apache.org/documentation/runners/direct/). You can explore other runners with the [Beam Capability Matrix](https://beam.apache.org/documentation/runners/capability-matrix/).\n",
+    "\n",
+    "To navigate through different sections, use the table of contents. From the **View** drop-down list, select **Table of contents**.\n",
+    "\n",
+    "To run a code cell, click the **Run cell** button at the top left of the cell, or select it and press **`Shift+Enter`**. Try modifying a code cell and re-running it to see what happens.\n",
+    "\n",
+    "To learn more about Colab, see [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb)."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "Fz6KSQ13_3Rr",
+    "colab_type": "text"
+   },
+   "cell_type": "markdown",
+   "source": [
+    "# Setup\n",
+    "\n",
+    "First, you need to set up your environment, which includes installing `apache-beam` and downloading files from Cloud Storage to your local file system. We'll use these files as an input to the pipelines in this guide."
+   ]
+  },
+  {
+   "metadata": {
+    "id": "GOOk81Jj_yUy",
+    "colab_type": "code",
+    "outputId": "d283dfb2-4f51-4fec-816b-f57b0cb9b71c",
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 170
+    }
+   },
+   "cell_type": "code",
+   "source": [
+    "# Run and print a shell command.\n",
+    "def run(cmd):\n",
+    "  print('>> {}'.format(cmd))\n",
+    "  !{cmd}\n",
+    "  print('')\n",
+    "\n",
+    "def save_to_file(content, file_name):\n",
+    "  with open(file_name, 'w') as f:\n",
+    "    f.write(content)\n",
+    "\n",
+    "# Install apache-beam.\n",
+    "run('pip install --quiet apache-beam')\n",
+    "\n",
+    "# Copy the input files into the local file system.\n",
+    "run('mkdir -p data')\n",
+    "run('gsutil cp gs://dataflow-samples/shakespeare/kinglear.txt data/kinglear.txt')\n",
+    "run('gsutil cp gs://apache-beam-samples/SMSSpamCollection/SMSSpamCollection data/SMSSpamCollection.csv')"
+   ],
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Inspect the data\n",
+    "We'll be working with 2 datasets. We'll use `kinglear.txt` for the first example - word count, and `SMSSpamCollection.csv` for the second and third.\n",
+    "Let's first take a look at the `kinglear.txt` dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/kinglear.txt')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This is just a `txt` file - it contains lines of text.\n",
+    "Let's take a look at the other dataset."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "run('head data/SMSSpamCollection.csv')\n",
+    "run('wc -l data/SMSSpamCollection.csv')"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "This dataset is a `csv` file with 5,574 rows and 2 tab-separated columns recording the following attributes:\n",
+    "1. `Column 1`: The label (either `ham` or `spam`)\n",
+    "2. `Column 2`: The SMS as raw text (type `string`)"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Example 1: word count\n",
+    "In this popular introductory exercise, we will build a pipeline that reads lines of text from the input dataset `kinglear.txt` and counts the number of times each word appears in the text.\n",
+    "To start, we'll create a `.yaml` file specifying our pipeline."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "pipeline = '''\n",

Review Comment:
   Raw Jupyter highlights neither the `%%writefile` version nor the string version, so I'd still opt for my solution: it renders better in IntelliJ, and it makes no difference in raw Jupyter.
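
For context, here is a minimal sketch of the string-based variant mentioned in the comment, reusing the `save_to_file` helper from the notebook's setup cell. The YAML body, the `out_path` variable, and the file name `pipeline.yaml` are placeholders chosen for illustration; the actual pipeline definition is cut off in this hunk and is not reproduced here.

```python
import os
import tempfile

# Helper as defined in the notebook's setup cell.
def save_to_file(content, file_name):
  with open(file_name, 'w') as f:
    f.write(content)

# Placeholder content only: this is NOT the notebook's real pipeline.
pipeline = '''
# placeholder YAML
key: value
'''

# Hypothetical output location, written to a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'pipeline.yaml')
save_to_file(pipeline, out_path)
```

The alternative discussed in the thread is a cell that starts with the IPython cell magic `%%writefile <file name>`, which writes the rest of the cell to that file without going through a Python string.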


